From: Lowell Gilbert
To: Tim Daneliuk
Cc: freebsd-questions@freebsd.org
Subject: Re: 'file' Command Giving False Positives
Date: Fri, 02 Jul 2010 16:33:06 -0400
Message-ID: <448w5taakd.fsf@be-well.ilk.org>
In-Reply-To: <4C2E3AA3.7080200@tundraware.com> (Tim Daneliuk's message of "Fri, 02 Jul 2010 14:14:43 -0500")

Tim Daneliuk writes:

> At this point, I'm inclined to believe that 'file' alone is
> insufficient to do this and, at best - even with more tools -
> it's going to be a probabilities game - i.e. "What percentage
> of false positives is acceptable?"

file(1) is only intended to be a set of heuristics. It has a
remarkably good set of heuristics at this point, but you're right
that this cannot be solved simply by analyzing the contents of the
files.

For use in a system that you expect to scale, you will always be
better off keeping meta-data in some other form (if you can, which is
frequently not possible). If the whole data path is under your
(customer's) control, it's not so hard; you can use file names, or put
every file into a tar file along with a text file that indicates the
data type, and on and on through as many approaches as you have the
time to dream up. [If my examples are unclear, I can expand on them
to make the point better.]

This is made considerably worse by the fact that you've said that your
files are encrypted. Some forms of encryption store some meta-data at
a known place (like first) in the file, but generally this won't be the
case. Now consider that there is a nonzero chance of running into a
combination of cleartext, encryption, and password such that you end up
with an encrypted file that happens to have exactly the same contents
as /bin/ls (it's vanishingly unlikely that this exact scenario would
happen, but it's a good illustration of the problem).

All of which is just agreeing with your suggestion that it's a
"probabilities game" of reducing the error rate to acceptability;
UNLESS you can control some other source of information. For an
example of the latter, I have a backup file from this morning, named
"be-well.100702._usr.l2.dump.gz.idea".

If the files are coming in from the outside (untrustworthy input), you
can't do this. One thing you *could* do in that case is use a custom
magic(5) file for this application. You may well not care about input
that really is an MS-DOS executable, so you can remove the patterns for
all of them. Or AmigaOS, or laser printer firmware, or...

Anyway, good luck.
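
To make the tar-file example above a bit more concrete, here is a
minimal sketch in Python. It is only one way to do it: the member
names "TYPE" and "payload" and the label string are made up for
illustration, and nothing in file(1) or tar prescribes them. The point
is that the data type travels next to the data instead of being guessed
from its bytes.

```python
import io
import tarfile

# Hypothetical member names; any convention works if both ends agree on it.
TYPE_MEMBER = "TYPE"        # small text file naming the data type
PAYLOAD_MEMBER = "payload"  # the (possibly encrypted) data itself


def _add_bytes(tar, name, data):
    """Add an in-memory byte string to the archive under the given name."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def bundle(payload, data_type):
    """Return tar archive bytes holding the payload plus a text file naming its type."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        _add_bytes(tar, TYPE_MEMBER, data_type.encode("ascii") + b"\n")
        _add_bytes(tar, PAYLOAD_MEMBER, payload)
    return buf.getvalue()


def unbundle(archive):
    """Return (data_type, payload) read back from bytes produced by bundle()."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        data_type = tar.extractfile(TYPE_MEMBER).read().decode("ascii").strip()
        payload = tar.extractfile(PAYLOAD_MEMBER).read()
    return data_type, payload


if __name__ == "__main__":
    # Payload bytes that happen to start like an ELF executable; the label
    # is what the receiver trusts, not the leading bytes.
    archive = bundle(b"\x7fELF\x02\x01\x01", "idea-encrypted-dump")
    print(unbundle(archive)[0])
```

The receiver reads the label and never has to inspect the payload, so a
payload that happens to look like an executable does no harm. As noted
above, this only helps when you control the whole data path.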