Date: Fri, 02 Jul 2010 16:33:06 -0400
From: Lowell Gilbert <freebsd-questions-local@be-well.ilk.org>
To: Tim Daneliuk <tundra@tundraware.com>
Cc: freebsd-questions@freebsd.org
Subject: Re: 'file' Command Giving False Positives
Message-ID: <448w5taakd.fsf@be-well.ilk.org>
In-Reply-To: <4C2E3AA3.7080200@tundraware.com> (Tim Daneliuk's message of "Fri, 02 Jul 2010 14:14:43 -0500")
References: <4C2DF07F.1020509@tundraware.com> <44630xq527.fsf@be-well.ilk.org> <20100702173504.c53738b2.freebsd@edvax.de> <44r5jln3oj.fsf@be-well.ilk.org> <20100702204249.1a7423ac.freebsd@edvax.de> <4C2E3AA3.7080200@tundraware.com>

Tim Daneliuk <tundra@tundraware.com> writes:

> At this point, I'm inclined to believe that 'file' alone is
> insufficient to do this and, at best - even with more tools -
> it's going to be a probabilities game - i.e. "What percentage
> of false positives is acceptable?"

file(1) is only intended to be a set of heuristics. It has a
remarkably good set of heuristics at this point, but you're right that
this cannot be solved simply by analyzing the contents of the files.

For use in a system that you expect to scale, you will always be
better off keeping meta-data in some other form (if you can, which is
frequently not possible). If the whole data path is under your
(customer's) control, it's not so hard; you can use file names, or put
every file into a tar file along with a text file that indicates the
data type, and on and on through as many approaches as you have the
time to dream up. [If my examples are unclear, I can expand on them
to make the point better.]

This is made considerably worse by the fact that you've said that your
files are encrypted. Some forms of encryption store some meta-data at
a known place (like first) in the file, but generally this won't be
the case. Now consider that there is a finite chance of running into a
combination of cleartext, encryption, and password such that you end
up with an encrypted file that happens to have exactly the same
contents as /bin/ls (it's vanishingly unlikely that this exact
scenario would happen, but it's a good illustration of the problem).

All of which is just agreeing with your suggestion that it's a
"probabilities game" of reducing the error rate to acceptability;
UNLESS you can control some other source of information. For an
example of the latter, I have a backup file from this morning, named
"be-well.100702._usr.l2.dump.gz.idea".

If the files are coming in from the outside (untrustworthy input), you
can't do this. One thing you *could* do in that case is use a custom
magic(5) file for this application. You may well not care about input
that really is an MS-DOS executable, so you can remove the patterns
for all of them. Or AmigaOS, or laser printer firmware, or...

Anyway, good luck.
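
To expand on the tar example mentioned above: what follows is only a
minimal sketch in Python, assuming the whole data path is under your
control. The member names TYPE.txt and payload, and the functions
bundle() and unbundle(), are placeholders made up for illustration;
nothing in tar or file(1) prescribes them.

```python
import io
import tarfile
from typing import Tuple

# Hypothetical member names; any fixed convention agreed on by both
# ends of the data path would work equally well.
META_NAME = "TYPE.txt"
DATA_NAME = "payload"


def bundle(payload: bytes, data_type: str) -> bytes:
    """Return a tar archive holding the payload plus a text file naming its type."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        members = ((META_NAME, data_type.encode("utf-8")), (DATA_NAME, payload))
        for name, content in members:
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


def unbundle(archive: bytes) -> Tuple[str, bytes]:
    """Return (declared type, payload) without inspecting the payload bytes."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        data_type = tar.extractfile(META_NAME).read().decode("utf-8")
        payload = tar.extractfile(DATA_NAME).read()
    return data_type, payload
```

The reader takes the type from the text file and never guesses from
the payload, so an encrypted payload that happens to look like an
executable is still labelled correctly. This only helps when the
sender is trusted to write the label honestly.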
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?448w5taakd.fsf>