Date:      Fri, 02 Jul 2010 16:33:06 -0400
From:      Lowell Gilbert <freebsd-questions-local@be-well.ilk.org>
To:        Tim Daneliuk <tundra@tundraware.com>
Cc:        freebsd-questions@freebsd.org
Subject:   Re: 'file' Command Giving False Positives
Message-ID:  <448w5taakd.fsf@be-well.ilk.org>
In-Reply-To: <4C2E3AA3.7080200@tundraware.com> (Tim Daneliuk's message of "Fri, 02 Jul 2010 14:14:43 -0500")
References:  <4C2DF07F.1020509@tundraware.com> <44630xq527.fsf@be-well.ilk.org> <20100702173504.c53738b2.freebsd@edvax.de> <44r5jln3oj.fsf@be-well.ilk.org> <20100702204249.1a7423ac.freebsd@edvax.de> <4C2E3AA3.7080200@tundraware.com>

Tim Daneliuk <tundra@tundraware.com> writes:

> At this point, I'm inclined to believe that 'file' alone is
> insufficient to do this and, at best - even with more tools -
> it's going to be a probabilities game - i.e. "What percentage
> of false positives is acceptable?"

file(1) is only intended to be a set of heuristics.  It has a remarkably
good set of heuristics at this point, but you're right that this cannot
be solved simply by analyzing the contents of the files.  For use in a
system that you expect to scale, you will always be better off keeping
meta-data in some other form (if you can, which is frequently not
possible).  If the whole data path is under your (customer's) control,
it's not so hard; you can use file names, or put every file into a tar
file along with a text file that indicates the data type, and on and on
through as many approaches as you have the time to dream up.  [If my
examples are unclear, I can expand on them to make the point better.]
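To expand on the tar example: a minimal sketch in Python, assuming the sender controls packaging. The member name "TYPE.txt", the function names, and the type label are all made up for illustration; any agreed convention would do.

```python
import io
import tarfile

TYPE_MEMBER = "TYPE.txt"  # hypothetical name for the text file carrying the type


def pack_with_type(payload: bytes, payload_name: str, data_type: str) -> bytes:
    """Return a tar archive holding the payload plus a text file naming its type."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        members = ((TYPE_MEMBER, data_type.encode("utf-8")), (payload_name, payload))
        for name, data in members:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_declared_type(archive: bytes) -> str:
    """Read the declared type; the payload itself is never inspected."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        member = tar.extractfile(TYPE_MEMBER)
        return member.read().decode("utf-8").strip()
```

The receiver calls read_declared_type() and never has to guess from the payload bytes, which is the whole point: the type travels next to the data instead of being inferred from it.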

This is made considerably worse by the fact that you've said that your
files are encrypted.  Some forms of encryption store some meta-data at a
known place (like first) in the file, but generally this won't be the
case.  Now consider that there is a nonzero chance of running into a
combination of cleartext, encryption, and password such that you end up
with an encrypted file that happens to have exactly the same contents as
/bin/ls (it's vanishingly unlikely that this exact scenario would
happen, but it's a good illustration of the problem).

All of which is just agreeing with your suggestion that it's a
"probabilities game" of reducing the error rate to acceptability; UNLESS
you can control some other source of information.  For an example of the
latter, I have a backup file from this morning, named
"be-well.100702._usr.l2.dump.gz.idea"; the name alone tells me what the
file is.  If the files are coming in from the outside (untrustworthy
input), you can't rely on that.  One thing you
*could* do in that case is use a custom magic(5) file for this
application.  You may well not care about input that really is an MS-DOS
executable, so you can remove the patterns for all of them.  Or AmigaOS,
or laser printer firmware, or...
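The effect of a trimmed-down pattern set can be sketched in Python. This is not magic(5) syntax and not how file(1) is implemented; it only illustrates the idea that fewer patterns means fewer ways to match by accident. The particular formats in the list are an assumption about what an application might care about; the byte signatures are the usual ones for those formats.

```python
# Hypothetical allow-list: only the formats this application expects.
# Anything not listed (MS-DOS executables, AmigaOS binaries, ...) can
# never be reported, so it can never be a false positive.
SIGNATURES = [
    (b"\x1f\x8b", "gzip"),
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
]


def classify(data: bytes) -> str:
    """Return the first allow-listed type whose signature prefixes data."""
    for prefix, name in SIGNATURES:
        if data.startswith(prefix):
            return name
    return "unknown"
```

Random (or encrypted) bytes that happen to start with "MZ" now come back as "unknown" rather than as a DOS executable.  With file(1) itself, the equivalent step is to point it at your own pattern file with the -m option.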

Anyway, good luck.


