From owner-freebsd-questions@FreeBSD.ORG  Fri Jul  2 20:33:10 2010
From: Lowell Gilbert <freebsd-questions-local@be-well.ilk.org>
To: Tim Daneliuk <tundra@tundraware.com>
References: <4C2DF07F.1020509@tundraware.com> <44630xq527.fsf@be-well.ilk.org>
	<20100702173504.c53738b2.freebsd@edvax.de>
	<44r5jln3oj.fsf@be-well.ilk.org>
	<20100702204249.1a7423ac.freebsd@edvax.de>
	<4C2E3AA3.7080200@tundraware.com>
Date: Fri, 02 Jul 2010 16:33:06 -0400
In-Reply-To: <4C2E3AA3.7080200@tundraware.com> (Tim Daneliuk's message of
	"Fri, 02 Jul 2010 14:14:43 -0500")
Message-ID: <448w5taakd.fsf@be-well.ilk.org>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (berkeley-unix)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: freebsd-questions@freebsd.org
Subject: Re: 'file' Command Giving False Positives

Tim Daneliuk <tundra@tundraware.com> writes:

> At this point, I'm inclined to believe that 'file' alone is
> insufficient to do this and, at best - even with more tools -
> it's going to be a probabilities game - i.e. "What percentage
> of false positives is acceptable?"

file(1) is only intended to be a set of heuristics.  It has a remarkably
good set at this point, but you're right that the problem cannot be
solved just by analyzing the contents of the files.  For use in a system
that you expect to scale, you will always be better off keeping the
meta-data in some other form (when you can; frequently you can't).  If
the whole data path is under your (customer's) control, it's not so
hard: you can encode the type in the file names, or put every file into
a tar file along with a text file that indicates the data type, and on
and on through as many approaches as you have the time to dream up.  [If
my examples are unclear, I can expand on them to make the point better.]
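
To make the tar-file idea concrete, here is a minimal sketch in Python.
The sidecar name "TYPE.txt", the function names, and the type label are
all made up for illustration; any agreed-upon convention would do.

```python
import io
import tarfile

# Hypothetical convention: every archive carries a small text member
# named TYPE.txt that states what the payload is.
SIDECAR = "TYPE.txt"


def pack_with_type(payload: bytes, payload_name: str, data_type: str) -> bytes:
    """Return a tar archive holding the payload plus a type sidecar."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        members = ((SIDECAR, data_type.encode("ascii")), (payload_name, payload))
        for name, data in members:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_type(archive: bytes) -> str:
    """Read the declared type back without inspecting the payload at all."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        member = tar.extractfile(SIDECAR)
        return member.read().decode("ascii")
```

The receiving side never has to guess: it trusts the sidecar, which only
works because the sender is under your control.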

This is made considerably worse by the fact that you've said that your
files are encrypted.  Some forms of encryption store some meta-data at a
known place (such as the start) in the file, but generally this won't be
the case.  Now consider that there is a nonzero chance of hitting a
combination of cleartext, encryption scheme, and password such that you
end up with an encrypted file that happens to have exactly the same
contents as /bin/ls.  (It's vanishingly unlikely that this exact
scenario would happen, but it's a good illustration of the problem.)
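
A rough way to put numbers on this, sketched in Python.  It assumes the
ciphertext looks like uniformly random bytes and that each pattern is a
fixed byte string tested at offset 0; real magic(5) patterns are richer
than that, so treat the result as a ballpark only.  The function names
are mine, not anything from file(1).

```python
from typing import Sequence


def chance_of_magic_match(magic_len: int) -> float:
    """Chance that random bytes begin with one specific magic string."""
    return 256.0 ** -magic_len


def expected_false_positives(n_files: int, magic_lengths: Sequence[int]) -> float:
    """Upper bound on the expected number of random files matching any pattern.

    Summing per-pattern chances (a union bound) can overcount when one
    pattern is a prefix of another, so this never underestimates.
    """
    return n_files * sum(chance_of_magic_match(n) for n in magic_lengths)
```

A two-byte signature such as the "MZ" of a DOS executable matches random
data once in 65536 tries, so ten million encrypted files would be
expected to produce about 150 bogus "executables" from that one pattern.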

All of which is just agreeing with your suggestion that it's a
"probabilities game" of reducing the error rate to an acceptable level,
UNLESS you can control some other source of information.  For an example
of the latter, I have a backup file from this morning, named
"be-well.100702._usr.l2.dump.gz.idea"; the name itself says what is
inside.  If the files are coming in from the outside (untrustworthy
input), you can't rely on that.  One thing you *could* do in that case
is use a custom magic(5) file for this application.  You may well not
care about input that really is an MS-DOS executable, so you can remove
all of the patterns for those.  Or AmigaOS, or laser printer firmware,
or...
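
The effect of trimming the pattern set can be sketched with a toy
sniffer in Python.  This is a stand-in for illustration, not magic(5)
syntax: the table, labels, and function names are invented here, and it
only handles fixed signatures at offset 0.  With the real tool you would
edit a copy of the magic file and point file(1) at it with -m.

```python
from typing import Dict, Optional, Set

# A tiny illustrative table of leading-byte signatures.
FULL_TABLE: Dict[str, bytes] = {
    "gzip": b"\x1f\x8b",
    "ELF": b"\x7fELF",
    "PDF": b"%PDF-",
    "DOS executable": b"MZ",
}


def sniff(data: bytes, table: Dict[str, bytes]) -> Optional[str]:
    """Return the first label whose signature starts the data, else None."""
    for label, magic in table.items():
        if data.startswith(magic):
            return label
    return None


def restrict(table: Dict[str, bytes], wanted: Set[str]) -> Dict[str, bytes]:
    """Keep only the patterns this application actually cares about."""
    return {label: magic for label, magic in table.items() if label in wanted}
```

An encrypted blob that happens to start with "MZ" is reported as a DOS
executable against the full table, but falls through to "unknown" once
the table is cut down to, say, gzip and PDF.  Fewer patterns means fewer
ways to be wrong.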

Anyway, good luck.