From: "Jack L. Stone" <jacks@sage-american.com>
Date: Fri, 13 Aug 2010 08:47:38 -0500
To: Chip Camden, freebsd-questions@freebsd.org
Subject: Re: Grepping a list of words

At 10:56 AM 8.12.2010 -0700, Chip Camden wrote:
>Quoth Anonymous on Thursday, 12 August 2010:
>> Oliver Fromme writes:
>>
>> > John Levine wrote:
>> > > > > % egrep 'word1|word2|word3|...|wordn' filename.txt
>> > > >
>> > > > Thanks for the replies. This suggestion won't do the job, as the list of
>> > > > words is very long, maybe 50-60. This is why I asked how to place them all
>> > > > in a file. One reply dealt with using a file with egrep. I'll try that.
>> > >
>> > > Gee, 50 words, that's about a 300-character pattern; that's not a problem
>> > > for any shell or version of grep I know.
>> > >
>> > > But reading the words from a file is equivalent and, as you note, most
>> > > likely easier to do.
>> >
>> > The question is which is more efficient.
>> > This might be important if that kind of grep command is run very often
>> > by a script, or if it's run on very large files.
>> >
>> > My guess is that one large regular expression is more
>> > efficient than many small ones. But I haven't done real
>> > benchmarks to prove this.
>>
>> BTW, not using regular expressions is even more efficient, e.g.
>>
>> $ fgrep -f /usr/share/dict/words /etc/group
>>
>> When using egrep(1) it takes considerably more time and memory.
>
>Having written a regex engine myself, I can see why. Though I'm sure
>egrep is highly optimized, even the most optimized DFA table is going to
>take more cycles to navigate than a simple string comparison. Not to
>mention the initial overhead of parsing the regex and building that table.
>
>--
>Sterling (Chip) Camden | sterling@camdensoftware.com | 2048D/3A978E4F

Many thanks for all of the suggestions. I found this worked very well,
ignoring concerns about use of resources:

egrep -i -o -w -f word.file main.file

The only thing it didn't do for me was the next step. My final objective
was really to determine the words in the "word.file" that were NOT in the
"main.file." I figured finding the matches would be the easy part, and that
I could then run a sort|uniq comparison to determine the "new words" not
yet in the main.file.

Since I will need to run this check frequently, any suggestions for a
better approach are welcome.

Thanks again...

Jack

(^_^)
Happy trails,
Jack L. Stone
System Admin
Sage-American
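[A sketch of the comparison step described above, using the egrep command
from the thread plus comm(1). The file names word.file and main.file are the
ones mentioned; the sample contents here are made up for illustration, and it
assumes the entries in word.file are already lowercase, one per line.]

```shell
# Example inputs (illustrative only):
printf 'apple\nbanana\ncherry\n' > word.file
printf 'I ate an APPLE and a cherry pie\n' > main.file

# 1) Extract the words that DO match, one per line; lowercase and
#    dedupe them so they line up with the sorted word list:
egrep -i -o -w -f word.file main.file | tr 'A-Z' 'a-z' | sort -u > found.txt

# 2) Words present in word.file but absent from found.txt are the
#    "new words" not yet in main.file:
sort -u word.file | comm -23 - found.txt
# prints: banana
```

comm -23 suppresses the lines common to both inputs and the lines unique to
the second, leaving only the words with no match; both inputs must be sorted,
which is why sort -u is applied on each side.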