From: "Jack L. Stone" <jacks@sage-american.com>
Date: Fri, 13 Aug 2010 08:47:38 -0500
To: Chip Camden, freebsd-questions@freebsd.org
Subject: Re: Grepping a list of words

At 10:56 AM 8.12.2010 -0700, Chip Camden wrote:
>Quoth Anonymous on Thursday, 12 August 2010:
>> Oliver Fromme writes:
>>
>> > John Levine wrote:
>> > > > > % egrep 'word1|word2|word3|...|wordn' filename.txt
>> > > >
>> > > > Thanks for the replies. This suggestion won't do the job, as the list of
>> > > > words is very long, maybe 50-60. This is why I asked how to place them all
>> > > > in a file. One reply dealt with using a file with egrep. I'll try that.
>> > >
>> > > Gee, 50 words, that's about a 300-character pattern; that's not a problem
>> > > for any shell or version of grep I know.
>> > >
>> > > But reading the words from a file is equivalent and, as you note, most
>> > > likely easier to do.
>> >
>> > The question is which is more efficient.
>> > This might be important if that kind of grep command is run very often
>> > by a script, or if it's run on very large files.
>> >
>> > My guess is that one large regular expression is more
>> > efficient than many small ones. But I haven't done real
>> > benchmarks to prove this.
>>
>> BTW, not using regular expressions is even more efficient, e.g.
>>
>> $ fgrep -f /usr/share/dict/words /etc/group
>>
>> When using egrep(1) it takes considerably more time and memory.
>
>Having written a regex engine myself, I can see why. Though I'm sure
>egrep is highly optimized, even the most optimized DFA table is going to
>take more cycles to navigate than a simple string comparison. Not to
>mention the initial overhead of parsing the regex and building that table.
>
>--
>Sterling (Chip) Camden | sterling@camdensoftware.com | 2048D/3A978E4F

Many thanks for all of the suggestions. I found this worked very well,
ignoring concerns about use of resources:

egrep -i -o -w -f word.file main.file

The only thing it didn't do for me was the next step. My final objective
was really to determine the words in the "word.file" that were NOT in the
"main.file." I figured finding the matches would be the easy part, and that
I could then run a sort|uniq comparison to determine the "new words" not
yet in the main.file.

Since I will need to run this check frequently, any suggestions for a
better approach are welcome.

Thanks again...

Jack

(^_^)
Happy trails,
Jack L. Stone
System Admin
Sage-American
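[A sketch of the comparison step described above, using the egrep command
from the thread plus comm(1). The file names word.file and main.file are the
ones mentioned; the sample contents here are made up for illustration, and it
assumes the entries in word.file are already lowercase, one per line.]

```shell
# Example inputs (illustrative only):
printf 'apple\nbanana\ncherry\n' > word.file
printf 'I ate an APPLE and a cherry pie\n' > main.file

# 1) Extract the words that DO match, one per line; lowercase and
#    dedupe them so they line up with the sorted word list:
egrep -i -o -w -f word.file main.file | tr 'A-Z' 'a-z' | sort -u > found.txt

# 2) Words present in word.file but absent from found.txt are the
#    "new words" not yet in main.file:
sort -u word.file | comm -23 - found.txt
# prints: banana
```

comm -23 suppresses the lines common to both inputs and the lines unique to
the second, leaving only the words with no match; both inputs must be sorted,
which is why sort -u is applied on each side.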