Date: Fri, 13 Aug 2010 08:47:38 -0500
From: "Jack L. Stone" <jacks@sage-american.com>
To: Chip Camden <sterling@camdensoftware.com>, freebsd-questions@freebsd.org
Subject: Re: Grepping a list of words
Message-ID: <3.0.1.32.20100813084738.00ee5c48@sage-american.com>
In-Reply-To: <20100812175614.GJ20504@libertas.local.camdensoftware.com>
References: <867hjv92r2.fsf@gmail.com> <20100812153535.61549.qmail@joyce.lan> <201008121644.o7CGiflh099466@lurza.secnetix.de> <867hjv92r2.fsf@gmail.com>
At 10:56 AM 8.12.2010 -0700, Chip Camden wrote:
>Quoth Anonymous on Thursday, 12 August 2010:
>> Oliver Fromme <olli@lurza.secnetix.de> writes:
>>
>> > John Levine <johnl@iecc.com> wrote:
>> > > > > % egrep 'word1|word2|word3|...|wordn' filename.txt
>> > > >
>> > > > Thanks for the replies. This suggestion won't do the job as the
>> > > > list of words is very long, maybe 50-60. This is why I asked how
>> > > > to place them all in a file. One reply dealt with using a file
>> > > > with egrep. I'll try that.
>> > >
>> > > Gee, 50 words, that's about a 300 character pattern, that's not a
>> > > problem for any shell or version of grep I know.
>> > >
>> > > But reading the words from a file is equivalent and as you note
>> > > most likely easier to do.
>> >
>> > The question is what is more efficient. This might be important if
>> > that kind of grep command is run very often by a script, or if it's
>> > run on very large files.
>> >
>> > My guess is that one large regular expression is more efficient
>> > than many small ones. But I haven't done real benchmarks to prove
>> > this.
>>
>> BTW, not using regular expressions is even more efficient, e.g.
>>
>> $ fgrep -f /usr/share/dict/words /etc/group
>>
>> When using egrep(1) it takes considerably more time and memory.
>
>Having written a regex engine myself, I can see why. Though I'm sure
>egrep is highly optimized, even the most optimized DFA table is going
>to take more cycles to navigate than a simple string comparison. Not
>to mention the initial overhead of parsing the regex and building that
>table.
>
>--
>Sterling (Chip) Camden | sterling@camdensoftware.com | 2048D/3A978E4F

Many thanks for all of the suggestions. I found this worked very well,
ignoring concerns about use of resources:

    egrep -i -o -w -f word.file main.file

The only thing it didn't do for me was the next step. My final
objective was really to determine the words in the "word.file" that
were not in the "main.file". I figured finding the matches would be
easy, and that I could then run a sort|uniq comparison to determine
the "new words" not yet in the main.file.

Since I will need to run this check frequently, any suggestions for a
better approach are welcome.

Thanks again...

Jack (^_^)

Happy trails,
Jack L. Stone
System Admin
Sage-American
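P.S. For the archives: here is a rough, untested sketch of that
comparison step, using comm(1) on the sorted lists rather than
sort|uniq. It assumes word.file holds one plain word per line (no
regex metacharacters); found.words is just a scratch file:

    # Collect the words that actually occur in main.file; lowercase
    # both lists so that case-insensitive (-i) matches compare equal.
    egrep -i -o -w -f word.file main.file | \
        tr '[:upper:]' '[:lower:]' | sort -u > found.words

    # comm -23 prints only the lines unique to its first input, i.e.
    # the words from word.file that never appeared in main.file.
    tr '[:upper:]' '[:lower:]' < word.file | sort -u | \
        comm -23 - found.words

    rm found.words

Both inputs to comm must be sorted, which the sort -u calls take care
of; corrections welcome if there is a leaner way.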