Date: Sun, 22 Aug 2010 10:20:03 -0700 From: Tim Kientzle <tim@kientzle.com> To: Sean C. Farley <scf@FreeBSD.org> Cc: =?iso-8859-1?Q?Dag-Erling_Sm=F8rgrav?= <des@des.no>, freebsd-current@FreeBSD.org, Mike Haertel <mike@ducky.net>, gabor@FreeBSD.org Subject: Re: why GNU grep is fast Message-ID: <628366E1-AF71-4A22-95AF-BC77A21C21A8@kientzle.com> In-Reply-To: <alpine.BSF.2.00.1008221111300.1989@terminus> References: <201008210231.o7L2VRvI031700@ducky.net> <86k4nikglg.fsf@ds4.des.no> <alpine.BSF.2.00.1008221111300.1989@terminus>
next in thread | previous in thread | raw e-mail | index | archive | help
On Aug 22, 2010, at 9:30 AM, Sean C. Farley wrote: > On Sun, 22 Aug 2010, Dag-Erling Sm=F8rgrav wrote: >> Mike Haertel <mike@ducky.net> writes: >>> GNU grep uses the well-known Boyer-Moore algorithm, which looks = first for the final letter of the target string, and uses a lookup table = to tell it how far ahead it can skip in the input whenever it finds a = non-matching character. >>=20 >> Boyer-Moore is for fixed search strings. I don't see how that = optimization can work with a regexp search unless the regexp is so = simple that you break it down into a small number of cases with known = length and final character. >=20 > When I was working on making FreeGrep faster (years ago), I wrote down = a few notes about possible algorithms, especially those that could be = useful for fgrep functionality. I am just passing these onto the list. >=20 > Some algorithms: > 1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm > 2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm > 3. GNU fgrep: Commentz-Walter > 4. GLIMPSE: http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore = variant) >=20 > Also, this may be a useful book: > http://www.dcc.uchile.cl/~gnavarro/FPMbook/ And of course, Russ Cox' excellent series of articles starting at: http://swtch.com/~rsc/regexp/regexp1.html Later on, he summarizes some of the existing implementations, including comments about the Plan 9 implementation and his own RE2, both of which efficiently handle international text (which seems to be a major concern of Gabor's). The key comment in Mike's GNU grep notes is the one about not breaking into lines. That's simply double-scanning the input; instead, run the matcher over blocks of text and, when it finds a match, work backwards from the match to find the appropriate line beginning. This is efficient because most lines don't match. Boyer-Moore is great for fixed strings (a very common use case for grep) and for more complex patterns that contain long fixed strings (helps to discard most lines early). Sophisticated regex matchers implement a number of strategies and choose different ones depending on the pattern. In the case of bsdgrep, it might make sense to use the regex library for the general case but implement a hand-tuned search for fixed strings that can be heavily optimized for that case. Of course, international text support complicates the picture; you have to consider the input character set (if you want to auto-detect Unicode encodings by looking for leading BOMs, for example, you either need to translate the fixed-string pattern to match the input encoding or vice-versa). Tim
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?628366E1-AF71-4A22-95AF-BC77A21C21A8>