Date: Sun, 22 Aug 2010 11:22:12 -0700 From: Mike Haertel <mike@ducky.net> To: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= <des@des.no> Cc: freebsd-current@freebsd.org, gabor@freebsd.org Subject: Re: why GNU grep is fast Message-ID: <201008221822.o7MIMCRN050408@ducky.net> In-Reply-To: <86k4nikglg.fsf@ds4.des.no> References: <201008210231.o7L2VRvI031700@ducky.net> <86k4nikglg.fsf@ds4.des.no>
next in thread | previous in thread | raw e-mail | index | archive | help
Dag-Erling Sm=F8rgrav <des=40des.no> writes: > Mike Haertel <mike=40ducky.net> writes: > > GNU grep uses the well-known Boyer-Moore algorithm, which looks > > first for the final letter of the target string, and uses a lookup > > table to tell it how far ahead it can skip in the input whenever > > it finds a non-matching character. >=20 > Boyer-Moore is for fixed search strings. I don't see how that > optimization can work with a regexp search unless the regexp is so > simple that you break it down into a small number of cases with known > length and final character. GNU grep uses heuristics to look for a fixed string that any string matching the regex *must* contain, and uses that fixed string as the bases for its initial Boyer-Moore search. For example if your regex is /foo.*bar/, the initial Boyer-Moore search is (probably) searching for foo. If the initial search succeeds, GNU grep isolates the containing line, and then runs the full regex matcher on that line to make sure. This is the sort of thing that a good regex library could do internally. Unfortunately, you can'd do this with a library that conforms to the =21=40=23%=24=21=40=23% POSIX regex API. The problem is that regexec()'s interface is based on NUL-terminated strings, rather than byte-counted buffers. So POSIX regexec() is necessarily character-at-a-time, because it has to look for that input-terminating NUL byte, and also you can't use it to search binary data that might contain NULs. (GNU grep works fine with arbitrary input files, as long as it can allocate enough memory to hold the longest line.) For these reasons a good grep implementation is pretty muched doomed to bundle its own regex matcher.
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?201008221822.o7MIMCRN050408>