From owner-freebsd-current@FreeBSD.ORG Sun Aug 22 18:22:18 2010 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 888BC106567A; Sun, 22 Aug 2010 18:22:17 +0000 (UTC) (envelope-from mike@ducky.net) Received: from ducky.net (ducky.net [71.216.22.205]) by mx1.freebsd.org (Postfix) with ESMTP id 4E4228FC14; Sun, 22 Aug 2010 18:22:16 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by ducky.net (8.14.3/8.14.3) with ESMTP id o7MIMCRN050408; Sun, 22 Aug 2010 11:22:12 -0700 (PDT) (envelope-from mike@ducky.net) Message-Id: <201008221822.o7MIMCRN050408@ducky.net> X-Mailer: exmh version 2.7.2 01/07/2005 with nmh-1.2 To: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= In-reply-to: <86k4nikglg.fsf@ds4.des.no> References: <201008210231.o7L2VRvI031700@ducky.net> <86k4nikglg.fsf@ds4.des.no> Comments: In-reply-to =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= message dated "Sun, 22 Aug 2010 13:54:35 +0200." Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Date: Sun, 22 Aug 2010 11:22:12 -0700 From: Mike Haertel X-Mailman-Approved-At: Sun, 22 Aug 2010 19:18:51 +0000 Cc: freebsd-current@freebsd.org, gabor@freebsd.org Subject: Re: why GNU grep is fast X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 22 Aug 2010 18:22:18 -0000 Dag-Erling Sm=F8rgrav writes: > Mike Haertel writes: > > GNU grep uses the well-known Boyer-Moore algorithm, which looks > > first for the final letter of the target string, and uses a lookup > > table to tell it how far ahead it can skip in the input whenever > > it finds a non-matching character. >=20 > Boyer-Moore is for fixed search strings. I don't see how that > optimization can work with a regexp search unless the regexp is so > simple that you break it down into a small number of cases with known > length and final character. GNU grep uses heuristics to look for a fixed string that any string matching the regex *must* contain, and uses that fixed string as the bases for its initial Boyer-Moore search. For example if your regex is /foo.*bar/, the initial Boyer-Moore search is (probably) searching for foo. If the initial search succeeds, GNU grep isolates the containing line, and then runs the full regex matcher on that line to make sure. This is the sort of thing that a good regex library could do internally. Unfortunately, you can'd do this with a library that conforms to the =21=40=23%=24=21=40=23% POSIX regex API. The problem is that regexec()'s interface is based on NUL-terminated strings, rather than byte-counted buffers. So POSIX regexec() is necessarily character-at-a-time, because it has to look for that input-terminating NUL byte, and also you can't use it to search binary data that might contain NULs. (GNU grep works fine with arbitrary input files, as long as it can allocate enough memory to hold the longest line.) For these reasons a good grep implementation is pretty muched doomed to bundle its own regex matcher.