From owner-freebsd-current@FreeBSD.ORG  Sun Aug 22 18:22:18 2010
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 888BC106567A;
	Sun, 22 Aug 2010 18:22:17 +0000 (UTC) (envelope-from mike@ducky.net)
Received: from ducky.net (ducky.net [71.216.22.205])
	by mx1.freebsd.org (Postfix) with ESMTP id 4E4228FC14;
	Sun, 22 Aug 2010 18:22:16 +0000 (UTC)
Received: from localhost (localhost [127.0.0.1])
	by ducky.net (8.14.3/8.14.3) with ESMTP id o7MIMCRN050408;
	Sun, 22 Aug 2010 11:22:12 -0700 (PDT) (envelope-from mike@ducky.net)
Message-Id: <201008221822.o7MIMCRN050408@ducky.net>
X-Mailer: exmh version 2.7.2 01/07/2005 with nmh-1.2
To: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= <des@des.no>
In-reply-to: <86k4nikglg.fsf@ds4.des.no> 
References: <201008210231.o7L2VRvI031700@ducky.net> <86k4nikglg.fsf@ds4.des.no>
Comments: In-reply-to =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= <des@des.no>
	message dated "Sun, 22 Aug 2010 13:54:35 +0200."
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Date: Sun, 22 Aug 2010 11:22:12 -0700
From: Mike Haertel <mike@ducky.net>
X-Mailman-Approved-At: Sun, 22 Aug 2010 19:18:51 +0000
Cc: freebsd-current@freebsd.org, gabor@freebsd.org
Subject: Re: why GNU grep is fast 
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 22 Aug 2010 18:22:18 -0000

Dag-Erling Sm=F8rgrav <des=40des.no> writes:
> Mike Haertel <mike=40ducky.net> writes:
> > GNU grep uses the well-known Boyer-Moore algorithm, which looks
> > first for the final letter of the target string, and uses a lookup
> > table to tell it how far ahead it can skip in the input whenever
> > it finds a non-matching character.
>=20
> Boyer-Moore is for fixed search strings.  I don't see how that
> optimization can work with a regexp search unless the regexp is so
> simple that you break it down into a small number of cases with known
> length and final character.

GNU grep uses heuristics to look for a fixed string that any string
matching the regex *must* contain, and uses that fixed string as the
bases for its initial Boyer-Moore search.

For example if your regex is /foo.*bar/, the initial Boyer-Moore search
is (probably) searching for foo.

If the initial search succeeds, GNU grep isolates the containing line,
and then runs the full regex matcher on that line to make sure.

This is the sort of thing that a good regex library could do internally.

Unfortunately, you can'd do this with a library that conforms to
the =21=40=23%=24=21=40=23% POSIX regex API.

The problem is that regexec()'s interface is based on NUL-terminated
strings, rather than byte-counted buffers.  So POSIX regexec() is
necessarily character-at-a-time, because it has to look for that
input-terminating NUL byte, and also you can't use it to search binary
data that might contain NULs.  (GNU grep works fine with arbitrary
input files, as long as it can allocate enough memory to hold the
longest line.)

For these reasons a good grep implementation is pretty muched doomed
to bundle its own regex matcher.