From owner-freebsd-current@FreeBSD.ORG  Sun Aug 22 15:26:19 2010
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id E0F63106564A
	for <current@freebsd.org>; Sun, 22 Aug 2010 15:26:19 +0000 (UTC)
	(envelope-from wollman@hergotha.csail.mit.edu)
Received: from hergotha.csail.mit.edu (hergotha.csail.mit.edu [66.92.79.170])
	by mx1.freebsd.org (Postfix) with ESMTP id 928CF8FC17
	for <current@freebsd.org>; Sun, 22 Aug 2010 15:26:19 +0000 (UTC)
Received: from hergotha.csail.mit.edu (localhost [127.0.0.1])
	by hergotha.csail.mit.edu (8.14.4/8.14.4) with ESMTP id o7MF2bvm086739; 
	Sun, 22 Aug 2010 11:02:37 -0400 (EDT)
	(envelope-from wollman@hergotha.csail.mit.edu)
Received: (from wollman@localhost)
	by hergotha.csail.mit.edu (8.14.4/8.14.4/Submit) id o7MF2bfZ086738;
	Sun, 22 Aug 2010 11:02:37 -0400 (EDT) (envelope-from wollman)
Date: Sun, 22 Aug 2010 11:02:37 -0400 (EDT)
From: Garrett Wollman <wollman@hergotha.csail.mit.edu>
Message-Id: <201008221502.o7MF2bfZ086738@hergotha.csail.mit.edu>
To: des@des.no
In-Reply-To: <86k4nikglg.fsf@ds4.des.no>
References: <201008210231.o7L2VRvI031700@ducky.net> <86k4nikglg.fsf@ds4.des.no>
Organization: 
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.5
	(hergotha.csail.mit.edu [127.0.0.1]);
	Sun, 22 Aug 2010 11:02:37 -0400 (EDT)
X-Spam-Status: No, score=-1.0 required=5.0 tests=ALL_TRUSTED
	autolearn=disabled version=3.3.1
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	hergotha.csail.mit.edu
X-Mailman-Approved-At: Sun, 22 Aug 2010 15:41:38 +0000
Cc: current@freebsd.org
Subject: Re: why GNU grep is fast
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 22 Aug 2010 15:26:20 -0000

In article <86k4nikglg.fsf@ds4.des.no> you write:
>Mike Haertel <mike@ducky.net> writes:
>> GNU grep uses the well-known Boyer-Moore algorithm, which looks
>> first for the final letter of the target string, and uses a lookup
>> table to tell it how far ahead it can skip in the input whenever
>> it finds a non-matching character.
>
>Boyer-Moore is for fixed search strings.  I don't see how that
>optimization can work with a regexp search unless the regexp is so
>simple that you break it down into a small number of cases with known
>length and final character.

The common case of regexps used in the "grep" utility (and, for
obvious reasons, universal in the "fgrep" utility) is fixed-length
search strings.  Even non-fixed-length regexps typically consist of
one one or two variable-length parts.  Matching a completely
variable-length regexp is just hard, computationally, so it's OK for
it to be slower.  There are other tricks you can do, such as
turning the anchors ^ and $ into explicit newlines in your search --
"^foo" is a very common regexp to search for, and it's really a
fixed-string search for "\nfoo" which is entirely amenable to the B-M
treatment.  You just have to remember that a matched newline isn't
part of the result.

The GNU regexp library also uses the Boyer-Moore (or is it
Boyer-Moore-Gosper?) strategy.

-GAWollman