From owner-freebsd-current@FreeBSD.ORG  Mon Aug 23 01:37:15 2010
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id E98031065679;
	Mon, 23 Aug 2010 01:37:14 +0000 (UTC) (envelope-from scf@FreeBSD.org)
Received: from mail.farley.org (mail.farley.org [IPv6:2001:470:1f0f:20:2::11])
	by mx1.freebsd.org (Postfix) with ESMTP id 96D3C8FC12;
	Mon, 23 Aug 2010 01:37:14 +0000 (UTC)
Received: from thor.farley.org (HPooka@thor.farley.org
	[IPv6:2001:470:1f0f:20:1::5])
	by mail.farley.org (8.14.4/8.14.4) with ESMTP id o7N1b9Lf071486;
	Sun, 22 Aug 2010 20:37:09 -0500 (CDT) (envelope-from scf@FreeBSD.org)
Date: Sun, 22 Aug 2010 20:37:09 -0500 (CDT)
From: "Sean C. Farley" <scf@FreeBSD.org>
To: Tim Kientzle <tim@kientzle.com>
In-Reply-To: <628366E1-AF71-4A22-95AF-BC77A21C21A8@kientzle.com>
Message-ID: <alpine.BSF.2.00.1008222030080.93799@thor.farley.org>
References: <201008210231.o7L2VRvI031700@ducky.net> <86k4nikglg.fsf@ds4.des.no>
	<alpine.BSF.2.00.1008221111300.1989@terminus>
	<628366E1-AF71-4A22-95AF-BC77A21C21A8@kientzle.com>
User-Agent: Alpine 2.00 (BSF 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: MULTIPART/MIXED; BOUNDARY="56599777-2042966741-1282527429=:93799"
X-Spam-Status: No, score=-1.3 required=4.0 tests=AWL,BAYES_00,SPF_SOFTFAIL
	autolearn=no version=3.3.1
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.farley.org
Cc: =?ISO-8859-15?Q?Dag-Erling_Sm=F8rgrav?= <des@des.no>,
	freebsd-current@FreeBSD.org, Mike Haertel <mike@ducky.net>,
	gabor@FreeBSD.org
Subject: Re: why GNU grep is fast
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 23 Aug 2010 01:37:15 -0000

  This message is in MIME format.  The first part should be readable text,
  while the remaining parts are likely unreadable without MIME-aware tools.

--56599777-2042966741-1282527429=:93799
Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed
Content-Transfer-Encoding: 8BIT

On Sun, 22 Aug 2010, Tim Kientzle wrote:

> On Aug 22, 2010, at 9:30 AM, Sean C. Farley wrote:
>> On Sun, 22 Aug 2010, Dag-Erling Smørgrav wrote:
>>> Mike Haertel <mike@ducky.net> writes:
>>>> GNU grep uses the well-known Boyer-Moore algorithm, which looks 
>>>> first for the final letter of the target string, and uses a lookup 
>>>> table to tell it how far ahead it can skip in the input whenever it 
>>>> finds a non-matching character.
>>>
>>> Boyer-Moore is for fixed search strings.  I don't see how that 
>>> optimization can work with a regexp search unless the regexp is so 
>>> simple that you break it down into a small number of cases with 
>>> known length and final character.
>>
>> When I was working on making FreeGrep faster (years ago), I wrote 
>> down a few notes about possible algorithms, especially those that 
>> could be useful for fgrep functionality.  I am just passing these 
>> onto the list.
>>
>> Some algorithms:
>> 1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm
>> 2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
>> 3. GNU fgrep:  Commentz-Walter
>> 4. GLIMPSE:  http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore variant)
>>
>> Also, this may be a useful book:
>> http://www.dcc.uchile.cl/~gnavarro/FPMbook/
>
> And of course, Russ Cox' excellent series of articles starting at:
>
>  http://swtch.com/~rsc/regexp/regexp1.html

I saved that link from an E-mail earlier because it looked very 
interesting.

> Later on, he summarizes some of the existing implementations, 
> including comments about the Plan 9 implementation and his own RE2, 
> both of which efficiently handle international text (which seems to be 
> a major concern of Gabor's).

I believe Gabor is considering TRE for a good replacement regex library.

> The key comment in Mike's GNU grep notes is the one about not breaking 
> into lines.  That's simply double-scanning the input; instead, run the 
> matcher over blocks of text and, when it finds a match, work backwards 
> from the match to find the appropriate line beginning.  This is 
> efficient because most lines don't match.

I do like the idea.

> Boyer-Moore is great for fixed strings (a very common use case for 
> grep) and for more complex patterns that contain long fixed strings 
> (helps to discard most lines early).  Sophisticated regex matchers 
> implement a number of strategies and choose different ones depending 
> on the pattern.

That is what fastgrep (in bsdgrep) attempts to accomplish with very 
simply regex lines (beginning of line, end of line and dot).

> In the case of bsdgrep, it might make sense to use the regex library 
> for the general case but implement a hand-tuned search for fixed 
> strings that can be heavily optimized for that case.  Of course, 
> international text support complicates the picture; you have to 
> consider the input character set (if you want to auto-detect Unicode 
> encodings by looking for leading BOMs, for example, you either need to 
> translate the fixed-string pattern to match the input encoding or 
> vice-versa).

BTW, the fastgrep portion of bsdgrep is my fault/contribution to do a 
faster search bypassing the regex library.  :)  It certainly was not 
written with any encodings in mind; it was purely ASCII.  As I have not 
kept up with it, I do not know if anyone improved it or not.

Sean
-- 
scf@FreeBSD.org
--56599777-2042966741-1282527429=:93799--