From owner-freebsd-current@FreeBSD.ORG  Sun Aug 22 19:53:40 2010
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 323DF10656AA;
	Sun, 22 Aug 2010 19:53:40 +0000 (UTC) (envelope-from des@des.no)
Received: from smtp.des.no (smtp.des.no [194.63.250.102])
	by mx1.freebsd.org (Postfix) with ESMTP id E6C498FC18;
	Sun, 22 Aug 2010 19:53:39 +0000 (UTC)
Received: from ds4.des.no (des.no [84.49.246.2])
	by smtp.des.no (Postfix) with ESMTP id E2CB31FFC37;
	Sun, 22 Aug 2010 19:53:38 +0000 (UTC)
Received: by ds4.des.no (Postfix, from userid 1001)
	id 9D4648456E; Sun, 22 Aug 2010 21:53:38 +0200 (CEST)
From: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= <des@des.no>
To: "Sean C. Farley" <scf@FreeBSD.org>
References: <201008210231.o7L2VRvI031700@ducky.net> <86k4nikglg.fsf@ds4.des.no>
	<alpine.BSF.2.00.1008221111300.1989@terminus>
Date: Sun, 22 Aug 2010 21:53:38 +0200
In-Reply-To: <alpine.BSF.2.00.1008221111300.1989@terminus> (Sean C. Farley's
	message of "Sun, 22 Aug 2010 11:30:01 -0500 (CDT)")
Message-ID: <861v9qmnjx.fsf@ds4.des.no>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (berkeley-unix)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Cc: freebsd-current@FreeBSD.org, Mike Haertel <mike@ducky.net>,
	gabor@FreeBSD.org
Subject: Re: why GNU grep is fast
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 22 Aug 2010 19:53:40 -0000

"Sean C. Farley" <scf@FreeBSD.org> writes:
> Some algorithms:
> 1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm

Aho-Corasick is not really a search algorithm, but an algorithm for
constructing a table-driven finite state machine that will match either
of the search strings you fed it.  I believe it is less efficient than
Boyer-Moore for small numbers of search terms, since it scans the entire
input.  I don't see the point in using it in grep, because grep already
has an algorithm for constructing finite state machines: regcomp(3).

> 2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm

It doesn't seem to compare favorably to the far older Aho-Corasick.  It
uses slightly less memory, but memory is usually not an issue with grep.

> 4. GLIMPSE:  http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore
> variant)

Glimpse is a POS...  and not really comparable, because grep is designed
to search for a single search string in multiple texts, while glimpse is
designed to search a large amount of text over and over with different
search strings.  I believe it uses suffix tables to construct its index,
and Boyer-Moore only to locate specific matches, since the index lists
only files, not exact positions.  For anything other than fixed strings,
it reverts to agrep, but I assume (I haven't looked at the code) that if
the regexp has one or more fixed components, it uses those to narrow the
search space before running agrep.

DES
--=20
Dag-Erling Sm=C3=B8rgrav - des@des.no