Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 22 Aug 2010 21:53:38 +0200
From:      =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= <des@des.no>
To:        "Sean C. Farley" <scf@FreeBSD.org>
Cc:        freebsd-current@FreeBSD.org, Mike Haertel <mike@ducky.net>, gabor@FreeBSD.org
Subject:   Re: why GNU grep is fast
Message-ID:  <861v9qmnjx.fsf@ds4.des.no>
In-Reply-To: <alpine.BSF.2.00.1008221111300.1989@terminus> (Sean C. Farley's message of "Sun, 22 Aug 2010 11:30:01 -0500 (CDT)")
References:  <201008210231.o7L2VRvI031700@ducky.net> <86k4nikglg.fsf@ds4.des.no> <alpine.BSF.2.00.1008221111300.1989@terminus>

next in thread | previous in thread | raw e-mail | index | archive | help
"Sean C. Farley" <scf@FreeBSD.org> writes:
> Some algorithms:
> 1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm

Aho-Corasick is not really a search algorithm, but an algorithm for
constructing a table-driven finite state machine that will match either
of the search strings you fed it.  I believe it is less efficient than
Boyer-Moore for small numbers of search terms, since it scans the entire
input.  I don't see the point in using it in grep, because grep already
has an algorithm for constructing finite state machines: regcomp(3).

> 2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm

It doesn't seem to compare favorably to the far older Aho-Corasick.  It
uses slightly less memory, but memory is usually not an issue with grep.

> 4. GLIMPSE:  http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore
> variant)

Glimpse is a POS...  and not really comparable, because grep is designed
to search for a single search string in multiple texts, while glimpse is
designed to search a large amount of text over and over with different
search strings.  I believe it uses suffix tables to construct its index,
and Boyer-Moore only to locate specific matches, since the index lists
only files, not exact positions.  For anything other than fixed strings,
it reverts to agrep, but I assume (I haven't looked at the code) that if
the regexp has one or more fixed components, it uses those to narrow the
search space before running agrep.

DES
--=20
Dag-Erling Sm=C3=B8rgrav - des@des.no



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?861v9qmnjx.fsf>