From owner-freebsd-current@FreeBSD.ORG Sun Aug 22 19:53:40 2010 Return-Path: Delivered-To: freebsd-current@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 323DF10656AA; Sun, 22 Aug 2010 19:53:40 +0000 (UTC) (envelope-from des@des.no) Received: from smtp.des.no (smtp.des.no [194.63.250.102]) by mx1.freebsd.org (Postfix) with ESMTP id E6C498FC18; Sun, 22 Aug 2010 19:53:39 +0000 (UTC) Received: from ds4.des.no (des.no [84.49.246.2]) by smtp.des.no (Postfix) with ESMTP id E2CB31FFC37; Sun, 22 Aug 2010 19:53:38 +0000 (UTC) Received: by ds4.des.no (Postfix, from userid 1001) id 9D4648456E; Sun, 22 Aug 2010 21:53:38 +0200 (CEST) From: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= To: "Sean C. Farley" References: <201008210231.o7L2VRvI031700@ducky.net> <86k4nikglg.fsf@ds4.des.no> Date: Sun, 22 Aug 2010 21:53:38 +0200 In-Reply-To: (Sean C. Farley's message of "Sun, 22 Aug 2010 11:30:01 -0500 (CDT)") Message-ID: <861v9qmnjx.fsf@ds4.des.no> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (berkeley-unix) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Cc: freebsd-current@FreeBSD.org, Mike Haertel , gabor@FreeBSD.org Subject: Re: why GNU grep is fast X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 22 Aug 2010 19:53:40 -0000 "Sean C. Farley" writes: > Some algorithms: > 1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm Aho-Corasick is not really a search algorithm, but an algorithm for constructing a table-driven finite state machine that will match either of the search strings you fed it. I believe it is less efficient than Boyer-Moore for small numbers of search terms, since it scans the entire input. I don't see the point in using it in grep, because grep already has an algorithm for constructing finite state machines: regcomp(3). > 2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm It doesn't seem to compare favorably to the far older Aho-Corasick. It uses slightly less memory, but memory is usually not an issue with grep. > 4. GLIMPSE: http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore > variant) Glimpse is a POS... and not really comparable, because grep is designed to search for a single search string in multiple texts, while glimpse is designed to search a large amount of text over and over with different search strings. I believe it uses suffix tables to construct its index, and Boyer-Moore only to locate specific matches, since the index lists only files, not exact positions. For anything other than fixed strings, it reverts to agrep, but I assume (I haven't looked at the code) that if the regexp has one or more fixed components, it uses those to narrow the search space before running agrep. DES --=20 Dag-Erling Sm=C3=B8rgrav - des@des.no