From owner-freebsd-current@FreeBSD.ORG Mon Aug 23 07:56:33 2010 Return-Path: Delivered-To: freebsd-current@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6EAA510656A9; Mon, 23 Aug 2010 07:56:33 +0000 (UTC) (envelope-from des@des.no) Received: from smtp.des.no (smtp.des.no [194.63.250.102]) by mx1.freebsd.org (Postfix) with ESMTP id 2DBA48FC27; Mon, 23 Aug 2010 07:56:32 +0000 (UTC) Received: from ds4.des.no (des.no [84.49.246.2]) by smtp.des.no (Postfix) with ESMTP id 8E07E1FFC37; Mon, 23 Aug 2010 07:56:31 +0000 (UTC) Received: by ds4.des.no (Postfix, from userid 1001) id 5FD848452E; Mon, 23 Aug 2010 09:56:31 +0200 (CEST) From: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= To: "Sean C. Farley" References: <201008210231.o7L2VRvI031700@ducky.net> <86k4nikglg.fsf@ds4.des.no> <861v9qmnjx.fsf@ds4.des.no> Date: Mon, 23 Aug 2010 09:56:30 +0200 In-Reply-To: (Sean C. Farley's message of "Sun, 22 Aug 2010 20:29:01 -0500 (CDT)") Message-ID: <86bp8tzrrl.fsf@ds4.des.no> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (berkeley-unix) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Cc: freebsd-current@FreeBSD.org, Mike Haertel , gabor@FreeBSD.org Subject: Re: why GNU grep is fast X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 23 Aug 2010 07:56:33 -0000 "Sean C. Farley" writes: > Dag-Erling Sm=C3=B8rgrav writes: > > Aho-Corasick is not really a search algorithm, but an algorithm for > > constructing a table-driven finite state machine that will match > > either of the search strings you fed it. I believe it is less > > efficient than Boyer-Moore for small numbers of search terms, since > > it scans the entire input. I don't see the point in using it in > > grep, because grep already has an algorithm for constructing finite > > state machines: regcomp(3). > especially those that could be useful for fgrep functionality Not entirely sure what you mean by that, but in most cases, I think Boyer-Moore is a better fit for fgrep. For large numbers of search terms, I might use Aho-Corasick... if it weren't for the fact that, as mentioned, grep already has a regexp engine, which is a superset of Aho-Corasick. It might be a tad slower at building the FSA, because it's more general, and the FSA might occupy a tad more memory, but the complexity is *exactly* the same, and adding Aho-Corasick just for fgrep is, for lack of a better word, pedantry. For all you know, the increased code size could very well offset whatever performance advantage Aho-Corasick might offer. > Glimpse may be a POS; I have not used it personally. I only noted its > algorithm for possible use within fgrep. Apples and oranges. The problem glimpse tries to solve (fixed corpus, variable search terms) is precisely the opposite of the one fgrep tries to solve (fixed search terms, variable corpus). Glimpse does include grep-like functionality, but in the form of agrep, which is designed for approximate matching. DES --=20 Dag-Erling Sm=C3=B8rgrav - des@des.no