From owner-freebsd-current@FreeBSD.ORG Sun Aug 22 16:30:08 2010 Return-Path: Delivered-To: freebsd-current@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id F20A61065693; Sun, 22 Aug 2010 16:30:07 +0000 (UTC) (envelope-from scf@FreeBSD.org) Received: from mail.farley.org (mail.farley.org [IPv6:2001:470:1f0f:20:2::11]) by mx1.freebsd.org (Postfix) with ESMTP id 9E0E58FC0A; Sun, 22 Aug 2010 16:30:07 +0000 (UTC) Received: from [192.168.1.201] ([192.168.1.201]) by mail.farley.org (8.14.4/8.14.4) with ESMTP id o7MGU1s0063186; Sun, 22 Aug 2010 11:30:01 -0500 (CDT) (envelope-from scf@FreeBSD.org) Date: Sun, 22 Aug 2010 11:30:01 -0500 (CDT) From: "Sean C. Farley" To: =?ISO-8859-15?Q?Dag-Erling_Sm=F8rgrav?= In-Reply-To: <86k4nikglg.fsf@ds4.des.no> Message-ID: References: <201008210231.o7L2VRvI031700@ducky.net> <86k4nikglg.fsf@ds4.des.no> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="4255886656-1237390696-1282494601=:1989" X-Spam-Status: No, score=-2.9 required=4.0 tests=ALL_TRUSTED,BAYES_00 autolearn=ham version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.farley.org Cc: freebsd-current@FreeBSD.org, Mike Haertel , gabor@FreeBSD.org Subject: Re: why GNU grep is fast X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 22 Aug 2010 16:30:08 -0000 This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. --4255886656-1237390696-1282494601=:1989 Content-Type: TEXT/PLAIN; charset=utf-8; format=flowed Content-Transfer-Encoding: 8BIT On Sun, 22 Aug 2010, Dag-Erling Smørgrav wrote: > Mike Haertel writes: >> GNU grep uses the well-known Boyer-Moore algorithm, which looks first >> for the final letter of the target string, and uses a lookup table to >> tell it how far ahead it can skip in the input whenever it finds a >> non-matching character. > > Boyer-Moore is for fixed search strings. I don't see how that > optimization can work with a regexp search unless the regexp is so > simple that you break it down into a small number of cases with known > length and final character. When I was working on making FreeGrep faster (years ago), I wrote down a few notes about possible algorithms, especially those that could be useful for fgrep functionality. I am just passing these onto the list. Some algorithms: 1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm 2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm 3. GNU fgrep: Commentz-Walter 4. GLIMPSE: http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore variant) Also, this may be a useful book: http://www.dcc.uchile.cl/~gnavarro/FPMbook/ Sean -- scf@FreeBSD.org --4255886656-1237390696-1282494601=:1989--