From owner-freebsd-current@FreeBSD.ORG Mon Aug 23 01:37:15 2010 Return-Path: Delivered-To: freebsd-current@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E98031065679; Mon, 23 Aug 2010 01:37:14 +0000 (UTC) (envelope-from scf@FreeBSD.org) Received: from mail.farley.org (mail.farley.org [IPv6:2001:470:1f0f:20:2::11]) by mx1.freebsd.org (Postfix) with ESMTP id 96D3C8FC12; Mon, 23 Aug 2010 01:37:14 +0000 (UTC) Received: from thor.farley.org (HPooka@thor.farley.org [IPv6:2001:470:1f0f:20:1::5]) by mail.farley.org (8.14.4/8.14.4) with ESMTP id o7N1b9Lf071486; Sun, 22 Aug 2010 20:37:09 -0500 (CDT) (envelope-from scf@FreeBSD.org) Date: Sun, 22 Aug 2010 20:37:09 -0500 (CDT) From: "Sean C. Farley" To: Tim Kientzle In-Reply-To: <628366E1-AF71-4A22-95AF-BC77A21C21A8@kientzle.com> Message-ID: References: <201008210231.o7L2VRvI031700@ducky.net> <86k4nikglg.fsf@ds4.des.no> <628366E1-AF71-4A22-95AF-BC77A21C21A8@kientzle.com> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="56599777-2042966741-1282527429=:93799" X-Spam-Status: No, score=-1.3 required=4.0 tests=AWL,BAYES_00,SPF_SOFTFAIL autolearn=no version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.farley.org Cc: =?ISO-8859-15?Q?Dag-Erling_Sm=F8rgrav?= , freebsd-current@FreeBSD.org, Mike Haertel , gabor@FreeBSD.org Subject: Re: why GNU grep is fast X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 23 Aug 2010 01:37:15 -0000 This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. --56599777-2042966741-1282527429=:93799 Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed Content-Transfer-Encoding: 8BIT On Sun, 22 Aug 2010, Tim Kientzle wrote: > On Aug 22, 2010, at 9:30 AM, Sean C. Farley wrote: >> On Sun, 22 Aug 2010, Dag-Erling Smørgrav wrote: >>> Mike Haertel writes: >>>> GNU grep uses the well-known Boyer-Moore algorithm, which looks >>>> first for the final letter of the target string, and uses a lookup >>>> table to tell it how far ahead it can skip in the input whenever it >>>> finds a non-matching character. >>> >>> Boyer-Moore is for fixed search strings. I don't see how that >>> optimization can work with a regexp search unless the regexp is so >>> simple that you break it down into a small number of cases with >>> known length and final character. >> >> When I was working on making FreeGrep faster (years ago), I wrote >> down a few notes about possible algorithms, especially those that >> could be useful for fgrep functionality. I am just passing these >> onto the list. >> >> Some algorithms: >> 1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm >> 2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm >> 3. GNU fgrep: Commentz-Walter >> 4. GLIMPSE: http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore variant) >> >> Also, this may be a useful book: >> http://www.dcc.uchile.cl/~gnavarro/FPMbook/ > > And of course, Russ Cox' excellent series of articles starting at: > > http://swtch.com/~rsc/regexp/regexp1.html I saved that link from an E-mail earlier because it looked very interesting. > Later on, he summarizes some of the existing implementations, > including comments about the Plan 9 implementation and his own RE2, > both of which efficiently handle international text (which seems to be > a major concern of Gabor's). I believe Gabor is considering TRE for a good replacement regex library. > The key comment in Mike's GNU grep notes is the one about not breaking > into lines. That's simply double-scanning the input; instead, run the > matcher over blocks of text and, when it finds a match, work backwards > from the match to find the appropriate line beginning. This is > efficient because most lines don't match. I do like the idea. > Boyer-Moore is great for fixed strings (a very common use case for > grep) and for more complex patterns that contain long fixed strings > (helps to discard most lines early). Sophisticated regex matchers > implement a number of strategies and choose different ones depending > on the pattern. That is what fastgrep (in bsdgrep) attempts to accomplish with very simply regex lines (beginning of line, end of line and dot). > In the case of bsdgrep, it might make sense to use the regex library > for the general case but implement a hand-tuned search for fixed > strings that can be heavily optimized for that case. Of course, > international text support complicates the picture; you have to > consider the input character set (if you want to auto-detect Unicode > encodings by looking for leading BOMs, for example, you either need to > translate the fixed-string pattern to match the input encoding or > vice-versa). BTW, the fastgrep portion of bsdgrep is my fault/contribution to do a faster search bypassing the regex library. :) It certainly was not written with any encodings in mind; it was purely ASCII. As I have not kept up with it, I do not know if anyone improved it or not. Sean -- scf@FreeBSD.org --56599777-2042966741-1282527429=:93799--