From owner-freebsd-current@FreeBSD.ORG Sun Aug 22 17:31:50 2010 Return-Path: Delivered-To: current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 559861065698 for ; Sun, 22 Aug 2010 17:31:50 +0000 (UTC) (envelope-from tim@kientzle.com) Received: from mail-pw0-f54.google.com (mail-pw0-f54.google.com [209.85.160.54]) by mx1.freebsd.org (Postfix) with ESMTP id 3377D8FC15 for ; Sun, 22 Aug 2010 17:31:49 +0000 (UTC) Received: by pwi8 with SMTP id 8so450425pwi.13 for ; Sun, 22 Aug 2010 10:31:49 -0700 (PDT) Received: by 10.143.9.6 with SMTP id m6mr3184455wfi.310.1282498309659; Sun, 22 Aug 2010 10:31:49 -0700 (PDT) Received: from [10.123.2.181] (99-74-170-25.lightspeed.sntcca.sbcglobal.net [99.74.170.25]) by mx.google.com with ESMTPS id v38sm7179056wfh.0.2010.08.22.10.31.47 (version=TLSv1/SSLv3 cipher=RC4-MD5); Sun, 22 Aug 2010 10:31:48 -0700 (PDT) Mime-Version: 1.0 (Apple Message framework v1081) Content-Type: text/plain; charset=us-ascii From: Tim Kientzle In-Reply-To: <201008221502.o7MF2bfZ086738@hergotha.csail.mit.edu> Date: Sun, 22 Aug 2010 10:31:46 -0700 Content-Transfer-Encoding: 7bit Message-Id: References: <201008210231.o7L2VRvI031700@ducky.net> <86k4nikglg.fsf@ds4.des.no> <201008221502.o7MF2bfZ086738@hergotha.csail.mit.edu> To: Garrett Wollman X-Mailer: Apple Mail (2.1081) Cc: current@freebsd.org Subject: Re: why GNU grep is fast X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 22 Aug 2010 17:31:50 -0000 On Aug 22, 2010, at 8:02 AM, Garrett Wollman wrote: > In article <86k4nikglg.fsf@ds4.des.no> you write: >> Mike Haertel writes: >>> GNU grep uses the well-known Boyer-Moore algorithm, which looks >>> first for the final letter of the target string, and uses a lookup >>> table to tell it how far ahead it can skip in the input whenever >>> it finds a non-matching character. >> >> Boyer-Moore is for fixed search strings. I don't see how that >> optimization can work with a regexp search unless the regexp is so >> simple that you break it down into a small number of cases with known >> length and final character. > > The common case of regexps used in the "grep" utility (and, for > obvious reasons, universal in the "fgrep" utility) is fixed-length > search strings. Even non-fixed-length regexps typically consist of > one one or two variable-length parts. This is an important point: A good grep implementation will use different strategies depending on the input regexp. Fixed-string matching is a very important special case. > Matching a completely > variable-length regexp is just hard, computationally, See Russ Cox' articles for why this is not true. It does require considerable sophistication to build an efficient DFA but the actual matcher, once built, can run very fast indeed: http://swtch.com/~rsc/regexp/ Tim