From owner-freebsd-current@FreeBSD.ORG Sun Aug 22 15:26:19 2010 Return-Path: Delivered-To: current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E0F63106564A for ; Sun, 22 Aug 2010 15:26:19 +0000 (UTC) (envelope-from wollman@hergotha.csail.mit.edu) Received: from hergotha.csail.mit.edu (hergotha.csail.mit.edu [66.92.79.170]) by mx1.freebsd.org (Postfix) with ESMTP id 928CF8FC17 for ; Sun, 22 Aug 2010 15:26:19 +0000 (UTC) Received: from hergotha.csail.mit.edu (localhost [127.0.0.1]) by hergotha.csail.mit.edu (8.14.4/8.14.4) with ESMTP id o7MF2bvm086739; Sun, 22 Aug 2010 11:02:37 -0400 (EDT) (envelope-from wollman@hergotha.csail.mit.edu) Received: (from wollman@localhost) by hergotha.csail.mit.edu (8.14.4/8.14.4/Submit) id o7MF2bfZ086738; Sun, 22 Aug 2010 11:02:37 -0400 (EDT) (envelope-from wollman) Date: Sun, 22 Aug 2010 11:02:37 -0400 (EDT) From: Garrett Wollman Message-Id: <201008221502.o7MF2bfZ086738@hergotha.csail.mit.edu> To: des@des.no In-Reply-To: <86k4nikglg.fsf@ds4.des.no> References: <201008210231.o7L2VRvI031700@ducky.net> <86k4nikglg.fsf@ds4.des.no> Organization: X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.5 (hergotha.csail.mit.edu [127.0.0.1]); Sun, 22 Aug 2010 11:02:37 -0400 (EDT) X-Spam-Status: No, score=-1.0 required=5.0 tests=ALL_TRUSTED autolearn=disabled version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on hergotha.csail.mit.edu X-Mailman-Approved-At: Sun, 22 Aug 2010 15:41:38 +0000 Cc: current@freebsd.org Subject: Re: why GNU grep is fast X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 22 Aug 2010 15:26:20 -0000 In article <86k4nikglg.fsf@ds4.des.no> you write: >Mike Haertel writes: >> GNU grep uses the well-known Boyer-Moore algorithm, which looks >> first for the final letter of the target string, and uses a lookup >> table to tell it how far ahead it can skip in the input whenever >> it finds a non-matching character. > >Boyer-Moore is for fixed search strings. I don't see how that >optimization can work with a regexp search unless the regexp is so >simple that you break it down into a small number of cases with known >length and final character. The common case of regexps used in the "grep" utility (and, for obvious reasons, universal in the "fgrep" utility) is fixed-length search strings. Even non-fixed-length regexps typically consist of one one or two variable-length parts. Matching a completely variable-length regexp is just hard, computationally, so it's OK for it to be slower. There are other tricks you can do, such as turning the anchors ^ and $ into explicit newlines in your search -- "^foo" is a very common regexp to search for, and it's really a fixed-string search for "\nfoo" which is entirely amenable to the B-M treatment. You just have to remember that a matched newline isn't part of the result. The GNU regexp library also uses the Boyer-Moore (or is it Boyer-Moore-Gosper?) strategy. -GAWollman