From owner-freebsd-current@FreeBSD.ORG Sun Aug 22 22:11:50 2010 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6EE821065697; Sun, 22 Aug 2010 22:11:50 +0000 (UTC) (envelope-from mike@ducky.net) Received: from ducky.net (ducky.net [71.216.22.205]) by mx1.freebsd.org (Postfix) with ESMTP id 345BF8FC20; Sun, 22 Aug 2010 22:11:49 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by ducky.net (8.14.3/8.14.3) with ESMTP id o7MMBeC8065026; Sun, 22 Aug 2010 15:11:40 -0700 (PDT) (envelope-from mike@ducky.net) Message-Id: <201008222211.o7MMBeC8065026@ducky.net> X-Mailer: exmh version 2.7.2 01/07/2005 with nmh-1.2 To: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= In-reply-to: <86lj7yl6n6.fsf@ds4.des.no> References: <201008210231.o7L2VRvI031700@ducky.net> <86k4nikglg.fsf@ds4.des.no> <201008221822.o7MIMCRN050408@ducky.net> <86lj7yl6n6.fsf@ds4.des.no> Comments: In-reply-to =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= message dated "Sun, 22 Aug 2010 22:44:13 +0200." Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: quoted-printable Date: Sun, 22 Aug 2010 15:11:40 -0700 From: Mike Haertel X-Mailman-Approved-At: Sun, 22 Aug 2010 23:25:37 +0000 Cc: mike@ducky.net, gabor@freebsd.org, freebsd-current@freebsd.org Subject: Re: why GNU grep is fast X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 22 Aug 2010 22:11:50 -0000 Dag-Erling Sm=F8rgrav writes: > Mike Haertel writes: > > For example if your regex is /foo.*bar/, the initial Boyer-Moore sear= ch > > is (probably) searching for foo. > > > > If the initial search succeeds, GNU grep isolates the containing line= , > > and then runs the full regex matcher on that line to make sure. >=20 > You don't really need to =22isolate the containing line=22 unless you h= ave > an actual match, do you? There are two cases: Theoretically no. However, suppose the pattern was /foo.*blah/. The Boyer-Moore search will be for =22blah=22, since that's the longest fixed substring. But verifying a match for the full regexp either requires a regexp matcher with the feature =22start here, at THIS point in the middle of the RE and THAT point in the middle of the buffer, and match backwards and forwards=22, or else running a more standard RE matcher starting from the beginning of the line. So, in practice you pretty much have to at least search backwards for the preceding newline. As you mentioned, you can avoid searching forwards for the next newline if your RE matcher supports using newline as an exit marker. But if the workload characteristics are that matching lines are scarce compared to the input, this is an optimization that just won't matter much either way.