From owner-freebsd-current@FreeBSD.ORG Mon Aug 23 08:12:14 2010 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5006D10656A8; Mon, 23 Aug 2010 08:12:14 +0000 (UTC) (envelope-from des@des.no) Received: from smtp.des.no (smtp.des.no [194.63.250.102]) by mx1.freebsd.org (Postfix) with ESMTP id 11EF18FC17; Mon, 23 Aug 2010 08:12:13 +0000 (UTC) Received: from ds4.des.no (des.no [84.49.246.2]) by smtp.des.no (Postfix) with ESMTP id 08B281FFC37; Mon, 23 Aug 2010 08:12:13 +0000 (UTC) Received: by ds4.des.no (Postfix, from userid 1001) id CF2C28452E; Mon, 23 Aug 2010 10:12:12 +0200 (CEST) From: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= To: Mike Haertel References: <201008210231.o7L2VRvI031700@ducky.net> <86k4nikglg.fsf@ds4.des.no> <201008221822.o7MIMCRN050408@ducky.net> <86lj7yl6n6.fsf@ds4.des.no> <201008222211.o7MMBeC8065026@ducky.net> Date: Mon, 23 Aug 2010 10:12:12 +0200 In-Reply-To: <201008222211.o7MMBeC8065026@ducky.net> (Mike Haertel's message of "Sun, 22 Aug 2010 15:11:40 -0700") Message-ID: <867hjhzr1f.fsf@ds4.des.no> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (berkeley-unix) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Cc: freebsd-current@freebsd.org, gabor@freebsd.org Subject: Re: why GNU grep is fast X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 23 Aug 2010 08:12:14 -0000 Mike Haertel writes: > Dag-Erling Sm=C3=B8rgrav writes: > > You don't really need to "isolate the containing line" unless you have > > an actual match, do you? > Theoretically no. However, suppose the pattern was /foo.*blah/. > The Boyer-Moore search will be for "blah", since that's the longest > fixed substring. But verifying a match for the full regexp either > requires a regexp matcher with the feature "start here, at THIS point > in the middle of the RE and THAT point in the middle of the buffer, > and match backwards and forwards", or else running a more standard > RE matcher starting from the beginning of the line. Stupidly enough, I only considered the case where the fixed search string is at the end of the regexp :) For the general case, you'd have to either build a second FSA that mirrors the first, or design your engine and data structures in such a way that the FSA can be traversed in either direction. Then you note which states correspond to the beginning and end of the fixed substring, and run the FSA forward from the end and backward from the beginning. That could prove advantageous when the input consists of very long lines and the frequency of partial matches is relatively high; large files with very long lines are very common in bioinformatics. > As you mentioned, you can avoid searching forwards for the next > newline if your RE matcher supports using newline as an exit marker. Umm, I don't understand why it needs to support using *anything* as an exit marker. The newline character will simply not match any of the transitions from whichever state the FSA is currently in, and neither will the null character. The only bit that needs to know about them is the bit that handles end-of-line and end-of-word anchors. DES --=20 Dag-Erling Sm=C3=B8rgrav - des@des.no