From owner-freebsd-current@FreeBSD.ORG  Mon Aug 23 08:12:14 2010
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 5006D10656A8;
	Mon, 23 Aug 2010 08:12:14 +0000 (UTC) (envelope-from des@des.no)
Received: from smtp.des.no (smtp.des.no [194.63.250.102])
	by mx1.freebsd.org (Postfix) with ESMTP id 11EF18FC17;
	Mon, 23 Aug 2010 08:12:13 +0000 (UTC)
Received: from ds4.des.no (des.no [84.49.246.2])
	by smtp.des.no (Postfix) with ESMTP id 08B281FFC37;
	Mon, 23 Aug 2010 08:12:13 +0000 (UTC)
Received: by ds4.des.no (Postfix, from userid 1001)
	id CF2C28452E; Mon, 23 Aug 2010 10:12:12 +0200 (CEST)
From: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= <des@des.no>
To: Mike Haertel <mike@ducky.net>
References: <201008210231.o7L2VRvI031700@ducky.net>
	<86k4nikglg.fsf@ds4.des.no> <201008221822.o7MIMCRN050408@ducky.net>
	<86lj7yl6n6.fsf@ds4.des.no> <201008222211.o7MMBeC8065026@ducky.net>
Date: Mon, 23 Aug 2010 10:12:12 +0200
In-Reply-To: <201008222211.o7MMBeC8065026@ducky.net> (Mike Haertel's message
	of "Sun, 22 Aug 2010 15:11:40 -0700")
Message-ID: <867hjhzr1f.fsf@ds4.des.no>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (berkeley-unix)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Cc: freebsd-current@freebsd.org, gabor@freebsd.org
Subject: Re: why GNU grep is fast
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 23 Aug 2010 08:12:14 -0000

Mike Haertel <mike@ducky.net> writes:
> Dag-Erling Sm=C3=B8rgrav <des@des.no> writes:
> > You don't really need to "isolate the containing line" unless you have
> > an actual match, do you?
> Theoretically no.  However, suppose the pattern was /foo.*blah/.
> The Boyer-Moore search will be for "blah", since that's the longest
> fixed substring.  But verifying a match for the full regexp either
> requires a regexp matcher with the feature "start here, at THIS point
> in the middle of the RE and THAT point in the middle of the buffer,
> and match backwards and forwards", or else running a more standard
> RE matcher starting from the beginning of the line.

Stupidly enough, I only considered the case where the fixed search
string is at the end of the regexp :)  For the general case, you'd have
to either build a second FSA that mirrors the first, or design your
engine and data structures in such a way that the FSA can be traversed
in either direction.  Then you note which states correspond to the
beginning and end of the fixed substring, and run the FSA forward from
the end and backward from the beginning.  That could prove advantageous
when the input consists of very long lines and the frequency of partial
matches is relatively high; large files with very long lines are very
common in bioinformatics.

> As you mentioned, you can avoid searching forwards for the next
> newline if your RE matcher supports using newline as an exit marker.

Umm, I don't understand why it needs to support using *anything* as an
exit marker.  The newline character will simply not match any of the
transitions from whichever state the FSA is currently in, and neither
will the null character.  The only bit that needs to know about them is
the bit that handles end-of-line and end-of-word anchors.

DES
--=20
Dag-Erling Sm=C3=B8rgrav - des@des.no