Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 22 Aug 2010 15:11:40 -0700
From:      Mike Haertel <mike@ducky.net>
To:        =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= <des@des.no>
Cc:        mike@ducky.net, gabor@freebsd.org, freebsd-current@freebsd.org
Subject:   Re: why GNU grep is fast 
Message-ID:  <201008222211.o7MMBeC8065026@ducky.net>
In-Reply-To: <86lj7yl6n6.fsf@ds4.des.no> 
References:  <201008210231.o7L2VRvI031700@ducky.net> <86k4nikglg.fsf@ds4.des.no> <201008221822.o7MIMCRN050408@ducky.net> <86lj7yl6n6.fsf@ds4.des.no>

next in thread | previous in thread | raw e-mail | index | archive | help
Dag-Erling Sm=F8rgrav <des=40des.no> writes:
> Mike Haertel <mike=40ducky.net> writes:
> > For example if your regex is /foo.*bar/, the initial Boyer-Moore sear=
ch
> > is (probably) searching for foo.
> >
> > If the initial search succeeds, GNU grep isolates the containing line=
,
> > and then runs the full regex matcher on that line to make sure.
>=20
> You don't really need to =22isolate the containing line=22 unless you h=
ave
> an actual match, do you?  There are two cases:

Theoretically no.  However, suppose the pattern was /foo.*blah/.
The Boyer-Moore search will be for =22blah=22, since that's the longest
fixed substring.  But verifying a match for the full regexp either
requires a regexp matcher with the feature =22start here, at THIS point
in the middle of the RE and THAT point in the middle of the buffer,
and match backwards and forwards=22, or else running a more standard
RE matcher starting from the beginning of the line.

So, in practice you pretty much have to at least search backwards
for the preceding newline.

As you mentioned, you can avoid searching forwards for the next
newline if your RE matcher supports using newline as an exit marker.
But if the workload characteristics are that matching lines are
scarce compared to the input, this is an optimization that just
won't matter much either way.




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?201008222211.o7MMBeC8065026>