From owner-freebsd-current@FreeBSD.ORG  Mon Aug 23 10:23:09 2010
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@FreeBSD.org
Message-ID: <4C724C09.6090104@FreeBSD.org>
Date: Mon, 23 Aug 2010 12:23:05 +0200
From: Gabor Kovesdan <gabor@FreeBSD.org>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; pt-PT;
	rv:1.9.2.8) Gecko/20100802 Thunderbird/3.1.2
MIME-Version: 1.0
To: "Sean C. Farley" <scf@FreeBSD.org>
References: <201008210231.o7L2VRvI031700@ducky.net>
	<86k4nikglg.fsf@ds4.des.no>	<alpine.BSF.2.00.1008221111300.1989@terminus>	<628366E1-AF71-4A22-95AF-BC77A21C21A8@kientzle.com>
	<alpine.BSF.2.00.1008222030080.93799@thor.farley.org>
In-Reply-To: <alpine.BSF.2.00.1008222030080.93799@thor.farley.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: =?ISO-8859-1?Q?Dag-Erling_Sm=F8rgrav?= <des@des.no>,
	freebsd-current@FreeBSD.org, Mike Haertel <mike@ducky.net>
Subject: Re: why GNU grep is fast
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>


>
>> Later on, he summarizes some of the existing implementations, 
>> including comments about the Plan 9 implementation and his own RE2, 
>> both of which efficiently handle international text (which seems to 
>> be a major concern of Gabor's).
>
> I believe Gabor is considering TRE for a good replacement regex library.
Yes. Oniguruma is slow; Google RE2 supports only Perl and fgrep syntax, 
not standard regex; and the Plan 9 implementation, IIRC, supports only 
fgrep syntax and Unicode, not wchar_t in general.
>
>> The key comment in Mike's GNU grep notes is the one about not 
>> breaking into lines.  That's simply double-scanning the input; 
>> instead, run the matcher over blocks of text and, when it finds a 
>> match, work backwards from the match to find the appropriate line 
>> beginning.  This is efficient because most lines don't match.
>
> I do like the idea.
So do I.
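
To make the idea above concrete, here is a minimal sketch in Python. It is
not GNU grep's or bsdgrep's code. The function name matching_lines is
made up for illustration, and it assumes the whole buffer is already in
memory and that the pattern never matches a newline itself. The matcher
runs over the buffer as a whole; line boundaries are located only after a
match is found, by scanning backwards and forwards from the match.

```python
import re


def matching_lines(buf: bytes, pattern: bytes) -> list:
    """Return the lines of buf containing a match, without pre-splitting buf."""
    # MULTILINE keeps ^ and $ line-oriented even though we search the whole buffer.
    regex = re.compile(pattern, re.MULTILINE)
    lines = []
    pos = 0
    while pos < len(buf):
        m = regex.search(buf, pos)
        if m is None:
            break  # no further match anywhere in the rest of the buffer
        # Work backwards from the match to the beginning of its line.
        # rfind returns -1 when there is no earlier newline, so +1 gives 0.
        line_start = buf.rfind(b"\n", 0, m.start()) + 1
        # Work forwards to the end of the same line.
        nl = buf.find(b"\n", m.start())
        line_end = len(buf) if nl == -1 else nl
        lines.append(buf[line_start:line_end])
        # Resume right after this line so each line is reported at most once.
        pos = line_end + 1
    return lines
```

Non-matching lines are never delimited at all here, which is where the
saving comes from when most lines don't match.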
>
> BTW, the fastgrep portion of bsdgrep is my fault/contribution to do a 
> faster search bypassing the regex library.  :)  It certainly was not 
> written with any encodings in mind; it was purely ASCII.  As I have 
> not kept up with it, I do not know if anyone improved it or not.
>
It has been made wchar-compliant.
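
For illustration only, a rough Python sketch of the general idea of a
literal fast path that works on decoded characters rather than raw bytes.
This is not the bsdgrep fastgrep code. The names META, is_literal and
line_matches are hypothetical, the metacharacter set is a simplification,
and the input is assumed to be UTF-8.

```python
import re

# Characters treated as regex metacharacters in this sketch (a simplification).
META = set(".^$*+?{}[]()|\\")


def is_literal(pattern: str) -> bool:
    """True if the pattern contains no metacharacters and can skip the regex engine."""
    return not any(ch in META for ch in pattern)


def line_matches(pattern: str, raw_line: bytes, encoding: str = "utf-8") -> bool:
    """Match one line, using a plain substring search when the pattern is literal."""
    # Decode first so the comparison is per character, not per byte.
    line = raw_line.decode(encoding, errors="replace")
    if is_literal(pattern):
        return pattern in line  # fast path: no regex library involved
    return re.search(pattern, line) is not None
```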

Gabor