From: Gabor Kovesdan
Date: Mon, 23 Aug 2010 12:23:05 +0200
To: Sean C. Farley
Cc: Dag-Erling Smørgrav, freebsd-current@FreeBSD.org, Mike Haertel
Subject: Re: why GNU grep is fast
Message-ID: <4C724C09.6090104@FreeBSD.org>
List-Id: Discussions about the use of FreeBSD-current

>
>> Later on, he summarizes some of the existing implementations,
>> including comments about the Plan 9 implementation and his own RE2,
>> both of which efficiently handle international text (which seems to
>> be a major concern of Gabor's).
>
> I believe Gabor is considering TRE for a good replacement regex library.

Yes. Oniguruma is slow; Google RE2 only supports Perl and fgrep syntax
but not standard regex; and the Plan 9 implementation, IIRC, only
supports fgrep syntax and Unicode but not wchar_t in general.

>
>> The key comment in Mike's GNU grep notes is the one about not
>> breaking into lines. That's simply double-scanning the input;
>> instead, run the matcher over blocks of text and, when it finds a
>> match, work backwards from the match to find the appropriate line
>> beginning. This is efficient because most lines don't match.
>
> I do like the idea.

So do I.

>
> BTW, the fastgrep portion of bsdgrep is my fault/contribution to do a
> faster search bypassing the regex library. :) It certainly was not
> written with any encodings in mind; it was purely ASCII. As I have
> not kept up with it, I do not know if anyone improved it or not.
>

It has been made wchar-compliant.

Gabor
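
For readers of the archive, the "don't break into lines" idea quoted above can be sketched as follows. This is only an illustration under stated assumptions, not code from GNU grep or bsdgrep: the function name matching_lines is hypothetical, bytes.find stands in for whatever fast matcher is actually used, the pattern is assumed to be a fixed string containing no newline, and the whole input is assumed to sit in one buffer (handling a line that straddles two buffer refills is left out).

```python
def matching_lines(buf: bytes, pattern: bytes):
    """Yield each line of buf that contains pattern.

    The buffer is never split into lines up front. The matcher runs over
    the whole block, and line boundaries are located only around a hit.
    """
    if not pattern or b"\n" in pattern:
        # Assumption of this sketch: a non-empty, single-line fixed string.
        raise ValueError("pattern must be non-empty and contain no newline")

    pos = 0
    while True:
        hit = buf.find(pattern, pos)  # stand-in for the fast matcher
        if hit == -1:
            return

        # Work backwards from the match to the start of its line.
        # rfind returns -1 when there is no earlier newline, so +1 gives 0.
        start = buf.rfind(b"\n", 0, hit) + 1

        # Work forwards to the end of the line (or the end of the buffer).
        end = buf.find(b"\n", hit)
        if end == -1:
            end = len(buf)

        yield buf[start:end]

        # Resume after this line so it is reported at most once.
        pos = end + 1


if __name__ == "__main__":
    data = b"alpha\nbeta gamma\ndelta\ngamma ray\n"
    for line in matching_lines(data, b"gamma"):
        print(line.decode())
```

Non-matching lines are never examined for their boundaries at all, which is where the saving comes from when most lines don't match.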