From: Gabor Kovesdan
Date: Mon, 23 Aug 2010 17:04:05 +0200
To: freebsd-current@freebsd.org
Subject: What to learn from the BSD grep case [Was: why GNU grep is fast]
Message-ID: <4C728DE5.4060809@FreeBSD.org>
In-Reply-To: <201008210231.o7L2VRvI031700@ducky.net>

Hi all,

there are some consequences that we
can see from the grep case. Here I'd like to add a summary, which raises some questions. All comments are welcome.

1, When grep entered -CURRENT and bugs were found, I immediately got kind bug reports and sharp criticism as well. As I understand it, -CURRENT is for development and it's fine to expose new pieces of work there, but the complaints now make me doubt that. On the other hand, an earlier version of BSD grep has been in the ports tree for a very long time. Users reported some problems there, which have been fixed, but there were still a lot of bugs that went unreported at that time. If users don't volunteer to test new code, we somehow have to make them test it, so I think committing BSD grep to -CURRENT was a good decision in the first round.

2, This issue also brought up some bottlenecks and potential optimization points (like memchr() and mmap) that other software may benefit from. This is another reason to let such pieces of work in. Unfortunately, the fact that these bottlenecks remained undiscovered means that no one has profiled the other utilities. Neither did I. It's a lesson that we have to learn from this particular case.

3, Because of point 2, we need more content in the developers-handbook to help development with such ideas and best practices. It has also been raised on another list that our end-user documentation isn't as shiny and cool as it used to be, and the developers-handbook has never actually been "finished" to a more or less complete state. If someone looks at it, it looks like a sketch, not a book. I'll see if I can write a section on profiling.

4, We really need a good regex library. From the comments, it seems there is none in the open source world. GNU libregex evidently isn't efficient, since GNU grep has to use the workarounds that Mike kindly pointed out. Oniguruma was extremely slow when I checked it. PCRE supports Perl-style syntax with a POSIX-like API but not POSIX regex.
Google RE2 is the same, with additional egrep syntax, but doesn't support standard POSIX regexes. Plan 9 regex only supports egrep syntax. It seems that TRE is the best choice. It is BSD-licensed, supports wchar and POSIX(ish) regexes, and it is quite fast.

I don't know the theoretical background of regex engines, but I'm wondering if it's possible to provide an alternative API with byte-counted buffers and use the heuristic speedup with fixed string matching. As Mike pointed out, the POSIX API is quite limiting because it works on NUL-terminated strings and not on byte-counted buffers, so we couldn't do this with a strictly POSIX-conformant interface, but it would be nice if we could implement it in such a library with an alternative interface.

Gabor
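
To make the byte-counted idea from point 4 a bit more concrete, here is a minimal sketch of a fixed-string search that takes explicit lengths instead of relying on NUL termination, and lets memchr() skip to candidate positions. The name find_fixed() and its signature are hypothetical, made up for this illustration; it is not part of TRE, GNU grep, or any existing regex library, and a real engine would presumably use something smarter than memchr() plus memcmp().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from any existing library): find the first
 * occurrence of pat[0..patlen) in buf[0..buflen). Both are byte-counted,
 * so embedded NUL bytes are treated like any other byte.
 * Returns a pointer into buf, or NULL if there is no match.
 */
static const char *
find_fixed(const char *buf, size_t buflen, const char *pat, size_t patlen)
{
	const char *p = buf;
	const char *end = buf + buflen;

	assert(buf != NULL && pat != NULL);
	if (patlen == 0)
		return (buf);		/* empty pattern matches at offset 0 */
	if (patlen > buflen)
		return (NULL);

	/* Only positions where the whole pattern still fits can match. */
	while ((size_t)(end - p) >= patlen) {
		/* Let memchr() skip quickly to the next candidate byte. */
		p = memchr(p, pat[0], (size_t)(end - p) - patlen + 1);
		if (p == NULL)
			return (NULL);
		if (memcmp(p, pat, patlen) == 0)
			return (p);
		p++;
	}
	return (NULL);
}
```

A caller would pass a buffer (for example an mmap'ed file) together with its length. A regex engine with an alternative interface could, under the same assumption, run such a search first for a literal that every match must contain and only invoke the full matcher around the hits.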