Date: Mon, 23 Aug 2010 17:04:05 +0200 From: Gabor Kovesdan <gabor@FreeBSD.org> To: freebsd-current@freebsd.org Subject: What to learn from the BSD grep case [Was: why GNU grep is fast] Message-ID: <4C728DE5.4060809@FreeBSD.org> In-Reply-To: <201008210231.o7L2VRvI031700@ducky.net> References: <201008210231.o7L2VRvI031700@ducky.net>
next in thread | previous in thread | raw e-mail | index | archive | help
Hi all, there are some consequences that we can see from the grep case. Here I'd like to add a summary, which raises some questions. All comments are welcome. 1, When grep entered -CURRENT and bugs were found I immediately got kind bug reports and sharp criticism, as well. According to my understanding, -CURRENT is for development and it's fine to expose new pieces of work there but now I'm in doubt about that because of complaining people. On the other hand, an earlier version of BSD grep has been in the ports tree for a very long time and users reported some problems, which have been fixed but still, there is a lot of bugs there which haven't been reported that time. If users don't volunteer to test new pieces of code on a volunteer basis, somehow we have to make them test it, so I think committing BSD grep to -CURRENT was a good decision in the first round. 2, This issue also brought up some bottlenecks and potential optimization points (like memchr() and mmap), which other softwre may benefit from. This is another reason to let such pieces of work in. But unfortunately, this means that noone profiled another utilities because these bottlenecks remained undiscovered. Neither did I. It's a lesson that we have to learn from this particular case. 3, Because of point 2, we need more content to developers-handbook to help development with such ideas and best practices. It has been also raised on another list that our end-user documentation isn't that shiny and cool that it used to be and actually, developers-handbook has never been "finished" to be more or less complete. If someone looks at it, it looks like a sketch, not a book. I'll see if I can write a section on profiling. 4, We really need a good regex library. From the comments, it seems there's no such in the open source world. GNU libregex isn't efficient because GNU grep uses those workarounds that Mike kindly pointed out. Oniguruma was extremely slow when I checked it. PCRE supports Perl-style syntax with a POSIX-like API but not POSIX regex. Google RE2 is the same with additional egrep syntax but doesn't have support for standard POSIX regexes. Plan 9 regex only supports egrep syntax. It seems that TRE is the best choice. It is BSD-licensed, supports wchar and POSIX(ish) regexes and it is quite fast. I don't know the theoretical background of regex engines but I'm wondering if it's possible top provide an alternative API with byte-counted buffers and use the heuristical speedup with fixed string matching. As Mike pointed out the POSIX API is quite limiting because it works on NUL-terminated strings and not on byte-counted buffers, so we couldn't just do it with a POSIX-conformant library but it would be nice if we could implement it in such a library with an alternative interface. Gabor
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4C728DE5.4060809>