Date: Mon, 09 May 2011 01:29:32 +0100 From: Gabor Kovesdan <gabor@kovesdan.org> To: hackers@FreeBSD.org Cc: "Pedro F. Giffuni" <giffunip@yahoo.com>, Brooks Davis <brooks@freebsd.org> Subject: [RFC] Replacing our regex implementation Message-ID: <4DC7356C.20905@kovesdan.org>
next in thread | raw e-mail | index | archive | help
Hi Folks, I've been given the opportunity to work in GSoC 2011 to replace our base regex library with a more modern one and given that the regex code is something essential probably there are lots of interested parties so I decided to open a thread here about my plans and the approach that I propose. My wiki page is here, you can also take a look at it: http://wiki.freebsd.org/G%C3%A1borSoC2011 So let's see what this project is about... First some background information. Why to replace? - it is not efficient - it does not support wide characters - it comes from old and unmaintained vendor code What to use instead? There are some free implementations around but not all of the suit our needs. - PCRE is not POSIX regex, it only has a POSIX interface but actually the regex syntax is not the same - Oniguruma is slooooooooooooow - The Plan9 implementation doesn't support full POSIX regexes - TRE is BSD-licensed, fast, supports wchar, well-documented and still actively maintained, so it is the best candidate Tasks to complete (in priority order) 1, Replace the libc code with TRE code. Should be quite straightforward given that TRE is quite mature and has good POSIX-compliance. At the moment I'm already working on this. So far, I've discovered two differences from our current implementation - It accepts leaving at the minimum repetition count in curly brackets and inferres 0 instead of them. E.g. {,8} is equivalent to {0,8} and {,} is equivalent to {0,}. I think it won't be a problem because it is more permissive than our current implementation so the supported expressions are a superset of the current ones not a subset. - It doesn't provide the REG_STARTEND macro, which is our non-POSIX extension. Still, it is useful and easy to implement so it is not a problem either. Now I'm working on this little feature and on building a libc with TRE. After that I'll publish a patch for testing and will also ask portmgr to run it on the cluster. 2, Optimizations for matching with a fixed pattern heuristic Fixed string matching is algorithmically much cheaper than regex. Some time ago the GNU grep author was so kind to me that he provided lots of valuable comments while GNU grep is fast and one of those was this heuristical matching. You can see that mail here: http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019358.html Although TRE has some alternative interfaces to match with an explicitly specified length instead of using NUL-terminated strings, the performance has to be measured and if it's not efficient enough, I plan to implement such optimizations. I don't want to put them in BSD grep directly because I find such optiizations valuable for other regex users. First, I was thinking of putting it into TRE but now I consider a better solution building a small library, libregexutils or such. It would decouple this optimization from the vendor code, making it easier to deal with future updates and in case for some reason in the far future we decide again to replace our regex library, this would still work on top of another one. 3, Adding support for GNU-specific permissive regex syntax GNU grep accepts regexes that are invalid in POSIX, like [a|]. This is necessary for grep and diff in the base system. If we don't have them we can never trow out the GNU regex implementation. However, we should not make them default, so I'm thinking of implementing them somewhere else, e.g. in the previously mentioned library. So far, these are the plans, please comments if you have something in your mind about it. Gabor
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4DC7356C.20905>