To: Gabor Kovesdan
In-reply-to: Your message of "Mon, 09 May 2011 02:37:10 BST." <4DC74546.1060902@kovesdan.org>
References: <4DC7356C.20905@kovesdan.org> <20110509011709.5455CB827@mail.bitblocks.com> <4DC74546.1060902@kovesdan.org>
Date: Sun, 08 May 2011 18:49:38 -0700
From: Bakul Shah
Message-Id: <20110509014938.EE292B827@mail.bitblocks.com>
Cc: "Pedro F. Giffuni", hackers@FreeBSD.org, Brooks Davis
Subject: Re: [RFC] Replacing our regex implementation
List-Id: Technical Discussions relating to FreeBSD

On Mon, 09 May 2011 02:37:10 BST Gabor Kovesdan wrote:
> On 09-05-2011 02:17, Bakul Shah wrote:
> > As per the following URLs re2 is much faster than TRE (on the
> > benchmarks they ran):
> >
> > http://lh3lh3.users.sourceforge.net/reb.shtml
> > http://sljit.sourceforge.net/regex_perf.html
> >
> > re2 is in C++ & has a PCRE API, while TRE is in C & has a
> > POSIX API. Both have BSD copyright. Is it worth considering
> > making re2 posix compliant?
> Is it wchar-clean and is it actively maintained?
> C++ is quite anticipated for the base system and I'm not very skilled
> in it so atm I couldn't promise to use re2 instead of TRE. And anyway,
> can C++ go into libc? According to POSIX, the regex code has to be
> there. But let's see what others say... If we happen to use re2 later,
> my extensions that I talked about in points 2 and 3 would still be
> useful.
>
> Anyway, according to some earlier vague measures, TRE seems to be slower
> in small matching tasks but scales well. These tests seem to compare
> only short runs with the same regex. It should be seen how they compare
> e.g. if you grep the whole ports tree with the same pattern. If the
> matching scales well once the pattern is compiled, that's more important
> than the overall result for such short tasks, imho.

re2 is certainly maintained. I don't know about wchar cleanliness. See

    http://code.google.com/p/re2/

Also check out Russ Cox's excellent articles on implementing it

    http://swtch.com/~rsc/regexp/

and this:

    http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html

C++ may be an impediment for it to go into libc but one can certainly
put a C interface on a C++ library.
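
To make the last point concrete, here is a rough sketch of a C interface
over a C++ regex engine. It assumes std::regex as a stand-in for the C++
library (this is not re2's API), and the re_* names are hypothetical. The
idea is only this: hide the C++ object behind an opaque handle, give the
entry points C linkage, and catch every exception before it can cross into
a C caller.

```cpp
// Sketch only: std::regex stands in for the C++ engine; it is NOT re2.
// The re_handle type and the re_* functions are hypothetical names.
#include <regex>

extern "C" {

// Opaque handle: C callers only ever hold a pointer to this type.
typedef struct re_handle re_handle;

// Returns NULL if the pattern is missing or fails to compile.
re_handle *re_compile(const char *pattern);

// Returns 1 on match, 0 on no match, -1 on bad arguments or engine error.
int re_search(const re_handle *h, const char *text);

// Releases a handle obtained from re_compile (NULL is allowed).
void re_free(re_handle *h);

}  // extern "C"

// The definition lives on the C++ side only.
struct re_handle {
    std::regex compiled;
};

extern "C" re_handle *re_compile(const char *pattern) {
    if (pattern == nullptr) return nullptr;
    try {
        // POSIX extended syntax, since the thread is about a POSIX API.
        return new re_handle{std::regex(pattern, std::regex::extended)};
    } catch (...) {
        // Exceptions must not propagate into C code.
        return nullptr;
    }
}

extern "C" int re_search(const re_handle *h, const char *text) {
    if (h == nullptr || text == nullptr) return -1;
    try {
        return std::regex_search(text, h->compiled) ? 1 : 0;
    } catch (...) {
        return -1;
    }
}

extern "C" void re_free(re_handle *h) {
    delete h;
}
```

A C program would see only the three declarations (in a header without the
struct definition) and link against the C++ object plus the C++ runtime.
That runtime dependency is the part that would still be awkward for libc.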