From owner-freebsd-arch@FreeBSD.ORG Mon Nov 22 10:48:43 2004
From: Tim Robbins
To: Sean Chittenden
cc: Dag-Erling Smørgrav
cc: freebsd-arch@freebsd.org
Subject: Re: libregex library
Date: Mon, 22 Nov 2004 21:48:21 +1100
Message-Id: <1101120501.4321.24.camel@starshine.robbins.dropbear.id.au>
In-Reply-To: <01E8B7B2-3BE8-11D9-905D-000A95C705DC@chittenden.org>
References: <16795.57534.19299.407779@piglet.timing.com> <01E8B7B2-3BE8-11D9-905D-000A95C705DC@chittenden.org>
List-Id: Discussion related to FreeBSD architecture

On Sun, 2004-11-21 at 10:06 -0800, Sean Chittenden wrote:
> >> Has there been any thought given to moving to the modified Henry
> >> Spencer regex library used in NetBSD & OpenBSD's libc?
> >
> > des@dwp ~% head -3 /usr/src/lib/libc/regex/COPYRIGHT
> > Copyright 1992, 1993, 1994 Henry Spencer.  All rights reserved.
> > This software is not subject to any license of the American Telephone
> > and Telegraph Company or of the Regents of the University of
> > California.
>
> I think maybe what Ben was referring to was that Spencer has released
> an updated version of his regexp library that doesn't penalize wide
> character locales.  I believe our current one performs terribly on
> everything but one byte character sets, whereas the newer Spencer
> library performs as well as one could hope with wide characters.  The
> PostgreSQL group did some testing and found Spencers library to be the
> fastest wide character regexp engine while still maintaining very good
> levels of performance for single byte character sets.  You'll have to
> check the PostgreSQL archives for details: it's been two years since
> that change was committed to their tree.  -sc

I think you'd be surprised at how poorly Henry Spencer's new code
performs in all but the most contrived test cases, regardless of locale.
You'll find that it performs especially poorly in multibyte locales
because the matcher itself does not work directly with multibyte
characters. Instead, the strings must first be entirely converted to
wide characters, which means reading every single input byte, calling
mbrtowc() on it, then storing the results in temporary scratch space,
even if the characters don't participate in the match at all (e.g. all
characters but the first when matching against patterns like "^x").

The FreeBSD 5 regex code only performs the conversion when necessary,
and can often reject impossible matches without performing a single
conversion in single-byte and UTF-8 locales. (This is assuming your
input strings are given as multibyte character strings, as is common in
UNIX, not wide character strings, as may be common in PostgreSQL.)

Tim
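
To make the cost difference concrete, here is a minimal sketch in C of
the two strategies for a pattern like "^x". It is neither the FreeBSD 5
code nor Henry Spencer's code: the function names match_first_eager()
and match_first_lazy() and the nconv counter are invented for
illustration, and the "pattern" is reduced to a single anchored
character. Only mbrtowc() is a real libc interface here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical eager strategy: convert the entire multibyte string to
 * wide characters in scratch space, then look at the first one.
 * Returns 1 on match, 0 on no match, -1 on conversion or allocation
 * failure.  *nconv receives the number of mbrtowc() calls made.
 */
int
match_first_eager(const char *s, wchar_t want, size_t *nconv)
{
	assert(s != NULL && nconv != NULL);

	size_t len = strlen(s), off = 0, n = 0;
	wchar_t *buf = malloc((len + 1) * sizeof(*buf));
	mbstate_t st;
	int result;

	*nconv = 0;
	if (buf == NULL)
		return (-1);
	memset(&st, 0, sizeof(st));

	/* Every character is converted, whether or not it can matter. */
	while (off < len) {
		size_t r = mbrtowc(&buf[n], s + off, len - off, &st);
		(*nconv)++;
		if (r == (size_t)-1 || r == (size_t)-2) {
			free(buf);
			return (-1);
		}
		if (r == 0)
			break;
		off += r;
		n++;
	}
	result = (n > 0 && buf[0] == want);
	free(buf);
	return (result);
}

/*
 * Hypothetical lazy strategy: convert only the one character the
 * anchored pattern can possibly match, and stop.
 */
int
match_first_lazy(const char *s, wchar_t want, size_t *nconv)
{
	assert(s != NULL && nconv != NULL);

	size_t len = strlen(s);
	wchar_t wc;
	mbstate_t st;

	*nconv = 0;
	if (len == 0)
		return (0);
	memset(&st, 0, sizeof(st));

	size_t r = mbrtowc(&wc, s, len, &st);
	(*nconv)++;
	if (r == (size_t)-1 || r == (size_t)-2)
		return (-1);
	return (wc == want);
}
```

With the ASCII input "xyzzy" and want set to L'x', both functions
report a match, but the eager version makes five mbrtowc() calls and
the lazy version makes one. For non-ASCII input the caller would need
setlocale(LC_CTYPE, "") first so that mbrtowc() uses the user's
encoding. The sketch does not attempt the zero-conversion rejection
mentioned above.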