From owner-freebsd-current@FreeBSD.ORG Mon Aug 23 13:42:59 2010 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4AD0A1065670; Mon, 23 Aug 2010 13:42:59 +0000 (UTC) (envelope-from chrisk-freebsd@list.mightyreason.com) Received: from mxout-07.mxes.net (mxout-07.mxes.net [216.86.168.182]) by mx1.freebsd.org (Postfix) with ESMTP id 1DC9E8FC12; Mon, 23 Aug 2010 13:42:58 +0000 (UTC) Received: from chris-kuklewiczs-macbook-pro.local (unknown [137.195.54.21]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by smtp.mxes.net (Postfix) with ESMTPSA id 9B16322E1F3; Mon, 23 Aug 2010 09:42:57 -0400 (EDT) Message-ID: <4C727AE0.8090709@list.mightyreason.com> Date: Mon, 23 Aug 2010 14:42:56 +0100 From: chrisk-freebsd@list.mightyreason.com User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.8) Gecko/20100802 Thunderbird/3.1.2 MIME-Version: 1.0 To: Bob Bishop References: <4C7260CC.9070407@list.mightyreason.com> <1179D92F-9AB7-43A4-8943-7DA58F5E234D@gid.co.uk> In-Reply-To: <1179D92F-9AB7-43A4-8943-7DA58F5E234D@gid.co.uk> X-Enigmail-Version: 1.1.1 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Mailman-Approved-At: Mon, 23 Aug 2010 13:45:53 +0000 Cc: freebsd-current@freebsd.org, gabor@freebsd.org Subject: Re: grep and Regular expression correctness X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 23 Aug 2010 13:42:59 -0000 On 23/08/2010 13:45, Bob Bishop wrote: > On 23 Aug 2010, at 12:51, chrisk-freebsd@list.mightyreason.com wrote: >> [...]The hardest part of POSIX regular expressions is in picking out the >> correct captured subexpressions. This makes programs like "sed" >> vulnerable to the bad regex.h engine that comes with the operating system. >> >> Luckily grep does not need the captured subexpressions, and thus does >> not need the complexity that comes from the ideas behind TRE. [etc] > I think grep does potentially need the captured subexpressions, for eg: > > \([abc]\)99\1 > > matching eg b99b That would be "basic POSIX" instead of "extended POSIX" regular expressions. Making basic regular expressions correct is something I have never attempted. Correctness is what I would like to emphasize. GNU grep defines the regex.h header for POSIX regular expressions but does not try to deliver the correct POSIX answer. Getting the wrong answer quickly is not a virtue as debugging prematurely optimized code is too hard. http://www2.research.att.com/~gsf/testregex/re-interpretation.html has the best explanation of basic and extended POSIX regular expressions. And last I checked the AT&T code was nearly correct but exponentially slow. I hope that if "grep" in *BSD and thus OSX gets worked on that it gets more correct, not less. I hope that someday the regex.h engine in *BSD and thus OSX gets fixed, but I am not holding my breath for that. All I have written is a correct (and efficient) "extended POSIX" matching engine in Haskell and OCaml. Cheers, Chris Kuklewicz