From owner-freebsd-current@FreeBSD.ORG  Mon Aug 23 13:42:59 2010
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 4AD0A1065670;
	Mon, 23 Aug 2010 13:42:59 +0000 (UTC)
	(envelope-from chrisk-freebsd@list.mightyreason.com)
Received: from mxout-07.mxes.net (mxout-07.mxes.net [216.86.168.182])
	by mx1.freebsd.org (Postfix) with ESMTP id 1DC9E8FC12;
	Mon, 23 Aug 2010 13:42:58 +0000 (UTC)
Received: from chris-kuklewiczs-macbook-pro.local (unknown [137.195.54.21])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by smtp.mxes.net (Postfix) with ESMTPSA id 9B16322E1F3;
	Mon, 23 Aug 2010 09:42:57 -0400 (EDT)
Message-ID: <4C727AE0.8090709@list.mightyreason.com>
Date: Mon, 23 Aug 2010 14:42:56 +0100
From: chrisk-freebsd@list.mightyreason.com
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US;
	rv:1.9.2.8) Gecko/20100802 Thunderbird/3.1.2
MIME-Version: 1.0
To: Bob Bishop <rb@gid.co.uk>
References: <4C7260CC.9070407@list.mightyreason.com>
	<1179D92F-9AB7-43A4-8943-7DA58F5E234D@gid.co.uk>
In-Reply-To: <1179D92F-9AB7-43A4-8943-7DA58F5E234D@gid.co.uk>
X-Enigmail-Version: 1.1.1
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-Mailman-Approved-At: Mon, 23 Aug 2010 13:45:53 +0000
Cc: freebsd-current@freebsd.org, gabor@freebsd.org
Subject: Re: grep and Regular expression correctness
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 23 Aug 2010 13:42:59 -0000

 On 23/08/2010 13:45, Bob Bishop wrote:
> On 23 Aug 2010, at 12:51, chrisk-freebsd@list.mightyreason.com wrote:
>> [...]The hardest part of POSIX regular expressions is in picking out the
>> correct captured subexpressions.  This makes programs like "sed"
>> vulnerable to the bad regex.h engine that comes with the operating system.
>>
>> Luckily grep does not need the captured subexpressions, and thus does
>> not need the complexity that comes from the ideas behind TRE. [etc]
> I think grep does potentially need the captured subexpressions, for eg:
>
> \([abc]\)99\1
>
> matching eg b99b

That would be "basic POSIX" instead of "extended POSIX" regular
expressions.  Making basic regular expressions correct is something I
have never attempted.

Correctness is what I would like to emphasize.  GNU grep defines the
regex.h header for POSIX regular expressions but does not try to deliver
the correct POSIX answer.  Getting the wrong answer quickly is not a
virtue as debugging prematurely optimized code is too hard.

http://www2.research.att.com/~gsf/testregex/re-interpretation.html has
the best explanation of basic and extended POSIX regular expressions. 
And last I checked the AT&T code was nearly correct but exponentially slow.

I hope that if "grep" in *BSD and thus OSX gets worked on that it gets
more correct, not less.

I hope that someday the regex.h engine in *BSD and thus OSX gets fixed,
but I am not holding my breath for that.

All I have written is a correct (and efficient) "extended POSIX"
matching engine in Haskell and OCaml.

Cheers,
  Chris Kuklewicz