From owner-freebsd-current@FreeBSD.ORG  Mon Aug 23 12:03:01 2010
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 6081710656A7;
	Mon, 23 Aug 2010 12:03:01 +0000 (UTC)
	(envelope-from chrisk-freebsd@list.mightyreason.com)
Received: from fallback-in2.mxes.net (fallback-out2.mxes.net [216.86.168.191])
	by mx1.freebsd.org (Postfix) with ESMTP id 3868C8FC1B;
	Mon, 23 Aug 2010 12:03:01 +0000 (UTC)
Received: from mxout-07.mxes.net (mxout-07.mxes.net [216.86.168.182])
	by fallback-in1.mxes.net (Postfix) with ESMTP id 8D6542FDBF2;
	Mon, 23 Aug 2010 07:51:43 -0400 (EDT)
Received: from chris-kuklewiczs-macbook-pro.local (unknown [137.195.54.21])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by smtp.mxes.net (Postfix) with ESMTPSA id 2E03B22E247;
	Mon, 23 Aug 2010 07:51:41 -0400 (EDT)
Message-ID: <4C7260CC.9070407@list.mightyreason.com>
Date: Mon, 23 Aug 2010 12:51:40 +0100
From: chrisk-freebsd@list.mightyreason.com
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US;
	rv:1.9.2.8) Gecko/20100802 Thunderbird/3.1.2
MIME-Version: 1.0
To: freebsd-current@freebsd.org, gabor@FreeBSD.org
X-Enigmail-Version: 1.1.1
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-Mailman-Approved-At: Mon, 23 Aug 2010 12:26:45 +0000
Cc: 
Subject: grep and Regular expression correctness
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 23 Aug 2010 12:03:01 -0000

I saw the discussion on this list with the subject "why GNU grep is
fast" and I thought I would contribute to a discussion of the
correctness of the POSIX regular expressions.

Gabor wrote:
>> I believe Gabor is considering TRE for a good replacement regex library.
> Yes. Oniguruma is slow, Google RE2 only supports Perl and fgrep syntax 
> but not standard regex and Plan 9 implementation iirc only supports 
> fgrep syntax and Unicode but not wchar_t in general.

I specifically need to mention that the ideas used in the TRE algorithm
can work, but that libtre is very buggy (see footnote [1]).

The hardest part of POSIX regular expressions is in picking out the
correct captured subexpressions.  This makes programs like "sed"
vulnerable to the bad regex.h engine that comes with the operating system.

Luckily grep does not need the captured subexpressions, and thus does
not need the complexity that comes from the ideas behind TRE.

I use OS X which inherits an older freebsd suite of tools with bugs like
this:

> echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)(aba|b|ab)(Y)(aba|b|ab)*(Z))/[\1] => [\2][\3][\4][\5][\6][\7]/'

[XababaYababaZ] => [X][ab][aba][Y][b][Z]

Where the [b] between [Y] and [Z] is completely (insanely) wrong.
It cannot even get the leftmost-longest correct:

> echo "ab" | sed -E 's/(()|.)(b)/[\1][\3]/'
a[][b]

Where the answer should have been the same as

> echo "ab" | sed -E 's/(.|())(b)/[\1][\3]/'
[a][b]

I have no idea if these bugs are still present in freebsd-current.

Cheers,
  Chris Kuklewicz


[1] libtre has problems such as this (from Haskell's regex-tre wrapper)
>> "searchme" =~ "s(()|^)e" :: Array Int (MatchOffset,MatchLength)
> array (0,2) [(0,(0,2)),(1,(1,0)),(2,(1,0))]

The above looks sane but re-ordering the alternative causes the match to
fail:

>> "searchme" =~ "s(^|())e" :: Array Int (MatchOffset,MatchLength)
> array (1,0) []

And the problems with ^ and $ are not the only ones:

>> "searchme" =~ "((s)|(e)|(a))*" :: Array Int (MatchOffset,MatchLength)
> array (0,4) [(0,(0,3)),(1,(2,1)),(2,(-1,0)),(3,(-1,0)),(4,(2,1))]

The above is correct, but this is not:

>> "searchme" =~ "((s)|(e)|())*" :: Array Int (MatchOffset,MatchLength)
> array (0,4) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(1,1)),(4,(2,0))]