From owner-freebsd-current@FreeBSD.ORG Mon Aug 23 12:03:01 2010 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6081710656A7; Mon, 23 Aug 2010 12:03:01 +0000 (UTC) (envelope-from chrisk-freebsd@list.mightyreason.com) Received: from fallback-in2.mxes.net (fallback-out2.mxes.net [216.86.168.191]) by mx1.freebsd.org (Postfix) with ESMTP id 3868C8FC1B; Mon, 23 Aug 2010 12:03:01 +0000 (UTC) Received: from mxout-07.mxes.net (mxout-07.mxes.net [216.86.168.182]) by fallback-in1.mxes.net (Postfix) with ESMTP id 8D6542FDBF2; Mon, 23 Aug 2010 07:51:43 -0400 (EDT) Received: from chris-kuklewiczs-macbook-pro.local (unknown [137.195.54.21]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by smtp.mxes.net (Postfix) with ESMTPSA id 2E03B22E247; Mon, 23 Aug 2010 07:51:41 -0400 (EDT) Message-ID: <4C7260CC.9070407@list.mightyreason.com> Date: Mon, 23 Aug 2010 12:51:40 +0100 From: chrisk-freebsd@list.mightyreason.com User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.8) Gecko/20100802 Thunderbird/3.1.2 MIME-Version: 1.0 To: freebsd-current@freebsd.org, gabor@FreeBSD.org X-Enigmail-Version: 1.1.1 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Mailman-Approved-At: Mon, 23 Aug 2010 12:26:45 +0000 Cc: Subject: grep and Regular expression correctness X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 23 Aug 2010 12:03:01 -0000 I saw the discussion on this list with the subject "why GNU grep is fast" and I thought I would contribute to a discussion of the correctness of the POSIX regular expressions. Gabor wrote: >> I believe Gabor is considering TRE for a good replacement regex library. > Yes. Oniguruma is slow, Google RE2 only supports Perl and fgrep syntax > but not standard regex and Plan 9 implementation iirc only supports > fgrep syntax and Unicode but not wchar_t in general. I specifically need to mention that the ideas used in the TRE algorithm can work, but that libtre is very buggy (see footnote [1]). The hardest part of POSIX regular expressions is in picking out the correct captured subexpressions. This makes programs like "sed" vulnerable to the bad regex.h engine that comes with the operating system. Luckily grep does not need the captured subexpressions, and thus does not need the complexity that comes from the ideas behind TRE. I use OS X which inherits an older freebsd suite of tools with bugs like this: > echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)(aba|b|ab)(Y)(aba|b|ab)*(Z))/[\1] => [\2][\3][\4][\5][\6][\7]/' [XababaYababaZ] => [X][ab][aba][Y][b][Z] Where the [b] between [Y] and [Z] is completely (insanely) wrong. It cannot even get the leftmost-longest correct: > echo "ab" | sed -E 's/(()|.)(b)/[\1][\3]/' a[][b] Where the answer should have been the same as > echo "ab" | sed -E 's/(.|())(b)/[\1][\3]/' [a][b] I have no idea if these bugs are still present in freebsd-current. Cheers, Chris Kuklewicz [1] libtre has problems such as this (from Haskell's regex-tre wrapper) >> "searchme" =~ "s(()|^)e" :: Array Int (MatchOffset,MatchLength) > array (0,2) [(0,(0,2)),(1,(1,0)),(2,(1,0))] The above looks sane but re-ordering the alternative causes the match to fail: >> "searchme" =~ "s(^|())e" :: Array Int (MatchOffset,MatchLength) > array (1,0) [] And the problems with ^ and $ are not the only ones: >> "searchme" =~ "((s)|(e)|(a))*" :: Array Int (MatchOffset,MatchLength) > array (0,4) [(0,(0,3)),(1,(2,1)),(2,(-1,0)),(3,(-1,0)),(4,(2,1))] The above is correct, but this is not: >> "searchme" =~ "((s)|(e)|())*" :: Array Int (MatchOffset,MatchLength) > array (0,4) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(1,1)),(4,(2,0))]