Date: Mon, 23 Aug 2010 12:51:40 +0100 From: chrisk-freebsd@list.mightyreason.com To: freebsd-current@freebsd.org, gabor@FreeBSD.org Subject: grep and Regular expression correctness Message-ID: <4C7260CC.9070407@list.mightyreason.com>
next in thread | raw e-mail | index | archive | help
I saw the discussion on this list with the subject "why GNU grep is fast" and I thought I would contribute to a discussion of the correctness of the POSIX regular expressions. Gabor wrote: >> I believe Gabor is considering TRE for a good replacement regex library. > Yes. Oniguruma is slow, Google RE2 only supports Perl and fgrep syntax > but not standard regex and Plan 9 implementation iirc only supports > fgrep syntax and Unicode but not wchar_t in general. I specifically need to mention that the ideas used in the TRE algorithm can work, but that libtre is very buggy (see footnote [1]). The hardest part of POSIX regular expressions is in picking out the correct captured subexpressions. This makes programs like "sed" vulnerable to the bad regex.h engine that comes with the operating system. Luckily grep does not need the captured subexpressions, and thus does not need the complexity that comes from the ideas behind TRE. I use OS X which inherits an older freebsd suite of tools with bugs like this: > echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)(aba|b|ab)(Y)(aba|b|ab)*(Z))/[\1] => [\2][\3][\4][\5][\6][\7]/' [XababaYababaZ] => [X][ab][aba][Y][b][Z] Where the [b] between [Y] and [Z] is completely (insanely) wrong. It cannot even get the leftmost-longest correct: > echo "ab" | sed -E 's/(()|.)(b)/[\1][\3]/' a[][b] Where the answer should have been the same as > echo "ab" | sed -E 's/(.|())(b)/[\1][\3]/' [a][b] I have no idea if these bugs are still present in freebsd-current. Cheers, Chris Kuklewicz [1] libtre has problems such as this (from Haskell's regex-tre wrapper) >> "searchme" =~ "s(()|^)e" :: Array Int (MatchOffset,MatchLength) > array (0,2) [(0,(0,2)),(1,(1,0)),(2,(1,0))] The above looks sane but re-ordering the alternative causes the match to fail: >> "searchme" =~ "s(^|())e" :: Array Int (MatchOffset,MatchLength) > array (1,0) [] And the problems with ^ and $ are not the only ones: >> "searchme" =~ "((s)|(e)|(a))*" :: Array Int (MatchOffset,MatchLength) > array (0,4) [(0,(0,3)),(1,(2,1)),(2,(-1,0)),(3,(-1,0)),(4,(2,1))] The above is correct, but this is not: >> "searchme" =~ "((s)|(e)|())*" :: Array Int (MatchOffset,MatchLength) > array (0,4) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(1,1)),(4,(2,0))]
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4C7260CC.9070407>