Date: Tue, 13 Jan 2009 13:50:58 GMT From: Chris Kuklewicz <bugreport-freebsd@mightyreason.com> To: freebsd-gnats-submit@FreeBSD.org Subject: misc/130504: Serious bug in regular expression library (regex) affected sed Message-ID: <200901131350.n0DDowGe025190@www.freebsd.org> Resent-Message-ID: <200901131400.n0DE01Uh063223@freefall.freebsd.org>
next in thread | raw e-mail | index | archive | help
>Number: 130504 >Category: misc >Synopsis: Serious bug in regular expression library (regex) affected sed >Confidential: no >Severity: serious >Priority: medium >Responsible: freebsd-bugs >State: open >Quarter: >Keywords: >Date-Required: >Class: sw-bug >Submitter-Id: current-users >Arrival-Date: Tue Jan 13 14:00:01 UTC 2009 >Closed-Date: >Last-Modified: >Originator: Chris Kuklewicz >Release: FreeBSD 8.0-Current and 6.3-PRERELEASE >Organization: Not applicable >Environment: FreeBSD m-net.arbornet.org 6.3-PRERELEASE FreeBSD 6.3-PRERELEASE #1: Tue Nov 6 16:13:56 EST 2007 root@newmnet.arbornet.org:/usr/src/sys/i386/compile/MNET i386 >Description: I have tracked the source of a bug originally seen on OS X back to the same bug in FreeBSD. The bug is in the libc implementation of regular expressions, used via the API in "regex.h" specifically regexec. Programs which use this library are affected, including "sed" which I will use to illustrate the bug below. The bug effects both the whole text matched and the choice of text reported as matching parenthesized subexpressions. The bug violates the leftmost longest matching rule for POSIX regular expressions. The bug violates the principle that the order of subexpressions around the "|" operator should not affect the whole text matched. A terminal sessions using three sed commands with extended regular expressions: [chrisk@m-net ~]$ echo "ab" | sed -E "s/(()|.)(b)/[&]/" a[b] [chrisk@m-net ~]$ echo "ab" | sed -E "s/(.|())(b)/[&]/" [ab] [chrisk@m-net ~]$ echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)(aba|b|ab)(Y)(aba|b|ab)*(Z))/[\1] => [\2][\3][\4][\5][\6][\7]/' [XababaYababaZ] => [X][ab][aba][Y][b][Z] The first two commands differ only in the order of "()" and "." around the "|" operator. According to the POSIX specification this must not affect the whole match, which is report by "&" in sed's replacement specification. The leftmost longest match is "ab" and is correctly reported by the second command. The first command incorrect does not match starting with the "a" and matches just the "b". This establishes that the bug violates the POSIX specification and can affect a program that only cares about the whole text matched (ignoring captured parenthesized subexpressions). The third command shows an internal inconsistency in the output. The (aba|b|ab)* operator should capture the text matched by the last repetition of the (aba|b|ab) subexpression and report that in \6. In particular \6 should match "aba" just like \3 does. As shown above the FreeBSD output is [XababaYababaZ] => [X][ab][aba][Y][b][Z] The correct output should have been [XababaYababaZ] => [X][ab][aba][Y][abc][Z] Could there be a way to explain the FreeBSD output? Consider the following questions: If the last repetition of (aba|b|ab)* matched \6 to "b" then how is the text between that "b" skipped so that "(Z)" matched \7 against the final "Z" in the text? If the "b" reported by \6 is the last "b" in the text then there is no part of the pattern that can match the next "a". If the "b" reported by \6 is the next-to-last "b" in the text then there is no part of the pattern that can match the preceding "a". Checking the indices returned by regex shows that \6 was matched against the last "b" in the text. Thus reporting \6 as matching "b" is contradictory, there is no consistent interpretation of what regexec returned. This establishes that the bug is can violate not only the POSIX specification but also violate any rational and consistent alternate specification. I found the first example by using Haskell's QuickCheck to randomly generate test cases. The second example was found by using Glenn Fowler's "AT&T Labs Research regex(3) regression tests" from http://www.research.att.com/~gsf/testregex/ I tested FreeBSD 6.3-PRERELEASE thanks to the free shell account at m-net.arbornet.org and FreeBSD 8.0-CURRENT was tested by "methi" on the #freebsd channel at irc.freenode.net >How-To-Repeat: [chrisk@m-net ~]$ echo "ab" | sed -E "s/(()|.)(b)/[&]/" [chrisk@m-net ~]$ echo "ab" | sed -E "s/(.|())(b)/[&]/" [chrisk@m-net ~]$ echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)(aba|b|ab) If there is no bug then the first two both output "[ab]" and the last output "[XababaYababaZ] => [X][ab][aba][Y][abc][Z]". Currently the bug causes the first the output "a[b]" and the last to output "[XababaYababaZ] => [X][ab][aba][Y][b][Z]". >Fix: I do not have the time to be able to fix the regex.h part of libc. >Release-Note: >Audit-Trail: >Unformatted:
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200901131350.n0DDowGe025190>