Date: Mon, 2 Sep 2013 20:52:18 +0300
From: Kimmo Paasiala <kpaasial@gmail.com>
To: Andriy Gapon <avg@freebsd.org>
Cc: FreeBSD Current <freebsd-current@freebsd.org>
Subject: Re: bug with special bracket expressions in regular expressions
Message-ID: <CA%2B7WWSd0=m_4fBxTEoVzj15%2B%2B7az7WviENY6ah=39wM_R9FWPw@mail.gmail.com>
In-Reply-To: <5224C08E.1070404@FreeBSD.org>
References: <5224A693.3000904@FreeBSD.org> <5224C08E.1070404@FreeBSD.org>

On Mon, Sep 2, 2013 at 7:45 PM, Andriy Gapon <avg@freebsd.org> wrote:
> on 02/09/2013 17:54 Andriy Gapon said the following:
>>
>> re_format(7) says:
>>      There are two special cases‡ of bracket expressions: the bracket
>>      expressions ‘[[:<:]]’ and ‘[[:>:]]’ match the null string at the
>>      beginning and end of a word respectively.  A word is defined as a
>>      sequence of word characters which is neither preceded nor followed by
>>      word characters.  A word character is an alnum character (as defined
>>      by ctype(3)) or an underscore.  This is an extension, compatible with
>>      but not specified by IEEE Std 1003.2 (“POSIX.2”), and should be used
>>      with caution in software intended to be portable to other systems.
>>
>> However I observe the following:
>> $ echo "cd0 cd1 xx" | sed 's/cd[0-9][^ ]* *//g'
>> xx
>> $ echo "cd0 cd1 xx" | sed 's/[[:<:]]cd[0-9][^ ]* *//g'
>> cd1 xx
>>
>> In my opinion '[[:<:]]' should not affect how the pattern is matched in
>> this case.
>
> It seems that the code works like this:
> - first it matches "cd0 " and "removes" it
> - then it passes "cd1 xx" for matching with a flag that tells that this is
>   not a real start of the string
> - thus the matching code
>   o knows that this is not a real line start, so it can't match [[:<:]]
>     just for that reason
>   o it does _not_ know what was the character before the start of the given
>     substring, so it can not know if it could match [[:<:]]
>
> So matching fails.
> Not sure if this is an internal problem of regex(3) or a problem of how
> sed(1) uses regex(3).
>
> --
> Andriy Gapon

In my opinion this is a bug. The [[:<:]] operator is said to match the
empty string at the beginning of a word with no mention that the word
has to be at the beginning of the whole string that is matched.

OS X version of sed(1) works differently:

$ echo "cd0 cd1 xx" | sed 's/cd[0-9][^ ]* *//g'
xx
$ echo "cd0 cd1 xx" | sed 's/[[:<:]]cd[0-9][^ ]* *//g'
xx

-Kimmo
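
A note on the mechanism Andriy describes in the quoted text: the following is
a toy Python model of that description, not the actual regex(3) or sed(1)
code. All function names are made up for illustration, and the pattern
cd[0-9][^ ]* * is matched by hand rather than by a regex engine. It contrasts
a substitution loop that hands the matcher only the remaining substring plus a
"not the start of the line" flag with one that hands it the whole line and an
offset, so the character before the match start can be inspected.

```python
def is_word_char(ch):
    # Word character as re_format(7) defines it: alnum or underscore.
    return ch.isalnum() or ch == "_"


def match_body(text, i):
    """Hand-written stand-in for cd[0-9][^ ]* * anchored at index i.

    Returns the end index of the match, or -1 if there is no match.
    """
    if text[i:i + 2] != "cd" or i + 2 >= len(text):
        return -1
    if text[i + 2] not in "0123456789":
        return -1
    j = i + 3
    while j < len(text) and text[j] != " ":
        j += 1
    while j < len(text) and text[j] == " ":
        j += 1
    return j


def delete_all_slice(line):
    """Assumed model of the failing loop: the matcher sees only the rest
    of the line and a flag saying it is not the real line start."""
    out, rest, notbol = [], line, False
    while rest:
        found = None
        for i in range(len(rest)):
            if not is_word_char(rest[i]):
                continue
            if i == 0:
                # Previous character is unknown: refuse when notbol is set.
                at_word_start = not notbol
            else:
                at_word_start = not is_word_char(rest[i - 1])
            if at_word_start:
                end = match_body(rest, i)
                if end != -1:
                    found = (i, end)
                    break
        if found is None:
            out.append(rest)
            break
        out.append(rest[:found[0]])
        rest = rest[found[1]:]
        notbol = True
    return "".join(out)


def delete_all_offset(line):
    """Alternative model: the matcher sees the whole line plus an offset,
    so it can look at the character before the candidate position."""
    out, pos = [], 0
    while pos < len(line):
        found = None
        for i in range(pos, len(line)):
            if is_word_char(line[i]) and (i == 0 or not is_word_char(line[i - 1])):
                end = match_body(line, i)
                if end != -1:
                    found = (i, end)
                    break
        if found is None:
            break
        out.append(line[pos:found[0]])
        pos = found[1]
    out.append(line[pos:])
    return "".join(out)


if __name__ == "__main__":
    print(repr(delete_all_slice("cd0 cd1 xx")))
    print(repr(delete_all_offset("cd0 cd1 xx")))
```

Under these assumptions the first loop leaves "cd1 xx" for the input from this
thread, and the second leaves "xx", which matches the two observed behaviours.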