Date:      Wed, 25 Sep 2024 13:30:34 +0000
From:      bugzilla-noreply@freebsd.org
To:        standards@FreeBSD.org
Subject:   [Bug 281710] RegEXP bug in bracket expression [^...] - sed(1), grep(1), re_format(7)
Message-ID:  <bug-281710-99@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=281710

            Bug ID: 281710
           Summary: RegEXP bug in bracket expression [^...] - sed(1),
                    grep(1), re_format(7)
           Product: Base System
           Version: 14.1-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: standards
          Assignee: standards@FreeBSD.org
          Reporter: erichanskrs@gmail.com

It looks like there's a bug in FreeBSD's sed(1), grep(1) and re_format(7)
regarding accented characters used in a non-matching bracket expression
[^...] in regular expressions (modern REs as well as basic REs).


-- Short examples
Command lines 202, 203 and 207 show unexpected behaviour: they print nothing,
although a match is expected.
[200] # echo '9a' | /usr/bin/sed        -En 's/([^a])(a)/-\1-\2-/p'
-9-a-
[201] # echo '9a' | /usr/bin/sed        -n 's/\([^a]\)\(a\)/-\1-\2-/p'
-9-a-
[202] # echo '9â' | /usr/bin/sed        -n 's/\([^â]\)\(â\)/-\1-\2-/p' # <--
[203] # echo '9â' | /usr/bin/sed        -En 's/([^â])(â)/-\1-\2-/p'    # <--
[204] # echo '9â' | /usr/local/bin/gsed -En 's/([^â])(â)/-\1-\2-/p'
-9-â-
[205] # echo 'ââ' | /usr/bin/sed        -En 's/([â])(â)/-\1-\2-/p'
-â-â-
[206] # echo 'ââ' | /usr/local/bin/gsed -En 's/([â])(â)/-\1-\2-/p'
-â-â-
[207] # echo '9â' | /usr/bin/grep       -E '[^â]â'                      # <--
[208] #
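
The comparison above can be repeated for other characters and other sed
binaries with a small helper. This is only a sketch: the function name
try_sed and its argument order are made up for illustration and are not part
of the report. It runs the same ERE substitution as command lines 203/204
and prints the result, or nothing when the pattern does not match.

```shell
# Hypothetical helper (not from the report):
#   try_sed SED_BINARY INPUT CHAR
# Runs s/([^CHAR])(CHAR)/-\1-\2-/p on INPUT with the given sed binary.
# CHAR is pasted into the pattern as-is, so it must not be a character
# that is special inside a bracket expression or a sed s command.
try_sed() {
    sed_bin=$1
    input=$2
    ch=$3
    printf '%s\n' "$input" | "$sed_bin" -En "s/([^$ch])($ch)/-\\1-\\2-/p"
}
```

For example, `try_sed /usr/bin/sed '9â' 'â'` and
`try_sed /usr/local/bin/gsed '9â' 'â'` correspond to command lines 203 and
204; an empty result from one and `-9-â-` from the other reproduces the
difference shown above.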

Same results with characters like 'ç' and 'é'.
First reported in a forum thread about Unicode characters (see link below).


-- Reference
FreeBSD forum link:
https://forums.freebsd.org/threads/bug-in-regexp-sed-1-grep-1-and-re_format-7.95088/

re_format(7):
"
DESCRIPTION
   [...]
       A bracket expression is a list of characters enclosed in `[]'.  It nor-
       mally  matches  any single character from the list (but see below).  If
       the list begins with `^', it matches any single character (but see  be-
       low)  not from the rest of the list.
"
As FreeBSD intends to conform to POSIX, the POSIX specification says likewise:
https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap09.html#tag_09_03_05
"
3. A non-matching list expression begins with a <circumflex> ('^'), and the
matching behavior shall be the logical inverse of the corresponding matching
list expression (the same bracket expression but without the leading
<circumflex>). For example, since the RE "[abc]" only matches 'a', 'b', or 'c',
it follows that "[^abc]" is an RE that matches any character except 'a', 'b',
or 'c'. It is unspecified whether a non-matching list expression matches a
multi-character collating element that is not matched by any of the
expressions. The <circumflex> shall have this special meaning only when it
occurs first in the list, immediately following the <left-square-bracket>.
"

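The "logical inverse" requirement quoted above can be checked mechanically:
for a single character x, exactly one of [c] and [^c] should match it. The
following is a minimal sketch of such a check; the function name
inverse_holds and the GREP override variable are assumptions made for
illustration, not something taken from the report or from any tool.

```shell
# Hypothetical helper (not from the report):
#   inverse_holds C X
# Succeeds when exactly one of the EREs ^[C]$ and ^[^C]$ matches the
# one-character line X. Set GREP to test another binary, for example
# GREP=/usr/local/bin/ggrep. C must not be ']' or '^'.
inverse_holds() {
    c=$1
    x=$2
    pos=no
    neg=no
    printf '%s\n' "$x" | ${GREP:-grep} -Eq "^[$c]\$"  && pos=yes
    printf '%s\n' "$x" | ${GREP:-grep} -Eq "^[^$c]\$" && neg=yes
    # The inverse property holds only if the two answers differ.
    [ "$pos" != "$neg" ]
}
```

Usage: `inverse_holds a 9 && echo ok` is expected to print `ok` with any
conforming grep. Running `inverse_holds 'â' 9` in the C.UTF-8 locale is the
case relevant to this report; a failure there would be consistent with
command line 207, where `[^â]` does not match '9'.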

-- Context of my OS and programs:
[100] # uname -a
FreeBSD q210 14.1-RELEASE-p5 FreeBSD 14.1-RELEASE-p5 GENERIC amd64
[101] # pkg which /usr/local/bin/ggrep
/usr/local/bin/ggrep was installed by package gnugrep-3.11
[102] # pkg which /usr/local/bin/gsed
/usr/local/bin/gsed was installed by package gsed-4.9
[103] # locale
LANG=C.UTF-8
LC_CTYPE="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_TIME="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_ALL=

-- 
You are receiving this mail because:
You are the assignee for the bug.


