Date: Wed, 25 Sep 2024 13:30:34 +0000
From: bugzilla-noreply@freebsd.org
To: standards@FreeBSD.org
Subject: [Bug 281710] RegEXP bug in bracket expression [^...] - sed(1), grep(1), re_format(7)
Message-ID: <bug-281710-99@https.bugs.freebsd.org/bugzilla/>
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=281710

            Bug ID: 281710
           Summary: RegEXP bug in bracket expression [^...] - sed(1),
                    grep(1), re_format(7)
           Product: Base System
           Version: 14.1-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: standards
          Assignee: standards@FreeBSD.org
          Reporter: erichanskrs@gmail.com

It looks like there's a bug in FreeBSD's sed(1), grep(1), re_format(7),
regarding accented characters and their use in a bracket expression [^...]
in regular expressions (modern REs as well as basic REs).

-- Short examples

Command lines 202, 203 and 207 show unexpected behaviour.

[200] # echo '9a' | /usr/bin/sed -En 's/([^a])(a)/-\1-\2-/p'
-9-a-
[201] # echo '9a' | /usr/bin/sed -n 's/\([^a]\)\(a\)/-\1-\2-/p'
-9-a-
[202] # echo '9â' | /usr/bin/sed -n 's/\([^â]\)\(â\)/-\1-\2-/p'   # <--
[203] # echo '9â' | /usr/bin/sed -En 's/([^â])(â)/-\1-\2-/p'   # <--
[204] # echo '9â' | /usr/local/bin/gsed -En 's/([^â])(â)/-\1-\2-/p'
-9-â-
[205] # echo 'ââ' | /usr/bin/sed -En 's/([â])(â)/-\1-\2-/p'
-â-â-
[206] # echo 'ââ' | /usr/local/bin/gsed -En 's/([â])(â)/-\1-\2-/p'
-â-â-
[207] # echo '9â' | /usr/bin/grep -E '[^â]â'   # <--
[208] #

Same results with characters like 'ç' and 'é'.
Reported in forum thread (see link below) Unicode characters.

-- Reference

FreeBSD forum link:
https://forums.freebsd.org/threads/bug-in-regexp-sed-1-grep-1-and-re_format-7.95088/

re_format(7):
"
DESCRIPTION
[...]
     A bracket expression is a list of characters enclosed in `[]'.  It
     normally matches any single character from the list (but see below).
     If the list begins with `^', it matches any single character (but see
     below) not from the rest of the list.
"

As FreeBSD intends/tries to conform to POSIX, likewise:
https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap09.html#tag_09_03_05
"
3. A non-matching list expression begins with a <circumflex> ('^'), and the
matching behavior shall be the logical inverse of the corresponding matching
list expression (the same bracket expression but without the leading
<circumflex>). For example, since the RE "[abc]" only matches 'a', 'b', or
'c', it follows that "[^abc]" is an RE that matches any character except
'a', 'b', or 'c'. It is unspecified whether a non-matching list expression
matches a multi-character collating element that is not matched by any of
the expressions. The <circumflex> shall have this special meaning only when
it occurs first in the list, immediately following the
<left-square-bracket>.
"

-- Context of my OS and programs:

[100] # uname -a
FreeBSD q210 14.1-RELEASE-p5 FreeBSD 14.1-RELEASE-p5 GENERIC amd64
[101] # pkg which /usr/local/bin/ggrep
/usr/local/bin/ggrep was installed by package gnugrep-3.11
[102] # pkg which /usr/local/bin/gsed
/usr/local/bin/gsed was installed by package gsed-4.9
[103] # locale
LANG=C.UTF-8
LC_CTYPE="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_TIME="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_ALL=

-- 
You are receiving this mail because:
You are the assignee for the bug.
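The failing command lines in the report (202, 203 and 207) share one expectation: a negated bracket expression containing 'â' should still match the '9' that precedes 'â'. A minimal sketch of that check as a POSIX shell function follows. The function name check_negated_bracket and its pass/fail output are illustrative and not part of the original report; the command under test is assumed to accept the same -E and -n options as the sed invocations in the report.

```shell
#!/bin/sh
# Sketch only: checks whether a sed-compatible command treats [^â] as the
# logical inverse of [â], as in command line 203 of the report.
# check_negated_bracket is a hypothetical helper, not an existing tool.

check_negated_bracket() {
    # $1: sed-compatible command to test (a path or a command name)
    a_circ=$(printf '\303\242')          # UTF-8 bytes C3 A2, i.e. 'â'
    expected="-9-${a_circ}-"
    # Same substitution as in the report: ([^â])(â) -> -\1-\2-
    actual=$(printf '9%s\n' "$a_circ" |
        "$1" -En "s/([^${a_circ}])(${a_circ})/-\\1-\\2-/p")
    if [ "$actual" = "$expected" ]; then
        echo pass
    else
        echo fail
    fi
}

# Test every command given on the command line, one result per line.
for cmd in "$@"; do
    printf '%s: %s\n' "$cmd" "$(check_negated_bracket "$cmd")"
done
```

Saved under a name such as check.sh (hypothetical), it could be run as `sh check.sh /usr/bin/sed /usr/local/bin/gsed`. Going by command lines 203 and 204 and the C.UTF-8 locale shown in the report, one would expect "fail" for the first command and "pass" for the second on the affected system.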