Date:      Fri, 27 Sep 2024 09:17:27 +0000
From:      bugzilla-noreply@freebsd.org
To:        standards@FreeBSD.org
Subject:   [Bug 281710] RegEXP bug in bracket expression [^...] - sed(1), grep(1), re_format(7)
Message-ID:  <bug-281710-99-63oKTUfjgO@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-281710-99@https.bugs.freebsd.org/bugzilla/>
References:  <bug-281710-99@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=281710

--- Comment #13 from Eric <erichanskrs@gmail.com> ---
(in reply to Kyle Evans comment #10)
(in reply to Olivier Certner comment #12)

Based on the commit comments in
https://cgit.freebsd.org/src/commit/?id=8f7ed58a15556bf567ff876e1999e4fe4d684e1d
however, I see that I may have underestimated the real impact this can have on
string processing in a pervasive UTF-8 world.

I don't have a test setup available at the moment to run the examples below on
-CURRENT or -STABLE-13 or 14.

-- Examples
[1] # cat names
cedric
étienne
égards
françois
[2] # cat names | grep '[é]'
étienne
égards
[3] # cat names | grep '[éç]'
étienne
égards
françois
[4] # cat names | grep '[éi]'      # <-- error: "égards" is missing
cedric
étienne
françois
[5] # cat names | grep -i '[éi]'   # <-- case-insensitive "avoids" singleton
cedric
étienne
égards
françois
[6] # cat names | grep -E '[é]|[i]' # <-- splitting into two bracket expressions avoids the erroneous code
cedric
étienne
égards
françois
[7] #
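
As an independent cross-check of what each bracket expression above should select, here is a minimal sketch in Python. It uses Python's own `re` module, not libc's regcomp(3), so it says nothing about FreeBSD's behaviour; it only computes the expected line sets for the same data. The helper name `grep` and the `cases` table are made up for this illustration.

```python
import re

# Contents of the "names" file from example [1].
# \u00e9 is "é" and \u00e7 is "ç" (precomposed code points).
names = ["cedric", "\u00e9tienne", "\u00e9gards", "fran\u00e7ois"]


def grep(pattern, lines, ignore_case=False):
    """Return the lines in which pattern matches anywhere, as grep(1) does."""
    flags = re.IGNORECASE if ignore_case else 0
    rx = re.compile(pattern, flags)
    return [line for line in lines if rx.search(line)]


# Hypothetical table mapping the example labels to their patterns.
cases = {
    "[2]": ("[\u00e9]", False),
    "[3]": ("[\u00e9\u00e7]", False),
    "[4]": ("[\u00e9i]", False),
    "[5]": ("[\u00e9i]", True),
    "[6]": ("[\u00e9]|[i]", False),
}

for label, (pattern, ignore_case) in cases.items():
    matched = grep(pattern, names, ignore_case)
    print(f"{label}: {len(matched)} of {len(names)} lines expected")
```

Under these assumptions [4], [5] and [6] should all select every line, since each name contains either "é" or "i"; only the output of [4] above deviates.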

I think such cases have likely been overlooked, misjudged as correctly
processed, or not investigated further.

Fast & correct (UTF-8) string processing is difficult, and this made me have
another look at singleton's char processing.
Viewed from a distance (and assuming only the first test operation is
evaluated in each chain of "shortcut" ||-operands), the distance to the prize
(i.e. line 1626) in
https://github.com/freebsd/freebsd-src/blob/main/lib/libc/regex/regcomp.c#L1626
as compared to
https://github.com/freebsd/freebsd-src/blob/releng/14.1/lib/libc/regex/regcomp.c#L1600
has gone up considerably:
singleton-error:      2 tests
singleton-modified:   6 tests

Are the added complexity and extra processing steps of a separate singleton
function for a bracket expression still justified?
Case-insensitive bracket expressions don't benefit, as can be painfully
observed in the examples above; for them the singleton check just adds a small
amount of additional time.
I wonder whether comparative testing of singleton processing versus no
singleton processing yields justifiable gains (yes, that is a subjective
adjective).

-- 
You are receiving this mail because:
You are on the CC list for the bug.