Date: Fri, 27 Sep 2024 09:17:27 +0000
From: bugzilla-noreply@freebsd.org
To: standards@FreeBSD.org
Subject: [Bug 281710] RegEXP bug in bracket expression [^...] - sed(1), grep(1), re_format(7)
Message-ID: <bug-281710-99-63oKTUfjgO@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-281710-99@https.bugs.freebsd.org/bugzilla/>
References: <bug-281710-99@https.bugs.freebsd.org/bugzilla/>
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=281710

--- Comment #13 from Eric <erichanskrs@gmail.com> ---
(in reply to Kyle Evans comment #10)
(in reply to Olivier Certner comment #12)

Based on the commit comments
https://cgit.freebsd.org/src/commit/?id=8f7ed58a15556bf567ff876e1999e4fe4d684e1d
however, I see that I may have underestimated the possible veracious impact on
string processing in a pervasive UTF-8 world. I haven't a test setup available
at the moment to test the examples below on -CURRENT or -STABLE-13 or 14.

-- Examples
[1] # cat names
cedric
étienne
égards
françois
[2] # cat names | grep '[é]'
étienne
égards
[3] # cat names | grep '[éç]'
étienne
égards
françois
[4] # cat names | grep '[éi]'  # <-- error
cedric
étienne
françois
[5] # cat names | grep -i '[éi]'  # <-- case-insensitive "avoids" singleton
cedric
étienne
égards
françois
[6] # cat names | grep -E '[é]|[i]'  # <-- splitting in two bracket expressions avoids erroneous code
cedric
étienne
égards
françois
[7] #

I think such cases likely will have been overlooked, misjudged as correctly
processed or not investigated further.

Fast & correct (UTF-8) string processing is difficult and this made me have
another look at singleton's char processing.

Viewing from a distance (and assuming one test operation (the first only) in
the string of "shortcut" ||-operands), the distance to the prize (i.e. line
1626) in
https://github.com/freebsd/freebsd-src/blob/main/lib/libc/regex/regcomp.c#L1626
as compared to
https://github.com/freebsd/freebsd-src/blob/releng/14.1/lib/libc/regex/regcomp.c#L1600
has gone up considerably:
singleton-error:    2 tests
singleton-modified: 6 tests

Are the added complexity and extra processing steps of an added singleton
function for a bracket expression still justified?
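For reference, the output one would expect from examples [2] to [6] can be
checked independently of libc's regex code. The following is only a minimal
sketch that uses Python's re module as a stand-in matcher; it does not exercise
regcomp.c at all, and the helper name matching_names is hypothetical. It assumes
the names are decoded as UTF-8 text, so that é and ç are each a single
character.

```python
import re

# The four sample names from example [1].
NAMES = ["cedric", "étienne", "égards", "françois"]


def matching_names(pattern, names=NAMES, flags=0):
    """Return the names containing at least one match, as grep would print them."""
    rx = re.compile(pattern, flags)
    return [name for name in names if rx.search(name)]


if __name__ == "__main__":
    # Bracket expressions taken from examples [2] to [6].
    cases = [
        ("[é]", 0),
        ("[éç]", 0),
        ("[éi]", 0),              # the case reported as wrong in example [4]
        ("[éi]", re.IGNORECASE),  # counterpart of grep -i in example [5]
        ("[é]|[i]", 0),           # counterpart of grep -E in example [6]
    ]
    for pattern, flags in cases:
        print(pattern, matching_names(pattern, flags=flags))
```

With this stand-in, '[éi]' selects all four names, which is what examples [5]
and [6] print and what example [4] fails to print (égards is missing there).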
Case-insensitive bracket expressions don't benefit, as can be painfully observed
in the examples above; they just add a certain small amount of additional time.

I wonder if comparative testing with singleton processing versus without it
yields justifiable gains (yes, that is a subjective adjective).

-- 
You are receiving this mail because:
You are on the CC list for the bug.