Date:      Sun, 6 Nov 2016 22:27:29 +0100
From:      Baptiste Daroussin <bapt@FreeBSD.org>
To:        Stefan Bethke <stb@lassitu.de>
Cc:        Greg Rivers <gcr+freebsd-stable@tharned.org>, freebsd-stable@freebsd.org
Subject:   Re: Uppercase RE matching problems in FreeBSD 11
Message-ID:  <20161106212729.z2edg44kg7hc4r2z@ivaldir.etoilebsd.net>
In-Reply-To: <C4BC6673-2E07-45E6-81D6-EB4FF99605A8@lassitu.de>
References:  <alpine.BSF.2.20.1611051912260.2462@flake.tharned.org> <20161106110729.z2px7mzlhcwxvrvu@ivaldir.etoilebsd.net> <29451103-E8DB-4656-A5BB-AEB924A728D6@lassitu.de> <20161106210628.hg3dcpozfjtuo3nt@ivaldir.etoilebsd.net> <C4BC6673-2E07-45E6-81D6-EB4FF99605A8@lassitu.de>

On Sun, Nov 06, 2016 at 10:20:54PM +0100, Stefan Bethke wrote:
> 
> > On 06.11.2016 at 22:06, Baptiste Daroussin <bapt@FreeBSD.org> wrote:
> > 
> > On Sun, Nov 06, 2016 at 09:57:00PM +0100, Stefan Bethke wrote:
> >> 
> >>> On 06.11.2016 at 12:07, Baptiste Daroussin <bapt@FreeBSD.org> wrote:
> >>> 
> >>> On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
> >>>> I happened to run an old script today that uses sed(1) to extract the system
> >>>> boot time from the kern.boottime sysctl MIB. On 11.0 this no longer works as
> >>>> expected:
> >>>> 
> >>>> $ sysctl kern.boottime
> >>>> kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016
> >>>> $ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
> >>>> v  5 16:18:34 2016
> >>>> 
> >>>> sed passes over 'S' and 'N' until it hits 'v', which it considers uppercase
> >>>> apparently. This is with LANG=en_US.UTF-8. If I set LANG=C, it works as
> >>>> expected:
> >>>> 
> >>>> $ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
> >>>> Nov  5 16:18:34 2016
> >>>> 
> >>>> Testing every lowercase character separately gives even more inconsistent
> >>>> results:
> >>>> 
> >>>> $ cat <<! | LANG=en_US.UTF-8 sed -n -e '/^[A-Z]$/p'
> >> 
> >>>> Here sed thinks every lowercase character except for 'a' is uppercase! This
> >>>> differs from the first test where sed did not think 'o' is uppercase. Again,
> >>>> the above behaves as expected with LANG=C.
> >>>> 
> >>>> Does anyone have any insight into this? This is likely to break a lot of
> >>>> existing code.
> >>>> 
> >>> 
> >>> Yes. A-Z only means uppercase in an ASCII-only world. In a Unicode world it
> >>> means AaBb... Z, because there are far more characters than simply A-Z. In
> >>> FreeBSD 11 we have a Unicode collation instead of falling back on
> >>> LC_COLLATE=C, which means ASCII only.
> >>> 
> >>> For regexps, for example, one should use the classes [:upper:] or [:lower:].
> >> 
> >> That is rather surprising.  Is there a normative reference for the treatment of bracket expressions and character classes when using locales other than C and/or encodings like UTF-8?
> > 
> > http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html
> > 
> > For example:
> > 
> > "Regular expressions are a context-independent syntax that can represent a wide
> > variety of character sets and character set orderings, where these character
> > sets are interpreted according to the current locale. While many regular
> > expressions can be interpreted differently depending on the current locale, many
> > features, such as character class expressions, provide for contextual invariance
> > across locales."
> 
> Sorry, maybe I wasn’t clear enough with my question.  When a character class fits the problem, it is clearly advantageous.
> 
> But under what circumstances would [A-Z] mean anything other than a character whose Unicode codepoint is between U+0041 and U+005A, inclusive?  Especially given the locale in the example is en_US.UTF-8.  Or, put another way, why would an implementation interpret [A-Z] as anything other than [ABCDE…XYZ]?

The collation rules for Unicode come from http://cldr.unicode.org/, and they
match the ones used on Linux and on illumos, for example.

Some GNU tools explicitly choose not to be locale-aware, to avoid this kind of
"surprise".
> 
> From reading your reference, I can see in 9.3.5.7:
> > In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior[…]
> 
> So even if the observed behaviour is conforming, I’d think it’s still highly undesirable.
> 
That range behaviour is only guaranteed in the POSIX locale, a.k.a. C, a.k.a.
the ASCII-only world.
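
For illustration, here is a minimal sketch of the two workarounds mentioned in
this thread, applied to the sample kern.boottime line quoted above. The
variable names (line, boot_date, boot_date_c) are made up for the example, and
the input is a fixed string rather than live sysctl output, so treat it as a
sketch and not as a drop-in replacement for the original script.

```shell
# Sample line copied from the kern.boottime output quoted earlier in the thread.
line='kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016'

# Workaround 1: use the POSIX character class instead of a range.
# [[:upper:]] matches by character class, not by collation order, so the
# result should not depend on LC_COLLATE.
boot_date=$(printf '%s\n' "$line" | sed -e 's/.*\([[:upper:]].*\)$/\1/')
printf '%s\n' "$boot_date"

# Workaround 2: force the C locale for this one command, so that the range
# [A-Z] is interpreted in the ASCII-only collation sequence.
boot_date_c=$(printf '%s\n' "$line" | LC_ALL=C sed -e 's/.*\([A-Z].*\)$/\1/')
printf '%s\n' "$boot_date_c"
```

Both commands should print the text starting at the last uppercase letter on
the line. The first form keeps the script locale-aware; the second simply opts
out of locale handling for that one pipeline stage.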

Best regards,
Bapt
