Date: Sun, 6 Nov 2016 21:57:00 +0100
From: Stefan Bethke <stb@lassitu.de>
To: Baptiste Daroussin <bapt@FreeBSD.org>
Cc: Greg Rivers <gcr+freebsd-stable@tharned.org>, freebsd-stable@freebsd.org
Subject: Re: Uppercase RE matching problems in FreeBSD 11
Message-ID: <29451103-E8DB-4656-A5BB-AEB924A728D6@lassitu.de>
In-Reply-To: <20161106110729.z2px7mzlhcwxvrvu@ivaldir.etoilebsd.net>
References: <alpine.BSF.2.20.1611051912260.2462@flake.tharned.org> <20161106110729.z2px7mzlhcwxvrvu@ivaldir.etoilebsd.net>

> On 06.11.2016 at 12:07, Baptiste Daroussin <bapt@FreeBSD.org> wrote:
>
> On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
>> I happened to run an old script today that uses sed(1) to extract the system
>> boot time from the kern.boottime sysctl MIB. On 11.0 this no longer works as
>> expected:
>>
>> $ sysctl kern.boottime
>> kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016
>> $ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
>> v  5 16:18:34 2016
>>
>> sed passes over 'S' and 'N' until it hits 'v', which it considers uppercase
>> apparently. This is with LANG=en_US.UTF-8. If I set LANG=C, it works as
>> expected:
>>
>> $ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
>> Nov  5 16:18:34 2016
>>
>> Testing every lowercase character separately gives even more inconsistent
>> results:
>>
>> $ cat <<! | LANG=en_US.UTF-8 sed -n -e '/^[A-Z]$/p'
>> Here sed thinks every lowercase character except for 'a' is uppercase! This
>> differs from the first test where sed did not think 'o' is uppercase. Again,
>> the above behaves as expected with LANG=C.
>>
>> Does anyone have any insight into this? This is likely to break a lot of
>> existing code.
>>
>
> Yes, A-Z only means uppercase in an ASCII-only world; in a Unicode world it
> means AaBb... Z, because there are way more characters than simple A-Z. In
> FreeBSD 11 we have a Unicode collation instead of falling back on
> LC_COLLATE=C, which means ASCII only.
>
> For regexps, for example, one should use the classes [:upper:] or [:lower:].
That is rather surprising.  Is there a normative reference for the
treatment of bracket expressions and character classes when using
locales other than C and/or encodings like UTF-8?
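
For illustration, the class-based form suggested above might look like the
following. This is only a minimal sketch: the sample line is hard-coded from
the sysctl output quoted earlier instead of calling sysctl, and the function
name extract_date is made up for the example. A character class inside a
bracket expression is written [[:upper:]], and it tests for case instead of
for position in the collation order.

```shell
# Sample input copied from the quoted sysctl output (an assumption, so the
# snippet does not depend on running sysctl on a FreeBSD host).
line='kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016'

# Hypothetical helper: same substitution as the original script, but with
# [[:upper:]] in place of the range [A-Z], so the match does not depend on
# how the current locale sorts characters.
extract_date() {
    sed -e 's/.*\([[:upper:]].*\)$/\1/'
}

result=$(printf '%s\n' "$line" | extract_date)
printf '%s\n' "$result"
```

Because the leading .* is greedy, the group starts at the last uppercase
letter on the line, which is the 'N' of "Nov", matching what the LANG=C run
printed.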
Stefan
-- 
Stefan Bethke <stb@lassitu.de>   Fon +49 151 14070811
