Date:      Sun, 6 Nov 2016 22:49:51 +0100
From:      Stefan Bethke <stb@lassitu.de>
To:        Baptiste Daroussin <bapt@FreeBSD.org>
Cc:        Greg Rivers <gcr+freebsd-stable@tharned.org>, freebsd-stable@freebsd.org
Subject:   Re: Uppercase RE matching problems in FreeBSD 11
Message-ID:  <DFC60C4E-2116-474F-82DD-DED10518970F@lassitu.de>
In-Reply-To: <20161106212729.z2edg44kg7hc4r2z@ivaldir.etoilebsd.net>
References:  <alpine.BSF.2.20.1611051912260.2462@flake.tharned.org> <20161106110729.z2px7mzlhcwxvrvu@ivaldir.etoilebsd.net> <29451103-E8DB-4656-A5BB-AEB924A728D6@lassitu.de> <20161106210628.hg3dcpozfjtuo3nt@ivaldir.etoilebsd.net> <C4BC6673-2E07-45E6-81D6-EB4FF99605A8@lassitu.de> <20161106212729.z2edg44kg7hc4r2z@ivaldir.etoilebsd.net>

On 06.11.2016 at 22:27, Baptiste Daroussin <bapt@FreeBSD.org> wrote:
>
>> But under what circumstances would [A-Z] mean anything other than a
>> character whose Unicode codepoint is between U+0041 and U+005A,
>> inclusive?  Especially given the locale in the example is en_US.UTF-8.
>> Or, put another way, why would an implementation interpret [A-Z] as
>> anything other than [ABCDE…XYZ]?
>
> The collation rules for Unicode come from http://cldr.unicode.org/, and they
> match the ones on Linux, for example, and on illumos.
>
> Some GNU tools explicitly choose not to be locale-aware, to avoid this
> kind of "surprise".
>>
>> From reading your reference, I can see in 9.3.5.7:
>>> In the POSIX locale, a range expression represents the set of
>>> collating elements that fall between two elements in the collation
>>> sequence, inclusive. In other locales, a range expression has
>>> unspecified behavior[…]
>>
>> So even if the observed behaviour is conforming, I’d think
>> it’s still highly undesirable.
>>
> That works for POSIX locale aka C aka ASCII only world

So what do I set my LANG and LC variables to?  I do want UTF-8, but I do
also want my scripts to continue to work.  Clearly, en_US.UTF-8 is not
what I want.  Is it C.UTF-8?  Or do I set LANG=en_US.UTF-8 and
LC_COLLATE=C?
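
The second option can be tried out as follows. This is a minimal sketch, not a recommendation from the thread: it assumes a POSIX sh and a grep that honours LC_COLLATE, and match_upper is a hypothetical helper name used only for illustration. It keeps LANG at en_US.UTF-8 and overrides only the collation category.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print the input lines that
# contain a character matched by [A-Z], with collation forced to the C
# locale while LANG stays at a UTF-8 locale.
match_upper() {
    (
        # LC_ALL takes precedence over LC_COLLATE, so clear it first.
        # The subshell keeps the change from leaking into the caller.
        unset LC_ALL
        LANG=en_US.UTF-8 LC_COLLATE=C grep '[A-Z]'
    )
}

# Feed one lowercase and one uppercase line through the helper.
printf 'a\nB\n' | match_upper
```

Note that if LC_ALL is set in the environment, it overrides LC_COLLATE, which is why the sketch unsets it first.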


Stefan

-- 
Stefan Bethke <stb@lassitu.de>   Fon +49 151 14070811






