Date:      Tue, 8 Nov 2016 12:07:03 -0800
From:      Chuck Swiger <cswiger@mac.com>
To:        Stefan Ehmann <shoesoft@gmx.net>
Cc:        Stefan Bethke <stb@lassitu.de>, freebsd-stable <freebsd-stable@freebsd.org>
Subject:   Re: Uppercase RE matching problems in FreeBSD 11
Message-ID:  <81CABF69-8B12-40D8-9E65-CCF5D183441F@mac.com>
In-Reply-To: <e314f4b4-6e02-28db-1e51-a499b4c55cde@gmx.net>
References:  <alpine.BSF.2.20.1611051912260.2462@flake.tharned.org> <20161106110729.z2px7mzlhcwxvrvu@ivaldir.etoilebsd.net> <29451103-E8DB-4656-A5BB-AEB924A728D6@lassitu.de> <20161106210628.hg3dcpozfjtuo3nt@ivaldir.etoilebsd.net> <C4BC6673-2E07-45E6-81D6-EB4FF99605A8@lassitu.de> <20161106212729.z2edg44kg7hc4r2z@ivaldir.etoilebsd.net> <DFC60C4E-2116-474F-82DD-DED10518970F@lassitu.de> <99E209EA-75B0-430D-8F0C-E51D614143BA@mac.com> <e314f4b4-6e02-28db-1e51-a499b4c55cde@gmx.net>

On Nov 8, 2016, at 11:54 AM, Stefan Ehmann <shoesoft@gmx.net> wrote:
> On 07.11.2016 22:13, Charles Swiger wrote:
>> On Nov 6, 2016, at 1:49 PM, Stefan Bethke <stb@lassitu.de> wrote:
>>> On 06.11.2016 at 22:27, Baptiste Daroussin
>>> <bapt@FreeBSD.org> wrote:
>>>> That works for POSIX locale aka C aka ASCII only world
>>>
>>> So what do I set my LANG and LC variables to?  I do want UTF-8, but
>>> I do also want my scripts to continue to work.  Clearly,
>>> en_US.UTF-8 is not what I want.  Is it C.UTF-8?  Or do I set
>>> LANG=en_US.UTF-8 and LC_COLLATE=C?
>>
>> If you want to use a UTF8 locale, then you must start using character
>> classes like '[:upper:]' and '[:lower:]' because those will-- or at
>> least "should", modulo bugs-- properly handle the collation issues
>> including for languages which do not possess a 1-1 mapping between
>> upper and lower case letters.
>>
>> Someone with a German email address is presumably familiar with ß /
>> Eszett...?  :-)
>
> Character classes work fine for [a-z], but I don't know of a simple way
> to match a range like [a-k].

True.  If you need smaller ranges, I don't see a portable way of doing
so outside the POSIX / "C" locale beyond listing the characters out.  Or:

> Personally, I prefer the "Rational Range Interpretation" because it
> doesn't break backward compatibility and is still standard compliant.

...yes, +1.  Many of the GNU tools like grep and gawk have adopted this,
but they are replacing the system regex routines with their own code.

However, you can't rely on RRI without first testing whether gawk is in
the $PATH, or whether /usr/bin/awk (or whichever awk you run) is really
GNU awk.
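As a rough sketch of that kind of test (not a definitive recipe): the
helper names is_gnu_awk and first_letter_a_to_k below are made up for
illustration, and the check assumes that GNU awk identifies itself with
the string "GNU Awk" in its --version output.  It only establishes that
the awk is GNU awk; it does not check which gawk version is installed.
The second function shows the locale-independent fallback of listing the
range out explicitly.

```shell
#!/bin/sh

# Hypothetical helper: succeed only if the awk given as $1 identifies
# itself as GNU awk. Assumes gawk prints "GNU Awk" for --version; other
# awks may print something else or fail, which counts as "not GNU".
# stdin comes from /dev/null so an awk that ignores the flag cannot block.
is_gnu_awk() {
    "$1" --version </dev/null 2>/dev/null | grep -q 'GNU Awk'
}

# Fallback that does not depend on RRI or on the collation order:
# spell out the range a..k instead of writing [a-k].
first_letter_a_to_k() {
    grep '^[abcdefghijk]'
}

# Pick an awk that is known to be GNU awk; leave AWK empty otherwise.
if command -v gawk >/dev/null 2>&1; then
    AWK=gawk
elif is_gnu_awk /usr/bin/awk; then
    AWK=/usr/bin/awk
else
    AWK=
fi
```

A script could then use "$AWK" for range expressions when it is
non-empty, and fall back to an explicit list (or to character classes
such as [[:lower:]]) when it is empty.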

Regards,
-- 
-Chuck



