Date:      Sun, 6 Nov 2016 22:20:54 +0100
From:      Stefan Bethke <stb@lassitu.de>
To:        Baptiste Daroussin <bapt@FreeBSD.org>
Cc:        Greg Rivers <gcr+freebsd-stable@tharned.org>, freebsd-stable@freebsd.org
Subject:   Re: Uppercase RE matching problems in FreeBSD 11
Message-ID:  <C4BC6673-2E07-45E6-81D6-EB4FF99605A8@lassitu.de>
In-Reply-To: <20161106210628.hg3dcpozfjtuo3nt@ivaldir.etoilebsd.net>
References:  <alpine.BSF.2.20.1611051912260.2462@flake.tharned.org> <20161106110729.z2px7mzlhcwxvrvu@ivaldir.etoilebsd.net> <29451103-E8DB-4656-A5BB-AEB924A728D6@lassitu.de> <20161106210628.hg3dcpozfjtuo3nt@ivaldir.etoilebsd.net>



> On 06.11.2016 at 22:06, Baptiste Daroussin <bapt@FreeBSD.org> wrote:
> 
> On Sun, Nov 06, 2016 at 09:57:00PM +0100, Stefan Bethke wrote:
>> 
>>> On 06.11.2016 at 12:07, Baptiste Daroussin <bapt@FreeBSD.org> wrote:
>>> 
>>> On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
>>>> I happened to run an old script today that uses sed(1) to extract the system
>>>> boot time from the kern.boottime sysctl MIB. On 11.0 this no longer works as
>>>> expected:
>>>> 
>>>> $ sysctl kern.boottime
>>>> kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016
>>>> $ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
>>>> v  5 16:18:34 2016
>>>> 
>>>> sed passes over 'S' and 'N' until it hits 'v', which it considers uppercase
>>>> apparently. This is with LANG=en_US.UTF-8. If I set LANG=C, it works as
>>>> expected:
>>>> 
>>>> $ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
>>>> Nov  5 16:18:34 2016
>>>> 
>>>> Testing every lowercase character separately gives even more inconsistent
>>>> results:
>>>> 
>>>> $ cat <<! | LANG=en_US.UTF-8 sed -n -e '/^[A-Z]$/p'
>> 
>>>> Here sed thinks every lowercase character except for 'a' is uppercase! This
>>>> differs from the first test where sed did not think 'o' is uppercase. Again,
>>>> the above behaves as expected with LANG=C.
>>>> 
>>>> Does anyone have any insight into this? This is likely to break a lot of
>>>> existing code.
>>>> 
>>> 
>>> Yes, A-Z only means uppercase in an ASCII-only world; in a Unicode world it
>>> means AaBb... Z, because there are way more characters than simple A-Z. In
>>> FreeBSD 11 we have a Unicode collation instead of falling back on
>>> LC_COLLATE=C, which means ASCII only.
>>> 
>>> For regexps, for example, one should use the classes [:upper:] or [:lower:].
>> 
>> That is rather surprising.  Is there a normative reference for the treatment of bracket expressions and character classes when using locales other than C and/or encodings like UTF-8?
> 
> http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html
> 
> For example:
> 
> "Regular expressions are a context-independent syntax that can represent a wide
> variety of character sets and character set orderings, where these character
> sets are interpreted according to the current locale. While many regular
> expressions can be interpreted differently depending on the current locale, many
> features, such as character class expressions, provide for contextual invariance
> across locales."

Sorry, maybe I wasn’t clear enough with my question.  When a character class fits the problem, it is clearly advantageous.

But under what circumstances would [A-Z] mean anything other than a character whose Unicode codepoint is between U+0041 and U+005A, inclusive?  Especially given the locale in the example is en_US.UTF-8.  Or, put another way, why would an implementation interpret [A-Z] as anything other than [ABCDE…XYZ]?

From reading your reference, I can see in 9.3.5.7:
> In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior[…]

So even if the observed behaviour is conforming, I’d think it’s still highly undesirable.
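
For anyone hitting this in existing scripts, here is a minimal sketch of the two workarounds mentioned in this thread: the [:upper:] character class, and pinning the locale to C for the one command. The sample line is copied from Greg's report so that no sysctl call is needed; the variable names are only illustrative, and LC_ALL is used instead of LANG on the assumption that it should override any other LC_* settings in the environment.

```shell
# Sample input copied from the original report (stands in for `sysctl kern.boottime`).
line='kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016'

# Workaround 1: use a character class instead of a range expression.
# Character classes are intended to behave consistently across locales.
by_class=$(printf '%s\n' "$line" | sed -e 's/.*\([[:upper:]].*\)$/\1/')

# Workaround 2: keep the [A-Z] range, but force the C locale for this command
# only, where a range is defined by the plain collation sequence.
by_c_locale=$(printf '%s\n' "$line" | LC_ALL=C sed -e 's/.*\([A-Z].*\)$/\1/')

printf '%s\n' "$by_class" "$by_c_locale"
```

Both variants should print "Nov  5 16:18:34 2016", matching the LANG=C run quoted above. The character class is probably the safer choice when the input may contain non-ASCII text, since forcing the C locale also changes how multibyte characters are handled.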


Stefan

-- 
Stefan Bethke <stb@lassitu.de>   Fon +49 151 14070811





