From owner-freebsd-stable@freebsd.org  Sun Nov  6 21:20:57 2016
Return-Path: <owner-freebsd-stable@freebsd.org>
Delivered-To: freebsd-stable@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 587FEC34F3C
 for <freebsd-stable@mailman.ysv.freebsd.org>;
 Sun,  6 Nov 2016 21:20:57 +0000 (UTC) (envelope-from stb@lassitu.de)
Received: from gilb.zs64.net (gilb.zs64.net [IPv6:2a00:14b0:4200:32e0::1ea])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "gilb.zs64.net", Issuer "Let's Encrypt Authority X3" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id E62D6B8C;
 Sun,  6 Nov 2016 21:20:56 +0000 (UTC) (envelope-from stb@lassitu.de)
Received: by gilb.zs64.net (Postfix, from stb@lassitu.de) id 6C90A1E2396;
 Sun,  6 Nov 2016 21:20:55 +0000 (UTC)
Content-Type: text/plain; charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 10.1 \(3251\))
Subject: Re: Uppercase RE matching problems in FreeBSD 11
From: Stefan Bethke <stb@lassitu.de>
In-Reply-To: <20161106210628.hg3dcpozfjtuo3nt@ivaldir.etoilebsd.net>
Date: Sun, 6 Nov 2016 22:20:54 +0100
Cc: Greg Rivers <gcr+freebsd-stable@tharned.org>,
 freebsd-stable@freebsd.org
Content-Transfer-Encoding: quoted-printable
Message-Id: <C4BC6673-2E07-45E6-81D6-EB4FF99605A8@lassitu.de>
References: <alpine.BSF.2.20.1611051912260.2462@flake.tharned.org>
 <20161106110729.z2px7mzlhcwxvrvu@ivaldir.etoilebsd.net>
 <29451103-E8DB-4656-A5BB-AEB924A728D6@lassitu.de>
 <20161106210628.hg3dcpozfjtuo3nt@ivaldir.etoilebsd.net>
To: Baptiste Daroussin <bapt@FreeBSD.org>
X-Mailer: Apple Mail (2.3251)
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-stable>, 
 <mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable/>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
 <mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 06 Nov 2016 21:20:57 -0000


> On 06.11.2016 at 22:06, Baptiste Daroussin <bapt@FreeBSD.org> wrote:
>
> On Sun, Nov 06, 2016 at 09:57:00PM +0100, Stefan Bethke wrote:
>>
>>> On 06.11.2016 at 12:07, Baptiste Daroussin <bapt@FreeBSD.org> wrote:
>>>
>>> On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
>>>> I happened to run an old script today that uses sed(1) to extract the system
>>>> boot time from the kern.boottime sysctl MIB. On 11.0 this no longer works as
>>>> expected:
>>>>
>>>> $ sysctl kern.boottime
>>>> kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016
>>>> $ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
>>>> v  5 16:18:34 2016
>>>>
>>>> sed passes over 'S' and 'N' until it hits 'v', which it considers uppercase
>>>> apparently. This is with LANG=en_US.UTF-8. If I set LANG=C, it works as
>>>> expected:
>>>>
>>>> $ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
>>>> Nov  5 16:18:34 2016
>>>>
>>>> Testing every lowercase character separately gives even more inconsistent
>>>> results:
>>>>
>>>> $ cat <<! | LANG=en_US.UTF-8 sed -n -e '/^[A-Z]$/p'
>>
>>>> Here sed thinks every lowercase character except for 'a' is uppercase! This
>>>> differs from the first test where sed did not think 'o' is uppercase. Again,
>>>> the above behaves as expected with LANG=C.
>>>>
>>>> Does anyone have any insight into this? This is likely to break a lot of
>>>> existing code.
>>>>
>>>
>>> Yes, A-Z only means uppercase in an ASCII-only world; in a Unicode world it
>>> means AaBb... Z, because there are way more characters than simple A-Z. In
>>> FreeBSD 11 we have a Unicode collation instead of falling back on
>>> LC_COLLATE=C, which means ASCII only.
>>>
>>> For regexps, for example, one should use the character classes [:upper:] or
>>> [:lower:].
>>
>> That is rather surprising.  Is there a normative reference for the
>> treatment of bracket expressions and character classes when using locales
>> other than C and/or encodings like UTF-8?
>
> http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html
>
> For example:
>
> "Regular expressions are a context-independent syntax that can represent a wide
> variety of character sets and character set orderings, where these character
> sets are interpreted according to the current locale. While many regular
> expressions can be interpreted differently depending on the current locale, many
> features, such as character class expressions, provide for contextual invariance
> across locales."

Sorry, maybe I wasn’t clear enough with my question.  When a character
class fits the problem, it is clearly advantageous.
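
As a minimal sketch of that case, here is Greg's first command with the
range swapped for [[:upper:]].  The sample line is copied from his output
above; the variable names are only there for illustration:

```shell
# Sample line copied from the kern.boottime output quoted above.
line='kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016'

# [[:upper:]] asks for "an uppercase letter" directly, so the match does not
# depend on where the locale's collation order places lowercase letters.
result=$(printf '%s\n' "$line" | sed -e 's/.*\([[:upper:]].*\)$/\1/')

printf '%s\n' "$result"
# prints: Nov  5 16:18:34 2016
```

The greedy .* still skips ahead to the last uppercase letter on the line,
which is the 'N' of "Nov", so this should behave the same under C and
en_US.UTF-8.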

But under what circumstances would [A-Z] mean anything other than a
character whose Unicode codepoint is between U+0041 and U+005A,
inclusive?  Especially given the locale in the example is en_US.UTF-8.
Or, put another way, why would an implementation interpret [A-Z] as
anything other than [ABCDE…XYZ]?

From reading your reference, I can see in 9.3.5.7:
> In the POSIX locale, a range expression represents the set of collating
> elements that fall between two elements in the collation sequence,
> inclusive. In other locales, a range expression has unspecified
> behavior[…]

So even if the observed behaviour is conforming, I’d think it’s still
highly undesirable.
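
For scripts that need exactly the ASCII capitals in the meantime, two
workarounds come to mind.  This is only a sketch, assuming a POSIX shell
and sed; the input letters and variable names are made up for
illustration:

```shell
# Made-up input: one letter per line, mixed case.
letters=$(printf '%s\n' a B c D)

# Option 1: pin the locale for this one command. LC_ALL overrides LANG and
# LC_COLLATE, so the range is evaluated in the C (POSIX) collation sequence.
pinned=$(printf '%s\n' "$letters" | LC_ALL=C sed -n -e '/^[A-Z]$/p')

# Option 2: list the letters explicitly. There is no range expression left,
# so the locale's collation order is not consulted to expand one.
listed=$(printf '%s\n' "$letters" | sed -n -e '/^[ABCDEFGHIJKLMNOPQRSTUVWXYZ]$/p')

printf '%s\n' "$pinned" "$listed"
```

Both variants should print only B and D.  The first matches what Greg
already saw with LANG=C; LC_ALL is used here so that an LC_COLLATE set
elsewhere in the environment cannot override it.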


Stefan

-- 
Stefan Bethke <stb@lassitu.de>   Fon +49 151 14070811