From: Stefan Bethke
Date: Sun, 6 Nov 2016 22:20:54 +0100
Subject: Re: Uppercase RE matching problems in FreeBSD 11
To: Baptiste Daroussin
Cc: Greg Rivers, freebsd-stable@freebsd.org
In-Reply-To: <20161106210628.hg3dcpozfjtuo3nt@ivaldir.etoilebsd.net>
References: <20161106110729.z2px7mzlhcwxvrvu@ivaldir.etoilebsd.net>
 <29451103-E8DB-4656-A5BB-AEB924A728D6@lassitu.de>
 <20161106210628.hg3dcpozfjtuo3nt@ivaldir.etoilebsd.net>
List-Id: Production branch of FreeBSD source code

> On 06.11.2016 at 22:06, Baptiste Daroussin wrote:
>
> On Sun, Nov 06, 2016 at 09:57:00PM +0100, Stefan Bethke wrote:
>>
>>> On 06.11.2016 at 12:07, Baptiste Daroussin wrote:
>>>
>>> On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
>>>> I happened to run an old script today that uses sed(1) to extract
>>>> the system boot time from the kern.boottime sysctl MIB. On 11.0 this no
>>>> longer works as expected:
>>>>
>>>> $ sysctl kern.boottime
>>>> kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov 5 16:18:34 2016
>>>> $ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
>>>> v 5 16:18:34 2016
>>>>
>>>> sed passes over 'S' and 'N' until it hits 'v', which it considers uppercase
>>>> apparently. This is with LANG=en_US.UTF-8. If I set LANG=C, it works as
>>>> expected:
>>>>
>>>> $ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
>>>> Nov 5 16:18:34 2016
>>>>
>>>> Testing every lowercase character separately gives even more inconsistent
>>>> results:
>>>>
>>>> $ cat <
>>>>
>>>> Here sed thinks every lowercase character except for 'a' is uppercase! This
>>>> differs from the first test where sed did not think 'o' is uppercase. Again,
>>>> the above behaves as expected with LANG=C.
>>>>
>>>> Does anyone have any insight into this? This is likely to break a lot of
>>>> existing code.
>>>>
>>>
>>> Yes, A-Z only means uppercase in an ASCII-only world; in a Unicode world it
>>> means AaBb...Z, because there are way more characters than simple A-Z. In
>>> FreeBSD 11 we have a Unicode collation instead of falling back on
>>> LC_COLLATE=C, which means ASCII only.
>>>
>>> For regexps, for example, one should use the classes: :upper: or :lower:.
>>
>> That is rather surprising. Is there a normative reference for the treatment
>> of bracket expressions and character classes when using locales other than C
>> and/or encodings like UTF-8?
>
> http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html
>
> For example:
>
> "Regular expressions are a context-independent syntax that can represent a wide
> variety of character sets and character set orderings, where these character
> sets are interpreted according to the current locale.
> While many regular
> expressions can be interpreted differently depending on the current locale, many
> features, such as character class expressions, provide for contextual invariance
> across locales."

Sorry, maybe I wasn’t clear enough with my question. When a character class
fits the problem, it is clearly advantageous.

But under what circumstances would [A-Z] mean anything other than a character
whose Unicode codepoint is between U+0041 and U+005A, inclusive? Especially
given the locale in the example is en_US.UTF-8. Or, put another way, why would
an implementation interpret [A-Z] as anything other than [ABCDE…XYZ]?

From reading your reference, I can see in 9.3.5.7:

> In the POSIX locale, a range expression represents the set of collating
> elements that fall between two elements in the collation sequence, inclusive.
> In other locales, a range expression has unspecified behavior[…]

So even if the observed behaviour is conforming, I’d think it’s still highly
undesirable.


Stefan

-- 
Stefan Bethke   Fon +49 151 14070811
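
For reference, the two workarounds mentioned in this thread can be sketched as
follows. This is a minimal sketch, not a tested fix for any particular script:
the sample line is copied from the kern.boottime output quoted above rather
than read from sysctl, the variable names are made up for illustration, and
LC_ALL=C is used in place of the LANG=C from the original report on the
assumption that LC_ALL may already be set in the caller's environment (it
takes precedence over LANG).

```shell
#!/bin/sh
# Sample input: copied from the kern.boottime output quoted in the thread,
# so the sketch does not depend on sysctl being available.
line='kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov 5 16:18:34 2016'

# Workaround 1: use the [[:upper:]] character class inside the bracket
# expression. Class membership does not depend on the collation order.
by_class=$(printf '%s\n' "$line" | sed -e 's/.*\([[:upper:]].*\)$/\1/')

# Workaround 2: keep [A-Z], but run this one sed invocation in the C locale,
# where the range follows the ASCII order.
by_c_locale=$(printf '%s\n' "$line" | LC_ALL=C sed -e 's/.*\([A-Z].*\)$/\1/')

printf '%s\n' "$by_class"
printf '%s\n' "$by_c_locale"
```

Both commands should print the date starting at the last uppercase letter of
the line, because the leading .* is greedy. The class form keeps working when
the input contains non-ASCII uppercase letters; the C-locale form treats the
input as bytes, which is only appropriate for ASCII data such as this.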