From owner-freebsd-stable@freebsd.org Sun Nov 6 21:27:33 2016
Date: Sun, 6 Nov 2016 22:27:29 +0100
From: Baptiste Daroussin
To: Stefan Bethke
Cc: Greg Rivers, freebsd-stable@freebsd.org
Subject: Re: Uppercase RE matching problems in FreeBSD 11
Message-ID: <20161106212729.z2edg44kg7hc4r2z@ivaldir.etoilebsd.net>
References: <20161106110729.z2px7mzlhcwxvrvu@ivaldir.etoilebsd.net>
 <29451103-E8DB-4656-A5BB-AEB924A728D6@lassitu.de>
 <20161106210628.hg3dcpozfjtuo3nt@ivaldir.etoilebsd.net>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256;
 protocol="application/pgp-signature"; boundary="emwtdvp3diybumk6"
List-Id: Production branch of FreeBSD source code

--emwtdvp3diybumk6
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline

On Sun, Nov 06, 2016 at 10:20:54PM +0100, Stefan Bethke wrote:
>
> > On 06.11.2016 at 22:06, Baptiste Daroussin wrote:
> >
> > On Sun, Nov 06, 2016 at 09:57:00PM +0100, Stefan Bethke wrote:
> >>
> >>> On 06.11.2016 at 12:07, Baptiste Daroussin wrote:
> >>>
> >>> On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
> >>>> I happened to run an old script today that uses sed(1) to extract the system
> >>>> boot time from the kern.boottime sysctl MIB. On 11.0 this no longer works as
> >>>> expected:
> >>>>
> >>>> $ sysctl kern.boottime
> >>>> kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov 5 16:18:34 2016
> >>>> $ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
> >>>> v 5 16:18:34 2016
> >>>>
> >>>> sed passes over 'S' and 'N' until it hits 'v', which it considers uppercase
> >>>> apparently. This is with LANG=en_US.UTF-8. If I set LANG=C, it works as
> >>>> expected:
> >>>>
> >>>> $ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
> >>>> Nov 5 16:18:34 2016
> >>>>
> >>>> Testing every lowercase character separately gives even more inconsistent
> >>>> results:
> >>>>
> >>>> $ cat <
> >>>>
> >>>> Here sed thinks every lowercase character except for 'a' is uppercase! This
> >>>> differs from the first test where sed did not think 'o' is uppercase. Again,
> >>>> the above behaves as expected with LANG=C.
> >>>>
> >>>> Does anyone have any insight into this? This is likely to break a lot of
> >>>> existing code.
> >>>>
> >>>
> >>> Yes, A-Z only means uppercase in an ASCII-only world; in a Unicode world it
> >>> means AaBb... Z, because there are way more characters than simple A-Z. In
> >>> FreeBSD 11 we have a Unicode collation instead of falling back on
> >>> LC_COLLATE=C, which means ASCII only.
> >>>
> >>> For regexps, for example, one should use the classes [:upper:] or [:lower:].
> >>
> >> That is rather surprising. Is there a normative reference for the treatment of bracket expressions and character classes when using locales other than C and/or encodings like UTF-8?
> >
> > http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html
> >
> > For example:
> >
> > "Regular expressions are a context-independent syntax that can represent a wide
> > variety of character sets and character set orderings, where these character
> > sets are interpreted according to the current locale. While many regular
> > expressions can be interpreted differently depending on the current locale, many
> > features, such as character class expressions, provide for contextual invariance
> > across locales."
>
> Sorry, maybe I wasn’t clear enough with my question. When a character class fits the problem, it is clearly advantageous.
>
> But under what circumstances would [A-Z] mean anything other than a character whose Unicode codepoint is between U+0041 and U+005A, inclusive? Especially given the locale in the example is en_US.UTF-8. Or, put another way, why would an implementation interpret [A-Z] as anything other than [ABCDE…XYZ]?

The collation rules for Unicode come from http://cldr.unicode.org/, and they
match the ones used on Linux, for example, and on illumos.

Some GNU tools explicitly choose not to be locale-aware, to avoid this kind of
"surprise".

>
> From reading your reference, I can see in 9.3.5.7:
> > In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior[…]
>
> So even if the observed behaviour is conforming, I’d think it’s still highly undesirable.
>

That works for the POSIX locale, aka C, aka the ASCII-only world.

Best regards,
Bapt

--emwtdvp3diybumk6
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQIcBAABCAAGBQJYH6BBAAoJEGOJi9zxtz5a4WMQANQEyjEiHzLFm+PjecLD9c2C
ZRpksfh/wypquEiHre6+OsQ3fVrLf2u82XJ6Drq/89sQFWovVIKuOvN7TnmAuDp/
xlpqgh1MW2svfsJqAWGgi5dhC9H7ayqpZRJG5Sdo0kobZq0EdPS3bAR15SCoKEWT
PQBX8Kx4CF1v+5f9VsmJvY7T+0YpgtFHUxBiqwfwm1d3GxQ0wrJ9TPhSB42XCcYT
f6rh38x/yrSgjQ9S8LdZ6C/0bBPjEUJX8GHKubCOjvIk6JpRZ/z1QTbvpdUNyldG
KzkYemFCrCpz1pEBgQE2LVslrAjmLBKG6F2QMLcPdE0RGhBX1/pO378noxLkQb2h
Z54J7PtirZ7JjdsvE/KZcKEoGNWYUJGEZvO4OFVKJ0MysBo7lOLEv4MmAHRfWR33
eu4oTNvvBCR+NP28TybqboWfiO9+9ZUuc6S/k4ShyPXwGkTgPvIvQiWp49m2U1hk
mFOVtg5TXWzARcWYso83MepmB4dM9eS56j/jcQ33lHoTSnzSPT16KOInp713R5KW
XkZQf5LFzjpVObyLjL/c5i9hYAzKxKT44Z4DrwDjp+x4byjwK1HTLmFOA0LT2Ncq
mHYlXJ3B7xvXtFHrgozdWh3df0GeiBMkJTDaRPlWbqFQj5qZ6THgiQSa2kb/8gm1
73E2KsvFIkUP86x4aH1I
=5UHd
-----END PGP SIGNATURE-----

--emwtdvp3diybumk6--
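
For reference, the two workarounds that come up in this thread (forcing the C locale for a single command, or using a character class instead of the A-Z range) can be sketched roughly as below. This is only an illustration, assuming a POSIX sh and sed: the variable names (line, with_c_locale, with_class) are made up for the example, and the sample line is copied from the sysctl output quoted in the thread rather than read from a live system.

```shell
#!/bin/sh
# Sample line copied from the kern.boottime output quoted in the thread
# (an assumption for illustration; a real script would call sysctl).
line='kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov 5 16:18:34 2016'

# Workaround 1: force the C (POSIX) locale for this one command, so that
# [A-Z] is a plain ASCII range. LC_ALL overrides LANG and all LC_* variables.
with_c_locale=$(printf '%s\n' "$line" | LC_ALL=C sed -e 's/.*\([A-Z].*\)$/\1/')

# Workaround 2: use the [:upper:] character class inside the bracket
# expression; it selects uppercase letters in whatever locale is active.
with_class=$(printf '%s\n' "$line" | sed -e 's/.*\([[:upper:]].*\)$/\1/')

printf '%s\n' "$with_c_locale"   # prints: Nov 5 16:18:34 2016
printf '%s\n' "$with_class"      # prints: Nov 5 16:18:34 2016
```

The second form is probably the better fit for scripts that otherwise need to stay locale-aware, since LC_ALL=C also changes how that command treats non-ASCII input.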