Date: Sun, 6 Nov 2016 22:14:50 +0100
From: Stefan Ehmann <shoesoft@gmx.net>
To: Stefan Bethke <stb@lassitu.de>, Baptiste Daroussin <bapt@FreeBSD.org>
Cc: Greg Rivers <gcr+freebsd-stable@tharned.org>, freebsd-stable@freebsd.org
Subject: Re: Uppercase RE matching problems in FreeBSD 11
Message-ID: <a3f401a7-9dc9-d567-bf21-139364702599@gmx.net>
In-Reply-To: <29451103-E8DB-4656-A5BB-AEB924A728D6@lassitu.de>
References: <alpine.BSF.2.20.1611051912260.2462@flake.tharned.org>
 <20161106110729.z2px7mzlhcwxvrvu@ivaldir.etoilebsd.net>
 <29451103-E8DB-4656-A5BB-AEB924A728D6@lassitu.de>
On 06.11.2016 21:57, Stefan Bethke wrote:
>
>> On 06.11.2016 at 12:07, Baptiste Daroussin
>> <bapt@FreeBSD.org> wrote:
>>
>> On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
>>> I happened to run an old script today that uses sed(1) to extract
>>> the system boot time from the kern.boottime sysctl MIB. On 11.0
>>> this no longer works as expected:
>>> ..
>>> Here sed thinks every lowercase character except for 'a' is
>>> uppercase! This differs from the first test where sed did not
>>> think 'o' is uppercase. Again, the above behaves as expected with
>>> LANG=C.
>>>
>>> Does anyone have any insight into this? This is likely to break a
>>> lot of existing code.
>>>
>>
>> Yes, A-Z only means uppercase in an ASCII-only world; in a Unicode
>> world it means AaBb... Z, because there are way more characters
>> than simple A-Z. In FreeBSD 11 we have a Unicode collation instead
>> of falling back on LC_COLLATE=C, which means ASCII only.
>>
>> For regexps, for example, one should use the classes :upper: or
>> :lower:.
>
> That is rather surprising. Is there a normative reference for the
> treatment of bracket expressions and character classes when using
> locales other than C and/or encodings like UTF-8?

I found an interesting article about this issue in gawk:
https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html

Apparently the meaning of ranges is unspecified outside the "C" locale.

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05
says:

"In the POSIX locale, a range expression represents the set of
collating elements that fall between two elements in the collation
sequence, inclusive. In other locales, a range expression has
unspecified behavior: strictly conforming applications shall not rely
on whether the range expression is valid, or on the set of collating
elements matched"
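
To make the quoted rule concrete, here is a toy sketch in Python of how a range such as A-Z can pick up lowercase letters when membership is decided by position in a collation sequence. The interleaved order aAbB...zZ is an assumption chosen only because it reproduces what Greg reported (every lowercase letter except 'a' matches); it is not FreeBSD's actual collation table, and the helper names are hypothetical.

```python
import string

# ASSUMPTION: a toy collation sequence with lowercase and uppercase
# interleaved (aAbB...zZ). This is not the real FreeBSD 11 collation table.
INTERLEAVED = "".join(
    lo + up for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase)
)

# Code point order for ASCII letters, as in the C/POSIX locale (A..Z, then a..z).
C_ORDER = "".join(sorted(string.ascii_letters))


def in_range(ch, start, end, sequence):
    """True if ch lies between start and end, inclusive, in the sequence."""
    return sequence.index(start) <= sequence.index(ch) <= sequence.index(end)


def matched(start, end, sequence):
    """All letters of the sequence that the range start-end would match."""
    return "".join(c for c in sequence if in_range(c, start, end, sequence))


def is_upper_class(ch):
    """Rough analogue of a character class test: asks about case, not position."""
    return ch.isupper()


if __name__ == "__main__":
    print(matched("A", "Z", C_ORDER))      # range under code point order
    print(matched("A", "Z", INTERLEAVED))  # range under the toy collation
    print([c for c in "aoZ" if is_upper_class(c)])
```

Under the code point order the range covers only the 26 uppercase letters. Under the toy interleaved order it also covers 'b' through 'z', which is why a class test that asks about case rather than position is the portable choice.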