Date: Sun, 6 Nov 2016 12:07:29 +0100
From: Baptiste Daroussin <bapt@FreeBSD.org>
To: Greg Rivers <gcr+freebsd-stable@tharned.org>
Cc: freebsd-stable@freebsd.org
Subject: Re: Uppercase RE matching problems in FreeBSD 11
Message-ID: <20161106110729.z2px7mzlhcwxvrvu@ivaldir.etoilebsd.net>
In-Reply-To: <alpine.BSF.2.20.1611051912260.2462@flake.tharned.org>
References: <alpine.BSF.2.20.1611051912260.2462@flake.tharned.org>

On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
> I happened to run an old script today that uses sed(1) to extract the system
> boot time from the kern.boottime sysctl MIB. On 11.0 this no longer works as
> expected:
>
> $ sysctl kern.boottime
> kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov 5 16:18:34 2016
> $ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
> v 5 16:18:34 2016
>
> sed passes over 'S' and 'N' until it hits 'v', which it considers uppercase
> apparently. This is with LANG=en_US.UTF-8. If I set LANG=C, it works as
> expected:
>
> $ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
> Nov 5 16:18:34 2016
>
> Testing every lowercase character separately gives even more inconsistent
> results:
>
> $ cat <<! | LANG=en_US.UTF-8 sed -n -e '/^[A-Z]$/'p
> > a
> > b
> > c
> > d
> > e
> > f
> > g
> > h
> > i
> > j
> > k
> > l
> > m
> > n
> > o
> > p
> > q
> > r
> > s
> > t
> > u
> > v
> > w
> > x
> > y
> > z
> > !
> b
> c
> d
> e
> f
> g
> h
> i
> j
> k
> l
> m
> n
> o
> p
> q
> r
> s
> t
> u
> v
> w
> x
> y
> z
>
> Here sed thinks every lowercase character except for 'a' is uppercase! This
> differs from the first test where sed did not think 'o' is uppercase. Again,
> the above behaves as expected with LANG=C.
>
> Does anyone have any insight into this? This is likely to break a lot of
> existing code.
>

Yes. A-Z only means uppercase in an ASCII-only world. In a Unicode world it
means AaBb...Z, because there are far more characters than simply A-Z. In
FreeBSD 11 we have a Unicode collation instead of falling back to
LC_COLLATE=C, which means ASCII only.

For regexps, for example, one should use the character classes [:upper:] or
[:lower:].
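
As a minimal sketch of that advice, the two commands from the quoted message could be rewritten roughly as follows. The sample line is copied from the thread rather than read from a live sysctl, and the variable names (line, boot, upper_only, c_locale) are only illustrative. Note that inside a bracket expression the class is spelled [[:upper:]].

```shell
#!/bin/sh
# Sample line copied from the thread; on a live system this would come
# from `sysctl kern.boottime` instead.
line='kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov 5 16:18:34 2016'

# [[:upper:]] matches by character type rather than by a range in the
# locale's collation order, so it does not depend on how A-Z is sorted.
boot=$(printf '%s\n' "$line" | sed -e 's/.*\([[:upper:]].*\)$/\1/')
printf '%s\n' "$boot"

# The single-letter test from the thread, reduced to three inputs:
# only the uppercase line should be printed.
upper_only=$(printf '%s\n' a b Z | sed -n -e '/^[[:upper:]]$/p')
printf '%s\n' "$upper_only"

# Alternative for scripts that must keep [A-Z]: force the C locale for
# that one command, as the original poster did with LANG=C.
c_locale=$(printf '%s\n' "$line" | LC_ALL=C sed -e 's/.*\([A-Z].*\)$/\1/')
printf '%s\n' "$c_locale"
```

The character-class form is probably the better default, since it keeps working whatever locale the script inherits. Forcing LC_ALL=C is the smaller change for an existing script, but it applies to everything that command does, not just the range expression.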

Best regards,
Bapt