From: Stefan Ehmann
Date: Sun, 6 Nov 2016 22:14:50 +0100
To: Stefan Bethke, Baptiste Daroussin
Cc: Greg Rivers, freebsd-stable@freebsd.org
Subject: Re: Uppercase RE matching problems in FreeBSD 11
References: <20161106110729.z2px7mzlhcwxvrvu@ivaldir.etoilebsd.net> <29451103-E8DB-4656-A5BB-AEB924A728D6@lassitu.de>
In-Reply-To: <29451103-E8DB-4656-A5BB-AEB924A728D6@lassitu.de>

On 06.11.2016 21:57, Stefan Bethke wrote:
>
>> On 06.11.2016 at 12:07, Baptiste Daroussin wrote:
>>
>> On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
>>> I happened to run an old script today that uses sed(1) to extract
>>> the system boot time from the kern.boottime sysctl MIB. On 11.0
>>> this no longer works as expected: ..
>>> Here sed thinks every lowercase character except for 'a' is
>>> uppercase! This differs from the first test where sed did not
>>> think 'o' is uppercase. Again, the above behaves as expected with
>>> LANG=C.
>>>
>>> Does anyone have any insight into this? This is likely to break a
>>> lot of existing code.
>>>
>>
>> Yes, A-Z only means uppercase in an ASCII-only world. In a Unicode
>> world it means AaBb... Z, because there are way more characters than
>> simple A-Z. In FreeBSD 11 we have a Unicode collation instead of
>> falling back on LC_COLLATE=C, which means ASCII only.
>>
>> For regexp, for example, one should use the classes: :upper: or
>> :lower:.
>
> That is rather surprising.
> Is there a normative reference for the
> treatment of bracket expressions and character classes when using
> locales other than C and/or encodings like UTF-8?

I found an interesting article about this issue in gawk:
https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html

Apparently the meaning of ranges is unspecified outside the "C" locale.

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05
says:
"In the POSIX locale, a range expression represents the set of collating
elements that fall between two elements in the collation sequence,
inclusive. In other locales, a range expression has unspecified
behavior: strictly conforming applications shall not rely on whether the
range expression is valid, or on the set of collating elements matched"
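
To make the quoted definition concrete, here is a rough Python sketch of what "the set of collating elements that fall between two elements in the collation sequence" works out to for A-Z. It only models the idea: the two interleaved orderings are hypothetical stand-ins, not the actual FreeBSD 11 collation tables, and the helper names range_members and lowercase_matched are made up for illustration.

```python
import string


def range_members(collation, lo, hi):
    """Return the elements of `collation` from `lo` to `hi`, inclusive."""
    start, end = collation.index(lo), collation.index(hi)
    return collation[start:end + 1]


def lowercase_matched(collation):
    """Lowercase letters that a range A-Z would pick up in this ordering."""
    return [c for c in range_members(collation, "A", "Z") if c.islower()]


# POSIX ("C") locale: collation order is plain code point order.
c_order = [chr(i) for i in range(128)]

# Hypothetical interleaved orderings (assumptions, for illustration only).
lower_first = [ch for pair in zip(string.ascii_lowercase,
                                  string.ascii_uppercase) for ch in pair]  # a A b B ... z Z
upper_first = [ch for pair in zip(string.ascii_uppercase,
                                  string.ascii_lowercase) for ch in pair]  # A a B b ... Z z

for name, order in [("C", c_order),
                    ("lower_first", lower_first),
                    ("upper_first", upper_first)]:
    print(name, "".join(lowercase_matched(order)))
```

With the lowercase-first ordering, every lowercase letter except 'a' lands inside A-Z, which is the behaviour Greg described. With the uppercase-first ordering ("AaBb... Z"), 'z' is the one left out. In the C ordering no lowercase letter is included at all.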