Date: Sun, 06 Nov 2016 13:26:51 +0100
From: Mark Martinec
To: freebsd-stable@freebsd.org
Subject: Re: Uppercase RE matching problems in FreeBSD 11
Organization: Jozef Stefan Institute
In-Reply-To: <20161106110729.z2px7mzlhcwxvrvu@ivaldir.etoilebsd.net>
References: <20161106110729.z2px7mzlhcwxvrvu@ivaldir.etoilebsd.net>
Message-ID: <71a45ece6ec63bf696edab5b31abdaf5@ijs.si>

2016-11-06 12:07, Baptiste Daroussin wrote:
> Yes, A-Z only means uppercase in an ASCII-only world. In a Unicode
> world it means AaBb... Z, because there are way more characters than
> simple A-Z. In FreeBSD 11 we have a Unicode collation instead of
> falling back on LC_COLLATE=C, which means ASCII only.
>
> For regexps, for example, one should use the classes :upper: or
> :lower:.

It is a good idea to keep LC_COLLATE and LC_NUMERIC (and LC_MONETARY?)
at "C" when LANG or LC_CTYPE is set to something else; otherwise
unexpected things may happen.

   Mark

> On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
>> I happened to run an old script today that uses sed(1) to extract the
>> system boot time from the kern.boottime sysctl MIB.
>> On 11.0 this no longer works as expected:
>>
>> $ sysctl kern.boottime
>> kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov 5 16:18:34 2016
>> $ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
>> v 5 16:18:34 2016
>>
>> sed passes over 'S' and 'N' until it hits 'v', which it considers
>> uppercase apparently. This is with LANG=en_US.UTF-8. If I set LANG=C,
>> it works as expected:
>>
>> $ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
>> Nov 5 16:18:34 2016
>>
>> Testing every lowercase character separately gives even more
>> inconsistent results:
>>
>> $ cat <<! [...]
>> > a
>> > b
>> > c
>> > d
>> > e
>> > f
>> > g
>> > h
>> > i
>> > j
>> > k
>> > l
>> > m
>> > n
>> > o
>> > p
>> > q
>> > r
>> > s
>> > t
>> > u
>> > v
>> > w
>> > x
>> > y
>> > z
>> > !
>> b
>> c
>> d
>> e
>> f
>> g
>> h
>> i
>> j
>> k
>> l
>> m
>> n
>> o
>> p
>> q
>> r
>> s
>> t
>> u
>> v
>> w
>> x
>> y
>> z
>>
>> Here sed thinks every lowercase character except for 'a' is uppercase!
>> This differs from the first test where sed did not think 'o' is
>> uppercase. Again, the above behaves as expected with LANG=C.
>>
>> Does anyone have any insight into this? This is likely to break a lot
>> of existing code.
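
For reference, the two workarounds mentioned in this thread could be
applied to the script roughly as follows. This is only a sketch: the
variable names (line, by_class, by_c_locale) are made up for
illustration, and the sample line is copied from the report above
instead of calling sysctl, so that it can be tried on any system. The
first variant uses the POSIX bracket form of the character class,
[[:upper:]]; the second keeps the A-Z range but forces the C locale for
that one command. LC_ALL=C is used here because it overrides every other
locale variable; keeping LC_COLLATE=C in the environment should have the
same effect on the range as long as LC_ALL is unset.

```shell
# Sample line copied from the report (in real use: sysctl kern.boottime).
line='kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov 5 16:18:34 2016'

# Variant 1: POSIX character class instead of a range expression.
# [[:upper:]] is defined by LC_CTYPE and does not depend on collation order.
by_class=$(printf '%s\n' "$line" | sed -e 's/.*\([[:upper:]].*\)$/\1/')

# Variant 2: keep the original A-Z range, but run sed in the C locale,
# where the range covers exactly the ASCII uppercase letters.
by_c_locale=$(printf '%s\n' "$line" | LC_ALL=C sed -e 's/.*\([A-Z].*\)$/\1/')

printf '%s\n' "$by_class" "$by_c_locale"
```

Both variants should print "Nov 5 16:18:34 2016" for the sample line.
The first is probably preferable in scripts that have to run under
arbitrary user locales, since it does not rely on the environment.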