From: Yuri Pankov
To: Kyle Evans
Cc: FreeBSD Hackers
Date: Tue, 23 Jan 2018 22:10:49 +0300
Subject: Re: libc/regex: r302824 added invalid check breaking collating ranges
List-Id: Technical Discussions relating to FreeBSD

On Tue, Jan 23, 2018 at 08:10:32AM -0600, Kyle Evans wrote:
> On Mon, Jan 22, 2018 at 11:36 PM, Yuri Pankov wrote:
>> On Tue, Jan 23, 2018 at 03:53:19AM +0300, Yuri Pankov wrote:
>>>
>>> (CCing Kyle as he's working on regex at the moment and not because he
>>> broke something)
>>>
>>> Hi,
>>>
>>> r302284 added an invalid check which breaks collating ranges:
>>>
>>> -if (table->__collate_load_error) {
>>> -	(void)REQUIRE((uch)start <= (uch)finish, REG_ERANGE);
>>> +if (table->__collate_load_error || MB_CUR_MAX > 1) {
>>> +	(void)REQUIRE(start <= finish, REG_ERANGE);
>>>
>>> The "MB_CUR_MAX > 1" is wrong, we should be doing proper comparison
>>> according to current locale's collation and not simply comparing the
>>> wchar_t values.
>>
>>
>> After re-reading the specification I now see that what looked like a bug is
>> actually an implementation choice, though the one that needs to be
>> documented. I'll update the man page if anyone is willing to review (and
>> commit) the changes.
>
> Can you point to the section of specification that indicates this is
> OK behavior?
> It doesn't seem desirable, but I see that GNU systems
> will operate in the same manner that we do now.

Here -- http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html:

------------------------------------------------------------------------
In the POSIX locale, a range expression represents the set of collating
elements that fall between two elements in the collation sequence,
inclusive. In other locales, a range expression has unspecified
behavior: strictly conforming applications shall not rely on whether the
range expression is valid, or on the set of collating elements matched.
------------------------------------------------------------------------

I've tried to "fix" what I was seeing as well, and yes, everything
outside of ASCII is ugly, e.g. Cyrillic 'а-я' would match much more than
you would expect if you are doing lookups based on collation order
(capital letters and a lot of other symbols).

So what we have currently looks the least evil to me:

- non-collating ASCII lookups for any locale -- looking at the log for
  regcomp.c there was an attempt to "fix" this, but it was reverted as a
  lot of existing code relies on this;
- non-collating multi-byte locale lookups -- they will work for almost
  all cases, and where they don't, well, POSIX says it's unspecified :D
- collating single-byte locale lookups outside of the ASCII range --
  they make sense as collation order there doesn't seem to mix
  small/caps/other characters together.

What I think we need to do is document this as an implementation choice
in the code and in the regex(3) "IMPLEMENTATION NOTES" so that another
poor soul doesn't come trying to fix it as I did :-)
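
For anyone who wants to see what a given libc actually does with a range, here is a minimal sketch. It assumes nothing beyond the standard regcomp()/regexec() interface; range_matches() is a hypothetical helper name made up for this illustration and is not part of libc or of the regcomp.c change discussed above.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of libc): compile `pattern` as an extended
 * regular expression in the current locale and test it against `subject`.
 * Returns 1 on a match, 0 on no match, and -1 if regcomp() rejects the
 * pattern (for example with REG_ERANGE) or regexec() fails otherwise.
 */
static int
range_matches(const char *pattern, const char *subject)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && subject != NULL);

	/* REG_NOSUB: only the match/no-match result is of interest here. */
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return (-1);
	rc = regexec(&re, subject, 0, NULL, 0);
	regfree(&re);

	if (rc == 0)
		return (1);
	return (rc == REG_NOMATCH ? 0 : -1);
}
```

In the default "C" locale, range_matches("[a-z]", "m") returns 1 and range_matches("[a-z]", "M") returns 0. Calling setlocale(LC_ALL, "") first and then trying a non-ASCII range such as "[а-я]" shows the locale-dependent behavior on a particular system; per the POSIX text quoted above, that result is unspecified and should not be relied on.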