From owner-freebsd-hackers@freebsd.org  Wed Jan 24 04:22:36 2018
Subject: Re: libc/regex: r302824 added invalid check breaking collating ranges
From: Yuri Pankov <yuripv@icloud.com>
To: Kyle Evans <kevans@freebsd.org>
Cc: FreeBSD Hackers <freebsd-hackers@freebsd.org>
References: <a0d9abd8-19b8-cdf6-5451-e184fa182b38@icloud.com>
 <e192f9f7-9d9c-d1e3-8db4-02226ffa23d3@icloud.com>
 <CACNAnaF2aJ5EqLSCLTRkGH+q5SYMmxD1dygGd8NFrkA9STJX8A@mail.gmail.com>
 <f17e0e80-dc27-568a-2896-5cb38ebc470f@icloud.com>
 <CACNAnaGioTGUFpNEakT-b88f_pa7ZaAXBPp09ruTzD8HFWWQfQ@mail.gmail.com>
 <2c9ebf81-c06a-13ed-9cf9-9b42a00c76ee@icloud.com>
Message-id: <6c84d7ad-26bd-ec67-143b-b6e41d6018e6@icloud.com>
Date: Wed, 24 Jan 2018 07:22:20 +0300
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101
 Thunderbird/52.5.2
MIME-version: 1.0
In-reply-to: <2c9ebf81-c06a-13ed-9cf9-9b42a00c76ee@icloud.com>
Content-type: text/plain; charset=utf-8; format=flowed
Content-language: en-US
Content-transfer-encoding: 8bit

On Wed, Jan 24, 2018 at 07:17:27AM +0300, Yuri Pankov wrote:
> On Tue, Jan 23, 2018 at 01:22:04PM -0600, Kyle Evans wrote:
>> On Tue, Jan 23, 2018 at 1:10 PM, Yuri Pankov <yuripv@icloud.com> wrote:
>>> On Tue, Jan 23, 2018 at 08:10:32AM -0600, Kyle Evans wrote:
>>>>
>>>> On Mon, Jan 22, 2018 at 11:36 PM, Yuri Pankov <yuripv@icloud.com> wrote:
>>>>>
>>>>> On Tue, Jan 23, 2018 at 03:53:19AM +0300, Yuri Pankov wrote:
>>>>>>
>>>>>>
>>>>>> (CCing Kyle as he's working on regex at the moment and not because he
>>>>>> broke something)
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> r302284 added an invalid check which breaks collating ranges:
>>>>>>
>>>>>> -if (table->__collate_load_error) {
>>>>>> -    (void)REQUIRE((uch)start <= (uch)finish, REG_ERANGE);
>>>>>> +if (table->__collate_load_error || MB_CUR_MAX > 1) {
>>>>>> +    (void)REQUIRE(start <= finish, REG_ERANGE);
>>>>>>
>>>>>> The "MB_CUR_MAX > 1" is wrong, we should be doing proper comparison
>>>>>> according to current locale's collation and not simply comparing the
>>>>>> wchar_t values.
>>>>>
>>>>>
>>>>>
>>>>> After re-reading the specification I now see that what looked like a bug is
>>>>> actually an implementation choice, though one that needs to be documented.
>>>>> I'll update the man page if anyone is willing to review (and commit) the
>>>>> changes.
>>>>
>>>>
>>>> Can you point to the section of specification that indicates this is
>>>> OK behavior? It doesn't seem desirable, but I see that GNU systems
>>>> will operate in the same manner that we do now.
>>>
>>>
>>> Here --
>>> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html:
>>> ------------------------------------------------------------------------
>>> In the POSIX locale, a range expression represents the set of collating
>>> elements that fall between two elements in the collation sequence,
>>> inclusive. In other locales, a range expression has unspecified behavior:
>>> strictly conforming applications shall not rely on whether the range
>>> expression is valid, or on the set of collating elements matched.
>>> ------------------------------------------------------------------------
>>>
>>
>> Thanks- our current behavior seems reasonable in that context.
>>
>>> I've tried to "fix" what I was seeing as well, and yes, everything outside
>>> of ASCII is ugly, e.g. Cyrillic 'а-я' would match much more than you could
>>> expect if you are doing lookups based on collation order (capital chars and
>>> a lot of other symbols).
>>>
>>> So what we have currently looks the least evil to me:
>>>
>>> - non-collating ASCII lookups for any locale -- looking at the log for
>>>     regcomp.c there was an attempt to "fix" this, but it was reverted as
>>>     a lot of existing code relies on this;
>>> - non-collating multi-byte locale lookups -- they will work for almost
>>>     all cases, and where they don't, well, POSIX says it's unspecified :D
>>> - collating single-byte locale lookups for outside of ASCII range --
>>>     they make sense as collation order there doesn't seem to mix
>>>     small/caps/other characters together.
>>>
>>> What I think we need to do is document this as an implementation choice in
>>> the code and regex(3) "IMPLEMENTATION NOTES" so that another poor soul
>>> doesn't come trying to fix it as I did :-)
>>
>> I agree with your assessment- such a patch would be welcomed,
>> especially before I go and revise a bunch of this for clarification in
>> a future libregex world.
> 
> Actually, it's broken even more than I thought:
> 
> $ echo 'TEST' | LC_ALL=en_US.ISO8859-1 grep '[a-z]'
> TEST
> 
> That's a result of using collation lookup for single-byte locales.  Now I
> think that using collation for range expressions in *any* locale
> is just plain wrong.
> 
> Another side effect of all this "sometimes non-collating" nonsense is
> the inability to deal with multibyte characters whose corresponding wide
> character is in the 128-255 range -- try adding 'µ' (\302\265, U+00B5) to
> the pattern and observe a nice ~1GB core from grep after an endless loop in
> regcomp().  This is due to NC (I guess meaning "non-collating") being
> defined as (CHAR_MAX - CHAR_MIN + 1) which is 256.

Oh, and the lookup above should be case-insensitive (e.g. grep -i) to
reproduce the issue.

> To sum up the above, how about we drop the "non-collating" notion, and just
> use binary wide character comparison everywhere?