Date: Wed, 24 Jan 2018 07:22:20 +0300 From: Yuri Pankov <yuripv@icloud.com> To: Kyle Evans <kevans@freebsd.org> Cc: FreeBSD Hackers <freebsd-hackers@freebsd.org> Subject: Re: libc/regex: r302824 added invalid check breaking collating ranges Message-ID: <6c84d7ad-26bd-ec67-143b-b6e41d6018e6@icloud.com> In-Reply-To: <2c9ebf81-c06a-13ed-9cf9-9b42a00c76ee@icloud.com> References: <a0d9abd8-19b8-cdf6-5451-e184fa182b38@icloud.com> <e192f9f7-9d9c-d1e3-8db4-02226ffa23d3@icloud.com> <CACNAnaF2aJ5EqLSCLTRkGH%2Bq5SYMmxD1dygGd8NFrkA9STJX8A@mail.gmail.com> <f17e0e80-dc27-568a-2896-5cb38ebc470f@icloud.com> <CACNAnaGioTGUFpNEakT-b88f_pa7ZaAXBPp09ruTzD8HFWWQfQ@mail.gmail.com> <2c9ebf81-c06a-13ed-9cf9-9b42a00c76ee@icloud.com>
next in thread | previous in thread | raw e-mail | index | archive | help
On Wed, Jan 24, 2018 at 07:17:27AM +0300, Yuri Pankov wrote: > On Tue, Jan 23, 2018 at 01:22:04PM -0600, Kyle Evans wrote: >> On Tue, Jan 23, 2018 at 1:10 PM, Yuri Pankov <yuripv@icloud.com> wrote: >>> On Tue, Jan 23, 2018 at 08:10:32AM -0600, Kyle Evans wrote: >>>> >>>> On Mon, Jan 22, 2018 at 11:36 PM, Yuri Pankov <yuripv@icloud.com> wrote: >>>>> >>>>> On Tue, Jan 23, 2018 at 03:53:19AM +0300, Yuri Pankov wrote: >>>>>> >>>>>> >>>>>> (CCing Kyle as he's working on regex at the moment and not because he >>>>>> broke something) >>>>>> >>>>>> Hi, >>>>>> >>>>>> r302284 added an invalid check which breaks collating ranges: >>>>>> >>>>>> -if (table->__collate_load_error) { >>>>>> - (void)REQUIRE((uch)start <= (uch)finish, REG_ERANGE); >>>>>> +if (table->__collate_load_error || MB_CUR_MAX > 1) { >>>>>> + (void)REQUIRE(start <= finish, REG_ERANGE); >>>>>> >>>>>> The "MB_CUR_MAX > 1" is wrong, we should be doing proper comparison >>>>>> according to current locale's collation and not simply comparing the >>>>>> wchar_t values. >>>>> >>>>> >>>>> >>>>> After re-reading the specification I now see that what looked like a bug >>>>> is >>>>> actually an implementation choice, though the one that needs to be >>>>> documentated. I'll update the man page if anyone is willing to review >>>>> (and >>>>> commit) the changes. >>>> >>>> >>>> Can you point to the section of specification that indicates this is >>>> OK behavior? It doesn't seem desirable, but I see that GNU systems >>>> will operate in the same manner that we do now. >>> >>> >>> Here -- >>> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html: >>> ------------------------------------------------------------------------ >>> In the POSIX locale, a range expression represents the set of collating >>> elements that fall between two elements in the collation sequence, >>> inclusive. In other locales, a range expression has unspecified behavior: >>> strictly conforming applications shall not rely on whether the range >>> expression is valid, or on the set of collating elements matched. >>> ------------------------------------------------------------------------ >>> >> >> Thanks- our current behavior seems reasonable in that context. >> >>> I've tried to "fix" what I was seeing as well, and yes, everything outside >>> of ASCII is ugly, e.g. Cyrillic 'а-я' would match much more than you could >>> expect if you are doing lookups based on collation order (capital chars and >>> a lot of other symbols). >>> >>> So what we have currently looks the least evil to me: >>> >>> - non-collating ASCII lookups for any locale -- looking at the log for >>> regcomp.c there was an attempt to "fix" this, but it was reverted as >>> a lot of existing code relies on this; >>> - non-collating multi-byte locale lookups -- they will work for almost >>> all cases, and where they don't, well POSIX says it's undefined :D >>> - collating single-byte locale lookups for outside of ASCII range -- >>> they make sense as collation order there doesn't seem to mix >>> small/caps/other characters together. >>> >>> What I think we need to do is document this as implementation choice in the >>> code and regex(3) "IMPLEMENTATION NOTES" so that another poor soul doesn't >>> come trying to fix it as I did :-) >> >> I agree with your assessment- such a patch would be welcomed, >> especially before I go and revise a bunch of this for clarification in >> a future libregex world. > > Actually, it's broken even more than I thought: > > $ echo 'TEST' | LC_ALL=en_US.ISO8859-1 grep '[a-z]' > TEST > > That's a result of using collation lookup for singlebyte locales. Now I > just think that using collations for range expressions in *any* locale > is just plain wrong. > > Another side effect of all this "sometimes non-collating" nonsense is > inability to deal with multibyte characters whose corresponding wide > character is in 128-255 range -- try adding 'µ' (\302\265, U+00B5) to > the pattern and observe a nice ~1GB core from grep after endless loop in > regcomp(). This is due to NC (I guess meaning "non-collating") being > defined as (CHAR_MAX - CHAR_MIN + 1) which is 256. Oh, and the lookup should be case-insensitive above to reproduce the issue. > To sum the above, how about we drop the "non-collating" notion, and just > use binary wide character comparison everywhere?
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?6c84d7ad-26bd-ec67-143b-b6e41d6018e6>