Date: Mon, 6 Jun 2016 08:43:12 -0500
From: Pedro Giffuni <pfg@FreeBSD.org>
To: Andrey Chernov <ache@freebsd.org>, src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r301461 - in head/lib/libc: gen locale regex
Message-ID: <cc6f1905-5cb6-0076-7da4-e1cfdbde857e@FreeBSD.org>
In-Reply-To: <40c481fe-5585-45d2-d4e3-b9988a8198f3@freebsd.org>
References: <201606051912.u55JCqdR036458@repo.freebsd.org> <40c481fe-5585-45d2-d4e3-b9988a8198f3@freebsd.org>

On 06/05/16 14:49, Andrey Chernov wrote:
> On 05.06.2016 22:12, Pedro F. Giffuni wrote:
>> --- head/lib/libc/regex/regcomp.c Sun Jun 5 18:16:33 2016 (r301460)
>> +++ head/lib/libc/regex/regcomp.c Sun Jun 5 19:12:52 2016 (r301461)
>> @@ -821,10 +821,10 @@ p_b_term(struct parse *p, cset *cs)
>> (void)REQUIRE((uch)start <= (uch)finish, REG_ERANGE);
>> CHaddrange(p, cs, start, finish);
>> } else {
>> - (void)REQUIRE(__collate_range_cmp(table, start, finish) <= 0, REG_ERANGE);
>> + (void)REQUIRE(__wcollate_range_cmp(table, start, finish) <= 0, REG_ERANGE);
>> for (i = 0; i <= UCHAR_MAX; i++) {
>> - if ( __collate_range_cmp(table, start, i) <= 0
>> - && __collate_range_cmp(table, i, finish) <= 0
>> + if ( __wcollate_range_cmp(table, start, i) <= 0
>> + && __wcollate_range_cmp(table, i, finish) <= 0
>> )
>> CHadd(p, cs, i);
>> }
>>
>
> As I already mentioned in the PR, regcomp has been broken since someone
> added wchar_t support there. regcomp ranges now work only for the first 256
> wchars of the current locale; notice the loop's upper limit:
> for (i = 0; i <= UCHAR_MAX; i++) {
> In general, ranges in regcomp are now either broken or memory
> eating. We have a bitmask only for the first 256 wchars; all others are
> added to the range literally. Imagine what happens if someone specifies
> the full Unicode range in a regexp.
>
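
To make the 256-value limit concrete, here is a minimal sketch of a loop
with the same shape as the one in the quoted diff. It is not the regcomp.c
code: struct toy_cset, toy_range_cmp() and toy_add_range() are hypothetical
names, and toy_range_cmp() uses plain code point order as a stand-in for
__wcollate_range_cmp(), whose real behavior depends on the locale.

```c
#include <limits.h>
#include <stdint.h>
#include <string.h>
#include <wchar.h>

/* Toy character set: a bitmap covering only values 0..UCHAR_MAX. */
struct toy_cset {
	uint8_t bmp[(UCHAR_MAX + 1) / 8];
};

/* Hypothetical stand-in for __wcollate_range_cmp(): code point order. */
static int
toy_range_cmp(wchar_t a, wchar_t b)
{
	return ((a > b) - (a < b));
}

/*
 * Same shape as the loop in the quoted diff: candidates above UCHAR_MAX
 * are never examined.  Returns the number of members added to the set.
 */
static int
toy_add_range(struct toy_cset *cs, wchar_t start, wchar_t finish)
{
	int added = 0;

	for (int i = 0; i <= UCHAR_MAX; i++) {
		if (toy_range_cmp(start, (wchar_t)i) <= 0 &&
		    toy_range_cmp((wchar_t)i, finish) <= 0) {
			cs->bmp[i / 8] |= (uint8_t)(1u << (i % 8));
			added++;
		}
	}
	return (added);
}
```

With this sketch, a range such as [a-z] gets its members, while a range
whose endpoints both lie above 255 (for example U+0430 to U+044F) adds
nothing at all, because no candidate in 0..UCHAR_MAX falls inside it.
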
> The proper fix would be adding a bitmask for the whole Unicode range, and
> even in that case regcomp attempting to use collation in ranges would be
> _very_slow_, since it needs to check all Unicode chars in its
> for (i = 0; i <= Max_Unicode_wchar; i++) {
> loop.
>
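
A rough back-of-the-envelope for the cost described above, as a sketch
only. It assumes Max_Unicode_wchar means the highest Unicode code point,
0x10FFFF (the message does not define it), and that each candidate costs
at most two collation comparisons, as in the quoted loop. The names
TOY_UNICODE_MAX, toy_full_bitmap_bytes() and toy_range_scan_cmps() are
hypothetical.

```c
#include <stddef.h>

/* Assumption: the highest Unicode code point. */
#define TOY_UNICODE_MAX 0x10FFFFUL

/* Bytes needed for a one-bit-per-code-point bitmap in a single set. */
static size_t
toy_full_bitmap_bytes(void)
{
	return ((size_t)((TOY_UNICODE_MAX + 1) / 8));
}

/* Upper bound on comparisons per range: two per candidate code point. */
static unsigned long
toy_range_scan_cmps(void)
{
	return (2 * (TOY_UNICODE_MAX + 1));
}
```

Under those assumptions each bracket expression would carry a 139264-byte
(136 KiB) bitmap, and each collation-ordered range would need up to
2228224 comparisons at compile time.
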
> Better to stop pretending that we are able to support collation in
> ranges, since POSIX cares only about its own locale here:
> "In the POSIX locale, a range expression represents the set of collating
> elements that fall between two elements in the collation sequence,
> inclusive. In other locales, a range expression has unspecified
> behavior: strictly conforming applications shall not rely on whether the
> range expression is valid, or on the set of collating elements matched."
>
> Until a bitmask for the whole Unicode range is implemented (if ever),
> better to stop pretending to honor collation order: we just can't do it
> with wchars now. Instead, do what NetBSD/OpenBSD do (using wchar_t). It
> does not prevent memory eating on big ranges (a bitmask is needed, see
> above), but at least it fixes the problem that only the first 256 wchars
> are considered.
>
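
For illustration, comparing wchar_t values directly could look like the
sketch below. This is not taken from the NetBSD or OpenBSD sources; it is
only my reading of "using wchar_t", and toy_range_valid() and
toy_in_range() are hypothetical helpers. It has the same form as the
start <= finish check in the first branch of the quoted diff.

```c
#include <stdbool.h>
#include <wchar.h>

/* Endpoint check; a false result would correspond to REG_ERANGE. */
static bool
toy_range_valid(wchar_t start, wchar_t finish)
{
	return (start <= finish);
}

/* Membership by wchar_t value, independent of the locale's collation. */
static bool
toy_in_range(wchar_t c, wchar_t start, wchar_t finish)
{
	return (start <= c && c <= finish);
}
```

Because nothing here loops over candidates, the test works the same for
values above UCHAR_MAX as for those below it. How the matched range is
stored, and so how much memory it takes, is a separate question that this
sketch does not address.
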
Sadly, regex is one part of the system that could use a maintainer :(
I have been forced to look at it more than I'd like to, but I don't
really use the collation support at all.
Pedro.
