From owner-freebsd-hackers@freebsd.org  Tue Jan 23 06:53:41 2018
Return-Path: <owner-freebsd-hackers@freebsd.org>
Delivered-To: freebsd-hackers@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 213C9EB4E8C
 for <freebsd-hackers@mailman.ysv.freebsd.org>;
 Tue, 23 Jan 2018 06:53:41 +0000 (UTC)
 (envelope-from byond.lenox@gmail.com)
Received: from mail-it0-f41.google.com (mail-it0-f41.google.com
 [209.85.214.41])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id E20AF182C
 for <freebsd-hackers@freebsd.org>; Tue, 23 Jan 2018 06:53:40 +0000 (UTC)
 (envelope-from byond.lenox@gmail.com)
Received: by mail-it0-f41.google.com with SMTP id x42so12864895ita.4
 for <freebsd-hackers@freebsd.org>; Mon, 22 Jan 2018 22:53:40 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:in-reply-to:references:from:date
 :message-id:subject:to:cc:content-transfer-encoding;
 bh=gsnQvoQ3IsorqMoKazL7eTOKWVOScE7At1dKlboyD9k=;
 b=Tk6Z2AEfJB0aMEuTPG59uDUK6ldUNPT5zd0jqd0f5M0Z6dz9nohwwV7WQtc1WI+Urm
 fCQW0wcevFdygIBb+K0aCQ5i4JgijR3INfjeudxJuEe4IHpsZIS6REm2F93J8Vfn7IDA
 E9MULHdE7T9quCduCJm0P0TAooYHiEGO0Ppkpji8buym+wd16CvptA6wsQmqNv5LWIJT
 tmqmU5XqpHqd72wnMqtD4qgpr6toBNs3CeKeKetdv4SwCamN1PG2gGW0/LJb2TXz80Rl
 D0j6FuKkkk//jaSYLeolEIlBhxfuCGJWhCGxx2/qKSsgnneLXsDUjkW6kVQes4AoU1z/
 PxYw==
X-Gm-Message-State: AKwxytenwBMdQR9VLgl6/eLd/QgnOuN5ZQknl2pwGmU+/4r+OYrgZJxr
 +u6qYVMzX5xJ+jddYkVJ62k4bZ0b
X-Google-Smtp-Source: AH8x226GULjLmy8rM+Tj4DEyxlGQWwnc+viZlUuRaFYZx6CTIZGsaURtyUbUEsL7vDyLLHypYUUA9w==
X-Received: by 10.36.190.15 with SMTP id i15mr2368042itf.109.1516686868550;
 Mon, 22 Jan 2018 21:54:28 -0800 (PST)
Received: from mail-io0-f178.google.com (mail-io0-f178.google.com.
 [209.85.223.178])
 by smtp.gmail.com with ESMTPSA id l69sm9084248ioi.11.2018.01.22.21.54.28
 for <freebsd-hackers@freebsd.org>
 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
 Mon, 22 Jan 2018 21:54:28 -0800 (PST)
Received: by mail-io0-f178.google.com with SMTP id d13so7480218iog.5
 for <freebsd-hackers@freebsd.org>; Mon, 22 Jan 2018 21:54:28 -0800 (PST)
X-Received: by 10.107.50.210 with SMTP id y201mr352171ioy.224.1516686867978;
 Mon, 22 Jan 2018 21:54:27 -0800 (PST)
MIME-Version: 1.0
Received: by 10.107.157.12 with HTTP; Mon, 22 Jan 2018 21:54:07 -0800 (PST)
In-Reply-To: <a0d9abd8-19b8-cdf6-5451-e184fa182b38@icloud.com>
References: <a0d9abd8-19b8-cdf6-5451-e184fa182b38@icloud.com>
From: Kyle Evans <kevans@freebsd.org>
Date: Mon, 22 Jan 2018 23:54:07 -0600
X-Gmail-Original-Message-ID: <CACNAnaF43C77dv+WfxFSDGCnSqhm7ht-sh-V4825jPnKFXowxg@mail.gmail.com>
Message-ID: <CACNAnaF43C77dv+WfxFSDGCnSqhm7ht-sh-V4825jPnKFXowxg@mail.gmail.com>
Subject: Re: libc/regex: r302824 added invalid check breaking collating ranges
To: Yuri Pankov <yuripv@icloud.com>
Cc: freebsd-hackers <freebsd-hackers@freebsd.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.25
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
 <freebsd-hackers.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-hackers>, 
 <mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers/>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-hackers>, 
 <mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 23 Jan 2018 06:53:41 -0000

On Mon, Jan 22, 2018 at 6:53 PM, Yuri Pankov <yuripv@icloud.com> wrote:
> (CCing Kyle as he's working on regex at the moment and not because he bro=
ke
> something)

Phew, brief moment of panic until I read this line and went back to
actually look at the full revision number you wrote. =3Dp

I note here that I know incredibly little about collation issues, so
feel free to enlighten me.

> Hi,
>
> r302284 added an invalid check which breaks collating ranges:
>
> -if (table->__collate_load_error) {
> -    (void)REQUIRE((uch)start <=3D (uch)finish, REG_ERANGE);
> +if (table->__collate_load_error || MB_CUR_MAX > 1) {
> +    (void)REQUIRE(start <=3D finish, REG_ERANGE);
>
> The "MB_CUR_MAX > 1" is wrong, we should be doing proper comparison
> according to current locale's collation and not simply comparing the wcha=
r_t
> values.

Right; that seems like almost the complete opposite of when you might
be able to do a direct comparison like this. As you mention, though,
even MB_CUR_MAX =3D=3D 1 wouldn't necessarily be safe.

> Example -- see Table 1 in http://www.unicode.org/reports/tr10/:
>
> Let's try Swedish collation:
> $ echo 'test' | LC_COLLATE=3Dse_SE.UTF-8 grep '[=C3=B6-z]'
> grep: invalid character range
> $ echo 'test' | LC_COLLATE=3Dse_SE.UTF-8 grep '[z-=C3=B6]'
>
> OK, the above seems to be correct, '=C3=B6' > 'z' in Swedish collation, b=
ut we
> just got lucky here, as wchar_t comparison gives us the same result.
>
> Now German one:
> $ echo 'test' | LC_COLLATE=3Dde_DE.UTF-8 grep '[=C3=B6-z]'
> grep: invalid character range
> $ echo 'test' | LC_COLLATE=3Dde_DE.UTF-8 grep '[z-=C3=B6]'
>
> Same, but according to the table, '=C3=B6' < 'z' in German collation!
>
> I think the fix here would be to drop the "if (table->__collate_load_erro=
r
> || MB_CUR_MAX > 1)" block entirely as we no longer use the "table" so
> there's no point in getting it and checking error, wcscoll() which would =
be
> called eventually in p_range_cmp() does the table handling itself, and we
> can't use the direct comparison for anything other than 'C' locale (not s=
ure
> if it's applicable even there).

I've arrived at the same conclusion, and it makes me both sad and
happy at the same time. I had a lot more written here, but erased it
because I clearly fail to read some of this stuff.

I have a proposed patch at [1] that addresses this in theory, but the
patch is not right and breaks things in en_US.UTF-8 at the very least.
`echo "TEST" | sed -e 's/[s-t]/x/g'` outputs "TExT"; debug output
shows that it's adding to the character set: 'S', 's', 't', and
U+00DF. I've not been able to dig any deeper and figure out why
p_range_comp/wcscoll place these things in the range of [s-t] (115 -
116) with en_US.UTF-8 -- the answer isn't immediately obvious to me.

Do note that this patch might not apply against a clean -head, but the
spirit of it is still there. It implements your suggestion, but
rewrites the loop in the (former) second branch to keep track of
contiguous ranges of characters. This is a (pehaps misguided) attempt
to take more of the performance hit at expr compilation time rather
than execution, reducing cs->wides as much as possible so that we
don't have to do as many comparisons in not worst-case scenarios.

I've also rearranged the loops in regex2.h:CHIN1 so that we check the
smaller/faster comparisons before checking the larger character set at
wide. I don't know if it really makes for any of a performance
difference, but it seems worth it to at least benchmark for some more
complex character sets one can build in different locales with
mixtures of actually contiguous ranges (in the current locale) and
randomly distributed non-contiguous characters.

[1] https://people.freebsd.org/~kevans/regex-collation.diff