From owner-freebsd-hackers@freebsd.org  Tue Jan 23 07:15:17 2018
Return-Path: <owner-freebsd-hackers@freebsd.org>
Delivered-To: freebsd-hackers@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 3B921EB658A
 for <freebsd-hackers@mailman.ysv.freebsd.org>;
 Tue, 23 Jan 2018 07:15:17 +0000 (UTC)
 (envelope-from byond.lenox@gmail.com)
Received: from mail-io0-f182.google.com (mail-io0-f182.google.com
 [209.85.223.182])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id E70393C38
 for <freebsd-hackers@freebsd.org>; Tue, 23 Jan 2018 07:15:16 +0000 (UTC)
 (envelope-from byond.lenox@gmail.com)
Received: by mail-io0-f182.google.com with SMTP id f89so12252452ioj.4
 for <freebsd-hackers@freebsd.org>; Mon, 22 Jan 2018 23:15:16 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:in-reply-to:references:from:date
 :message-id:subject:to:cc;
 bh=CQXGmdXom7YSMziybkk9hSTd357uV4qcVpZLNrvSj98=;
 b=J3jdxenf0rRQV6BH2dmcyJzl06T89638h1B/j5OQisgDmCKgodZFU2Cjhc09slvajY
 h9RprHQI8vzNCXfNxmxPrh6DuD5u7pJXk+vTINUE9Lmsh8EFXbIifba7D+mipzY5Hd47
 HqGcNPqsBrX9zsZ/omcX/VP777YvlsucCAg6ESDcMQ/cMUioG59nRFiUEqrAzO7BQc8A
 bQ/wg5Ai4h7PxLdZcAn12HKCMxkTk1OXB0J1j5Xjc2FSEoEMxSyZ5OCkks6peuHJLg2z
 EtrYFsTt0WEo3qNQoSLYPUIpSkGpGhtGC5SsAnLVDwTxtPWf9laSxhhKJ0gtJdV0xBL2
 6awA==
X-Gm-Message-State: AKwxyteFGMDfsqfEovqlLtxpXMbSk9V+NSdpod5yu98eUFS6Fz8LbFsd
 tEQag6UgaT95yZ5/UCXLWYWGNBE6
X-Google-Smtp-Source: AH8x224WQjwEdAJFHvH6p5GH9ZxciMw2d293g40dOLvQLAOMJn5Gu+NhTBQN8MeFxPgzdL4nYCuvAA==
X-Received: by 10.107.139.78 with SMTP id n75mr2434323iod.252.1516688076691;
 Mon, 22 Jan 2018 22:14:36 -0800 (PST)
Received: from mail-io0-f180.google.com (mail-io0-f180.google.com.
 [209.85.223.180])
 by smtp.gmail.com with ESMTPSA id d189sm8772276iog.77.2018.01.22.22.14.36
 for <freebsd-hackers@freebsd.org>
 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
 Mon, 22 Jan 2018 22:14:36 -0800 (PST)
Received: by mail-io0-f180.google.com with SMTP id m11so12097573iob.2
 for <freebsd-hackers@freebsd.org>; Mon, 22 Jan 2018 22:14:36 -0800 (PST)
X-Received: by 10.107.53.221 with SMTP id k90mr2377237ioo.6.1516688076195;
 Mon, 22 Jan 2018 22:14:36 -0800 (PST)
MIME-Version: 1.0
Received: by 10.107.157.12 with HTTP; Mon, 22 Jan 2018 22:14:35 -0800 (PST)
Received: by 10.107.157.12 with HTTP; Mon, 22 Jan 2018 22:14:35 -0800 (PST)
In-Reply-To: <CACNAnaF43C77dv+WfxFSDGCnSqhm7ht-sh-V4825jPnKFXowxg@mail.gmail.com>
References: <a0d9abd8-19b8-cdf6-5451-e184fa182b38@icloud.com>
 <CACNAnaF43C77dv+WfxFSDGCnSqhm7ht-sh-V4825jPnKFXowxg@mail.gmail.com>
From: Kyle Evans <kevans@freebsd.org>
Date: Tue, 23 Jan 2018 00:14:35 -0600
X-Gmail-Original-Message-ID: <CACNAnaHxXMiBk1+o0S6CW_4WgS-iUbjd+891AmZ4RQmqA3R-yw@mail.gmail.com>
Message-ID: <CACNAnaHxXMiBk1+o0S6CW_4WgS-iUbjd+891AmZ4RQmqA3R-yw@mail.gmail.com>
Subject: Re: libc/regex: r302824 added invalid check breaking collating ranges
To: Yuri Pankov <yuripv@icloud.com>
Cc: FreeBSD Hackers <freebsd-hackers@freebsd.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Content-Filtered-By: Mailman/MimeDel 2.1.25
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.25
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
 <freebsd-hackers.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-hackers>, 
 <mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers/>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-hackers>, 
 <mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 23 Jan 2018 07:15:17 -0000

(apologies for broken quoting, on mobile)

On Jan 22, 2018 11:54 PM, "Kyle Evans" <kevans@freebsd.org> wrote:

On Mon, Jan 22, 2018 at 6:53 PM, Yuri Pankov <yuripv@icloud.com> wrote:
> (CCing Kyle as he's working on regex at the moment and not because he
broke
> something)

Phew, brief moment of panic until I read this line and went back to
actually look at the full revision number you wrote. =3Dp

I note here that I know incredibly little about collation issues, so
feel free to enlighten me.

> Hi,
>
> r302284 added an invalid check which breaks collating ranges:
>
> -if (table->__collate_load_error) {
> -    (void)REQUIRE((uch)start <=3D (uch)finish, REG_ERANGE);
> +if (table->__collate_load_error || MB_CUR_MAX > 1) {
> +    (void)REQUIRE(start <=3D finish, REG_ERANGE);
>
> The "MB_CUR_MAX > 1" is wrong, we should be doing proper comparison
> according to current locale's collation and not simply comparing the
wchar_t
> values.

Right; that seems like almost the complete opposite of when you might
be able to do a direct comparison like this. As you mention, though,
even MB_CUR_MAX =3D=3D 1 wouldn't necessarily be safe.

> Example -- see Table 1 in http://www.unicode.org/reports/tr10/:
>
> Let's try Swedish collation:
> $ echo 'test' | LC_COLLATE=3Dse_SE.UTF-8 grep '[=C3=B6-z]'
> grep: invalid character range
> $ echo 'test' | LC_COLLATE=3Dse_SE.UTF-8 grep '[z-=C3=B6]'
>
> OK, the above seems to be correct, '=C3=B6' > 'z' in Swedish collation, b=
ut we
> just got lucky here, as wchar_t comparison gives us the same result.
>
> Now German one:
> $ echo 'test' | LC_COLLATE=3Dde_DE.UTF-8 grep '[=C3=B6-z]'
> grep: invalid character range
> $ echo 'test' | LC_COLLATE=3Dde_DE.UTF-8 grep '[z-=C3=B6]'
>
> Same, but according to the table, '=C3=B6' < 'z' in German collation!
>
> I think the fix here would be to drop the "if (table->__collate_load_erro=
r
> || MB_CUR_MAX > 1)" block entirely as we no longer use the "table" so
> there's no point in getting it and checking error, wcscoll() which would
be
> called eventually in p_range_cmp() does the table handling itself, and we
> can't use the direct comparison for anything other than 'C' locale (not
sure
> if it's applicable even there).

I've arrived at the same conclusion, and it makes me both sad and
happy at the same time. I had a lot more written here, but erased it
because I clearly fail to read some of this stuff.

I have a proposed patch at [1] that addresses this in theory, but the
patch is not right and breaks things in en_US.UTF-8 at the very least.
`echo "TEST" | sed -e 's/[s-t]/x/g'` outputs "TExT"; debug output
shows that it's adding to the character set: 'S', 's', 't', and
U+00DF. I've not been able to dig any deeper and figure out why
p_range_comp/wcscoll place these things in the range of [s-t] (115 -
116) with en_US.UTF-8 -- the answer isn't immediately obvious to me.


I see now where I goofed this up- please disregard.


Do note that this patch might not apply against a clean -head, but the
spirit of it is still there. It implements your suggestion, but
rewrites the loop in the (former) second branch to keep track of
contiguous ranges of characters. This is a (pehaps misguided) attempt
to take more of the performance hit at expr compilation time rather
than execution, reducing cs->wides as much as possible so that we
don't have to do as many comparisons in not worst-case scenarios.

I've also rearranged the loops in regex2.h:CHIN1 so that we check the
smaller/faster comparisons before checking the larger character set at
wide. I don't know if it really makes for any of a performance
difference, but it seems worth it to at least benchmark for some more
complex character sets one can build in different locales with
mixtures of actually contiguous ranges (in the current locale) and
randomly distributed non-contiguous characters.

[1] https://people.freebsd.org/~kevans/regex-collation.diff