From owner-freebsd-hackers@freebsd.org Tue Jan 23 07:15:17 2018 Return-Path: Delivered-To: freebsd-hackers@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 3B921EB658A for ; Tue, 23 Jan 2018 07:15:17 +0000 (UTC) (envelope-from byond.lenox@gmail.com) Received: from mail-io0-f182.google.com (mail-io0-f182.google.com [209.85.223.182]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id E70393C38 for ; Tue, 23 Jan 2018 07:15:16 +0000 (UTC) (envelope-from byond.lenox@gmail.com) Received: by mail-io0-f182.google.com with SMTP id f89so12252452ioj.4 for ; Mon, 22 Jan 2018 23:15:16 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=CQXGmdXom7YSMziybkk9hSTd357uV4qcVpZLNrvSj98=; b=J3jdxenf0rRQV6BH2dmcyJzl06T89638h1B/j5OQisgDmCKgodZFU2Cjhc09slvajY h9RprHQI8vzNCXfNxmxPrh6DuD5u7pJXk+vTINUE9Lmsh8EFXbIifba7D+mipzY5Hd47 HqGcNPqsBrX9zsZ/omcX/VP777YvlsucCAg6ESDcMQ/cMUioG59nRFiUEqrAzO7BQc8A bQ/wg5Ai4h7PxLdZcAn12HKCMxkTk1OXB0J1j5Xjc2FSEoEMxSyZ5OCkks6peuHJLg2z EtrYFsTt0WEo3qNQoSLYPUIpSkGpGhtGC5SsAnLVDwTxtPWf9laSxhhKJ0gtJdV0xBL2 6awA== X-Gm-Message-State: AKwxyteFGMDfsqfEovqlLtxpXMbSk9V+NSdpod5yu98eUFS6Fz8LbFsd tEQag6UgaT95yZ5/UCXLWYWGNBE6 X-Google-Smtp-Source: AH8x224WQjwEdAJFHvH6p5GH9ZxciMw2d293g40dOLvQLAOMJn5Gu+NhTBQN8MeFxPgzdL4nYCuvAA== X-Received: by 10.107.139.78 with SMTP id n75mr2434323iod.252.1516688076691; Mon, 22 Jan 2018 22:14:36 -0800 (PST) Received: from mail-io0-f180.google.com (mail-io0-f180.google.com. [209.85.223.180]) by smtp.gmail.com with ESMTPSA id d189sm8772276iog.77.2018.01.22.22.14.36 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 22 Jan 2018 22:14:36 -0800 (PST) Received: by mail-io0-f180.google.com with SMTP id m11so12097573iob.2 for ; Mon, 22 Jan 2018 22:14:36 -0800 (PST) X-Received: by 10.107.53.221 with SMTP id k90mr2377237ioo.6.1516688076195; Mon, 22 Jan 2018 22:14:36 -0800 (PST) MIME-Version: 1.0 Received: by 10.107.157.12 with HTTP; Mon, 22 Jan 2018 22:14:35 -0800 (PST) Received: by 10.107.157.12 with HTTP; Mon, 22 Jan 2018 22:14:35 -0800 (PST) In-Reply-To: References: From: Kyle Evans Date: Tue, 23 Jan 2018 00:14:35 -0600 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: libc/regex: r302824 added invalid check breaking collating ranges To: Yuri Pankov Cc: FreeBSD Hackers Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.25 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.25 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 23 Jan 2018 07:15:17 -0000 (apologies for broken quoting, on mobile) On Jan 22, 2018 11:54 PM, "Kyle Evans" wrote: On Mon, Jan 22, 2018 at 6:53 PM, Yuri Pankov wrote: > (CCing Kyle as he's working on regex at the moment and not because he broke > something) Phew, brief moment of panic until I read this line and went back to actually look at the full revision number you wrote. =3Dp I note here that I know incredibly little about collation issues, so feel free to enlighten me. > Hi, > > r302284 added an invalid check which breaks collating ranges: > > -if (table->__collate_load_error) { > - (void)REQUIRE((uch)start <=3D (uch)finish, REG_ERANGE); > +if (table->__collate_load_error || MB_CUR_MAX > 1) { > + (void)REQUIRE(start <=3D finish, REG_ERANGE); > > The "MB_CUR_MAX > 1" is wrong, we should be doing proper comparison > according to current locale's collation and not simply comparing the wchar_t > values. Right; that seems like almost the complete opposite of when you might be able to do a direct comparison like this. As you mention, though, even MB_CUR_MAX =3D=3D 1 wouldn't necessarily be safe. > Example -- see Table 1 in http://www.unicode.org/reports/tr10/: > > Let's try Swedish collation: > $ echo 'test' | LC_COLLATE=3Dse_SE.UTF-8 grep '[=C3=B6-z]' > grep: invalid character range > $ echo 'test' | LC_COLLATE=3Dse_SE.UTF-8 grep '[z-=C3=B6]' > > OK, the above seems to be correct, '=C3=B6' > 'z' in Swedish collation, b= ut we > just got lucky here, as wchar_t comparison gives us the same result. > > Now German one: > $ echo 'test' | LC_COLLATE=3Dde_DE.UTF-8 grep '[=C3=B6-z]' > grep: invalid character range > $ echo 'test' | LC_COLLATE=3Dde_DE.UTF-8 grep '[z-=C3=B6]' > > Same, but according to the table, '=C3=B6' < 'z' in German collation! > > I think the fix here would be to drop the "if (table->__collate_load_erro= r > || MB_CUR_MAX > 1)" block entirely as we no longer use the "table" so > there's no point in getting it and checking error, wcscoll() which would be > called eventually in p_range_cmp() does the table handling itself, and we > can't use the direct comparison for anything other than 'C' locale (not sure > if it's applicable even there). I've arrived at the same conclusion, and it makes me both sad and happy at the same time. I had a lot more written here, but erased it because I clearly fail to read some of this stuff. I have a proposed patch at [1] that addresses this in theory, but the patch is not right and breaks things in en_US.UTF-8 at the very least. `echo "TEST" | sed -e 's/[s-t]/x/g'` outputs "TExT"; debug output shows that it's adding to the character set: 'S', 's', 't', and U+00DF. I've not been able to dig any deeper and figure out why p_range_comp/wcscoll place these things in the range of [s-t] (115 - 116) with en_US.UTF-8 -- the answer isn't immediately obvious to me. I see now where I goofed this up- please disregard. Do note that this patch might not apply against a clean -head, but the spirit of it is still there. It implements your suggestion, but rewrites the loop in the (former) second branch to keep track of contiguous ranges of characters. This is a (pehaps misguided) attempt to take more of the performance hit at expr compilation time rather than execution, reducing cs->wides as much as possible so that we don't have to do as many comparisons in not worst-case scenarios. I've also rearranged the loops in regex2.h:CHIN1 so that we check the smaller/faster comparisons before checking the larger character set at wide. I don't know if it really makes for any of a performance difference, but it seems worth it to at least benchmark for some more complex character sets one can build in different locales with mixtures of actually contiguous ranges (in the current locale) and randomly distributed non-contiguous characters. [1] https://people.freebsd.org/~kevans/regex-collation.diff