From owner-freebsd-hackers@freebsd.org Tue Jan 23 06:53:41 2018 Return-Path: Delivered-To: freebsd-hackers@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 213C9EB4E8C for ; Tue, 23 Jan 2018 06:53:41 +0000 (UTC) (envelope-from byond.lenox@gmail.com) Received: from mail-it0-f41.google.com (mail-it0-f41.google.com [209.85.214.41]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id E20AF182C for ; Tue, 23 Jan 2018 06:53:40 +0000 (UTC) (envelope-from byond.lenox@gmail.com) Received: by mail-it0-f41.google.com with SMTP id x42so12864895ita.4 for ; Mon, 22 Jan 2018 22:53:40 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=gsnQvoQ3IsorqMoKazL7eTOKWVOScE7At1dKlboyD9k=; b=Tk6Z2AEfJB0aMEuTPG59uDUK6ldUNPT5zd0jqd0f5M0Z6dz9nohwwV7WQtc1WI+Urm fCQW0wcevFdygIBb+K0aCQ5i4JgijR3INfjeudxJuEe4IHpsZIS6REm2F93J8Vfn7IDA E9MULHdE7T9quCduCJm0P0TAooYHiEGO0Ppkpji8buym+wd16CvptA6wsQmqNv5LWIJT tmqmU5XqpHqd72wnMqtD4qgpr6toBNs3CeKeKetdv4SwCamN1PG2gGW0/LJb2TXz80Rl D0j6FuKkkk//jaSYLeolEIlBhxfuCGJWhCGxx2/qKSsgnneLXsDUjkW6kVQes4AoU1z/ PxYw== X-Gm-Message-State: AKwxytenwBMdQR9VLgl6/eLd/QgnOuN5ZQknl2pwGmU+/4r+OYrgZJxr +u6qYVMzX5xJ+jddYkVJ62k4bZ0b X-Google-Smtp-Source: AH8x226GULjLmy8rM+Tj4DEyxlGQWwnc+viZlUuRaFYZx6CTIZGsaURtyUbUEsL7vDyLLHypYUUA9w== X-Received: by 10.36.190.15 with SMTP id i15mr2368042itf.109.1516686868550; Mon, 22 Jan 2018 21:54:28 -0800 (PST) Received: from mail-io0-f178.google.com (mail-io0-f178.google.com. [209.85.223.178]) by smtp.gmail.com with ESMTPSA id l69sm9084248ioi.11.2018.01.22.21.54.28 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 22 Jan 2018 21:54:28 -0800 (PST) Received: by mail-io0-f178.google.com with SMTP id d13so7480218iog.5 for ; Mon, 22 Jan 2018 21:54:28 -0800 (PST) X-Received: by 10.107.50.210 with SMTP id y201mr352171ioy.224.1516686867978; Mon, 22 Jan 2018 21:54:27 -0800 (PST) MIME-Version: 1.0 Received: by 10.107.157.12 with HTTP; Mon, 22 Jan 2018 21:54:07 -0800 (PST) In-Reply-To: References: From: Kyle Evans Date: Mon, 22 Jan 2018 23:54:07 -0600 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: libc/regex: r302824 added invalid check breaking collating ranges To: Yuri Pankov Cc: freebsd-hackers Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.25 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 23 Jan 2018 06:53:41 -0000 On Mon, Jan 22, 2018 at 6:53 PM, Yuri Pankov wrote: > (CCing Kyle as he's working on regex at the moment and not because he bro= ke > something) Phew, brief moment of panic until I read this line and went back to actually look at the full revision number you wrote. =3Dp I note here that I know incredibly little about collation issues, so feel free to enlighten me. > Hi, > > r302284 added an invalid check which breaks collating ranges: > > -if (table->__collate_load_error) { > - (void)REQUIRE((uch)start <=3D (uch)finish, REG_ERANGE); > +if (table->__collate_load_error || MB_CUR_MAX > 1) { > + (void)REQUIRE(start <=3D finish, REG_ERANGE); > > The "MB_CUR_MAX > 1" is wrong, we should be doing proper comparison > according to current locale's collation and not simply comparing the wcha= r_t > values. Right; that seems like almost the complete opposite of when you might be able to do a direct comparison like this. As you mention, though, even MB_CUR_MAX =3D=3D 1 wouldn't necessarily be safe. > Example -- see Table 1 in http://www.unicode.org/reports/tr10/: > > Let's try Swedish collation: > $ echo 'test' | LC_COLLATE=3Dse_SE.UTF-8 grep '[=C3=B6-z]' > grep: invalid character range > $ echo 'test' | LC_COLLATE=3Dse_SE.UTF-8 grep '[z-=C3=B6]' > > OK, the above seems to be correct, '=C3=B6' > 'z' in Swedish collation, b= ut we > just got lucky here, as wchar_t comparison gives us the same result. > > Now German one: > $ echo 'test' | LC_COLLATE=3Dde_DE.UTF-8 grep '[=C3=B6-z]' > grep: invalid character range > $ echo 'test' | LC_COLLATE=3Dde_DE.UTF-8 grep '[z-=C3=B6]' > > Same, but according to the table, '=C3=B6' < 'z' in German collation! > > I think the fix here would be to drop the "if (table->__collate_load_erro= r > || MB_CUR_MAX > 1)" block entirely as we no longer use the "table" so > there's no point in getting it and checking error, wcscoll() which would = be > called eventually in p_range_cmp() does the table handling itself, and we > can't use the direct comparison for anything other than 'C' locale (not s= ure > if it's applicable even there). I've arrived at the same conclusion, and it makes me both sad and happy at the same time. I had a lot more written here, but erased it because I clearly fail to read some of this stuff. I have a proposed patch at [1] that addresses this in theory, but the patch is not right and breaks things in en_US.UTF-8 at the very least. `echo "TEST" | sed -e 's/[s-t]/x/g'` outputs "TExT"; debug output shows that it's adding to the character set: 'S', 's', 't', and U+00DF. I've not been able to dig any deeper and figure out why p_range_comp/wcscoll place these things in the range of [s-t] (115 - 116) with en_US.UTF-8 -- the answer isn't immediately obvious to me. Do note that this patch might not apply against a clean -head, but the spirit of it is still there. It implements your suggestion, but rewrites the loop in the (former) second branch to keep track of contiguous ranges of characters. This is a (pehaps misguided) attempt to take more of the performance hit at expr compilation time rather than execution, reducing cs->wides as much as possible so that we don't have to do as many comparisons in not worst-case scenarios. I've also rearranged the loops in regex2.h:CHIN1 so that we check the smaller/faster comparisons before checking the larger character set at wide. I don't know if it really makes for any of a performance difference, but it seems worth it to at least benchmark for some more complex character sets one can build in different locales with mixtures of actually contiguous ranges (in the current locale) and randomly distributed non-contiguous characters. [1] https://people.freebsd.org/~kevans/regex-collation.diff