From: Kyle Evans
Date: Tue, 23 Jan 2018 13:22:04 -0600
Subject: Re: libc/regex: r302824 added invalid check breaking collating ranges
To: Yuri Pankov
Cc: FreeBSD Hackers
List-Id: Technical Discussions relating to FreeBSD

On Tue, Jan 23, 2018 at 1:10 PM, Yuri Pankov wrote:
> On Tue, Jan 23, 2018 at 08:10:32AM -0600, Kyle Evans wrote:
>>
>> On Mon, Jan 22, 2018 at 11:36 PM, Yuri Pankov wrote:
>>>
>>> On Tue, Jan 23, 2018 at 03:53:19AM +0300, Yuri Pankov wrote:
>>>>
>>>>
>>>> (CCing Kyle as he's working on regex at the moment and not because he
>>>> broke something)
>>>>
>>>> Hi,
>>>>
>>>> r302284 added an invalid check which breaks collating ranges:
>>>>
>>>> -if (table->__collate_load_error) {
>>>> -       (void)REQUIRE((uch)start <= (uch)finish, REG_ERANGE);
>>>> +if (table->__collate_load_error || MB_CUR_MAX > 1) {
>>>> +       (void)REQUIRE(start <= finish, REG_ERANGE);
>>>>
>>>> The "MB_CUR_MAX > 1" is wrong, we should be doing proper comparison
>>>> according to current locale's collation and not simply comparing the
>>>> wchar_t values.
>>>
>>>
>>>
>>> After re-reading the specification I now see that what looked like a bug
>>> is actually an implementation choice, though one that needs to be
>>> documented. I'll update the man page if anyone is willing to review
>>> (and commit) the changes.
>>
>>
>> Can you point to the section of specification that indicates this is
>> OK behavior? It doesn't seem desirable, but I see that GNU systems
>> will operate in the same manner that we do now.
>
>
> Here --
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html:
> ------------------------------------------------------------------------
> In the POSIX locale, a range expression represents the set of collating
> elements that fall between two elements in the collation sequence,
> inclusive. In other locales, a range expression has unspecified behavior:
> strictly conforming applications shall not rely on whether the range
> expression is valid, or on the set of collating elements matched.
> ------------------------------------------------------------------------
>

Thanks- our current behavior seems reasonable in that context.

> I've tried to "fix" what I was seeing as well, and yes, everything outside
> of ASCII is ugly, e.g. Cyrillic 'а-я' would match much more than you could
> expect if you are doing lookups based on collation order (capital chars and
> a lot of other symbols).
>
> So what we have currently looks the least evil to me:
>
> - non-collating ASCII lookups for any locale -- looking at the log for
>   regcomp.c there was an attempt to "fix" this, but it was reverted as
>   a lot of existing code relies on this;
> - non-collating multi-byte locale lookups -- they will work for almost
>   all cases, and where they don't, well POSIX says it's undefined :D
> - collating single-byte locale lookups for outside of ASCII range --
>   they make sense as collation order there doesn't seem to mix
>   small/caps/other characters together.
>
> What I think we need to do is document this as implementation choice in the
> code and regex(3) "IMPLEMENTATION NOTES" so that another poor soul doesn't
> come trying to fix it as I did :-)

I agree with your assessment- such a patch would be welcomed,
especially before I go and revise a bunch of this for clarification in
a future libregex world.
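
For anyone following the thread, the two ways of validating a range that are being contrasted can be sketched in C roughly as below. This is only an illustration and not the regcomp.c code: the helper names (range_ok_by_value, range_ok_by_collation, in_range_by_value, compile_status) are hypothetical, and the collation variant simply leans on wcscoll(3) as a stand-in for whatever the libc collation tables would do internally.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical helper: accept a range when start <= finish as raw wchar_t
 * values, which is what the REQUIRE(start <= finish, REG_ERANGE) check in
 * the quoted diff amounts to.
 */
int
range_ok_by_value(wchar_t start, wchar_t finish)
{
    return (start <= finish);
}

/*
 * Hypothetical helper: accept a range when start does not collate after
 * finish in the current locale, using wcscoll(3) on one-character strings.
 */
int
range_ok_by_collation(wchar_t start, wchar_t finish)
{
    wchar_t s[2] = { start, L'\0' };
    wchar_t f[2] = { finish, L'\0' };

    return (wcscoll(s, f) <= 0);
}

/* Hypothetical helper: range membership by raw wchar_t value only. */
int
in_range_by_value(wchar_t c, wchar_t start, wchar_t finish)
{
    return (start <= c && c <= finish);
}

/* Compile an ERE and return regcomp(3)'s status; 0 means success. */
int
compile_status(const char *pattern)
{
    regex_t re;
    int error;

    assert(pattern != NULL);
    error = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    if (error == 0)
        regfree(&re);
    return (error);
}
```

Assuming wchar_t holds Unicode code points (an assumption, not something POSIX guarantees), in_range_by_value(0x0451, 0x0430, 0x044F) is false: U+0451 ('ё') lies outside the value range 'а-я' (U+0430 to U+044F), which is one of the cases where a non-collating lookup can surprise. A reversed range such as compile_status("[z-a]") should come back nonzero; with the check quoted above that status would be REG_ERANGE.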