From owner-svn-src-head@freebsd.org Sun Jun 5 19:49:13 2016
Subject: Re: svn commit: r301461 - in head/lib/libc: gen locale regex
To: "Pedro F. Giffuni", src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
References: <201606051912.u55JCqdR036458@repo.freebsd.org>
From: Andrey Chernov
Message-ID: <40c481fe-5585-45d2-d4e3-b9988a8198f3@freebsd.org>
Date: Sun, 5 Jun 2016 22:49:03 +0300
In-Reply-To: <201606051912.u55JCqdR036458@repo.freebsd.org>
List-Id: SVN commit messages for the src tree for head/-current

On 05.06.2016 22:12, Pedro F. Giffuni wrote:
> --- head/lib/libc/regex/regcomp.c Sun Jun 5 18:16:33 2016 (r301460)
> +++ head/lib/libc/regex/regcomp.c Sun Jun 5 19:12:52 2016 (r301461)
> @@ -821,10 +821,10 @@ p_b_term(struct parse *p, cset *cs)
>                          (void)REQUIRE((uch)start <= (uch)finish, REG_ERANGE);
>                          CHaddrange(p, cs, start, finish);
>                  } else {
> -                        (void)REQUIRE(__collate_range_cmp(table, start, finish) <= 0, REG_ERANGE);
> +                        (void)REQUIRE(__wcollate_range_cmp(table, start, finish) <= 0, REG_ERANGE);
>                          for (i = 0; i <= UCHAR_MAX; i++) {
> -                                if (   __collate_range_cmp(table, start, i) <= 0
> -                                    && __collate_range_cmp(table, i, finish) <= 0
> +                                if (   __wcollate_range_cmp(table, start, i) <= 0
> +                                    && __wcollate_range_cmp(table, i, finish) <= 0
>                                     )
>                                          CHadd(p, cs, i);
>                          }
>

As I already mentioned in the PR, regcomp has been broken ever since
someone added wchar_t support to it. regcomp ranges now work only for the
first 256 wchars of the current locale; note the loop's upper limit:

        for (i = 0; i <= UCHAR_MAX; i++) {

In general, ranges in regcomp are now either broken or memory-hungry. We
have a bitmask only for the first 256 wchars; all others are added to the
range literally.
Imagine what happens if someone specifies the full Unicode range in a
regexp.

The proper fix would be to add a bitmask for the whole Unicode range, and
even then a regcomp that tries to use collation in ranges would be
_very_slow_, since it would need to check every Unicode char in its

        for (i = 0; i <= Max_Unicode_wchar; i++) {

loop.

Better to stop pretending that we can support collation in ranges, since
POSIX cares only about its own locale here:

"In the POSIX locale, a range expression represents the set of collating
elements that fall between two elements in the collation sequence,
inclusive. In other locales, a range expression has unspecified behavior:
strictly conforming applications shall not rely on whether the range
expression is valid, or on the set of collating elements matched."

Until a bitmask for the whole Unicode range is implemented (if ever), we
had better stop pretending to honor collation order, which we simply
can't do with wchars now, and do what NetBSD/OpenBSD do (using wchar_t)
instead. That does not prevent the memory consumption on big ranges (a
bitmask is needed, see above), but it at least fixes the problem that
only the first 256 wchars are considered.