From: Yuri Pankov
Date: Wed, 12 Dec 2018 04:23:01 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: svn commit: r341838 - in head/lib/libc: regex tests/regex
Message-Id: <201812120423.wBC4N10E024486@repo.freebsd.org>

Author: yuripv
Date: Wed Dec 12 04:23:00 2018
New Revision: 341838
URL: https://svnweb.freebsd.org/changeset/base/341838

Log:
  regcomp: reduce size of bitmap for multibyte locales

  This fixes an obscure endless loop seen with case-insensitive patterns
  containing characters in the 128-255 range; it was originally found while
  running the GNU grep test suite.

  Our regex implementation, being kludgy, translates each character in a
  case-insensitive pattern into a bracket expression containing both cases
  of the character. It does not correctly handle the case when the original
  character is in the bitmap and the other case is not, falling into an
  endless loop through p_bracket(), ordinary(), and bothcases().

  Reducing the bitmap to the 0-127 range for multibyte locales solves this,
  as none of these characters has an other-case mapping outside of the
  bitmap.

  We are also safe when an original character outside of the bitmap has an
  other-case mapping inside the bitmap (our current ctype maps contain
  several such characters with a unidirectional mapping into the bitmap).

  Reviewed by:	bapt, kevans, pfg
  Differential revision:	https://reviews.freebsd.org/D18302

Modified:
  head/lib/libc/regex/regcomp.c
  head/lib/libc/regex/regex2.h
  head/lib/libc/regex/utils.h
  head/lib/libc/tests/regex/multibyte.sh

Modified: head/lib/libc/regex/regcomp.c
==============================================================================
--- head/lib/libc/regex/regcomp.c	Wed Dec 12 02:33:01 2018	(r341837)
+++ head/lib/libc/regex/regcomp.c	Wed Dec 12 04:23:00 2018	(r341838)
@@ -1841,21 +1841,29 @@ computejumps(struct parse *p, struct re_guts *g)
 {
 	int ch;
 	int mindex;
+	int cmin, cmax;
 
+	/*
+	 * For UTF-8 we process only the first 128 characters corresponding to
+	 * the POSIX locale.
+	 */
+	cmin = MB_CUR_MAX == 1 ? CHAR_MIN : 0;
+	cmax = MB_CUR_MAX == 1 ? CHAR_MAX : 127;
+
 	/* Avoid making errors worse */
 	if (p->error != 0)
 		return;
 
-	g->charjump = (int*) malloc((NC + 1) * sizeof(int));
+	g->charjump = (int *)malloc((cmax - cmin + 1) * sizeof(int));
 	if (g->charjump == NULL)	/* Not a fatal error */
 		return;
 	/* Adjust for signed chars, if necessary */
-	g->charjump = &g->charjump[-(CHAR_MIN)];
+	g->charjump = &g->charjump[-(cmin)];
 
 	/* If the character does not exist in the pattern, the jump
 	 * is equal to the number of characters in the pattern.
 	 */
-	for (ch = CHAR_MIN; ch < (CHAR_MAX + 1); ch++)
+	for (ch = cmin; ch < cmax + 1; ch++)
 		g->charjump[ch] = g->mlen;
 
 	/* If the character does exist, compute the jump that would

Modified: head/lib/libc/regex/regex2.h
==============================================================================
--- head/lib/libc/regex/regex2.h	Wed Dec 12 02:33:01 2018	(r341837)
+++ head/lib/libc/regex/regex2.h	Wed Dec 12 04:23:00 2018	(r341838)
@@ -113,7 +113,7 @@ typedef struct {
 	wint_t		max;
 } crange;
 typedef struct {
-	unsigned char	bmp[NC / 8];
+	unsigned char	bmp[NC_MAX / 8];
 	wctype_t	*types;
 	unsigned int	ntypes;
 	wint_t		*wides;
@@ -133,9 +133,14 @@ CHIN1(cset *cs, wint_t ch)
 	if (ch < NC)
 		return (((cs->bmp[ch >> 3] & (1 << (ch & 7))) != 0) ^
 		    cs->invert);
-	for (i = 0; i < cs->nwides; i++)
-		if (ch == cs->wides[i])
+	for (i = 0; i < cs->nwides; i++) {
+		if (cs->icase) {
+			if (ch == towlower(cs->wides[i]) ||
+			    ch == towupper(cs->wides[i]))
+				return (!cs->invert);
+		} else if (ch == cs->wides[i])
 			return (!cs->invert);
+	}
 	for (i = 0; i < cs->nranges; i++)
 		if (cs->ranges[i].min <= ch && ch <= cs->ranges[i].max)
 			return (!cs->invert);

Modified: head/lib/libc/regex/utils.h
==============================================================================
--- head/lib/libc/regex/utils.h	Wed Dec 12 02:33:01 2018	(r341837)
+++ head/lib/libc/regex/utils.h	Wed Dec 12 04:23:00 2018	(r341838)
@@ -39,7 +39,9 @@
 /* utility definitions */
 #define	DUPMAX		_POSIX2_RE_DUP_MAX	/* xxx is this right? */
 #define	INFINITY	(DUPMAX + 1)
-#define	NC		(CHAR_MAX - CHAR_MIN + 1)
+
+#define	NC_MAX		(CHAR_MAX - CHAR_MIN + 1)
+#define	NC		((MB_CUR_MAX) == 1 ? (NC_MAX) : (128))
 typedef unsigned char uch;
 
 /* switch off assertions (if not already off) if no REDEBUG */

Modified: head/lib/libc/tests/regex/multibyte.sh
==============================================================================
--- head/lib/libc/tests/regex/multibyte.sh	Wed Dec 12 02:33:01 2018	(r341837)
+++ head/lib/libc/tests/regex/multibyte.sh	Wed Dec 12 04:23:00 2018	(r341838)
@@ -1,11 +1,11 @@
 # $FreeBSD$
 
-atf_test_case multibyte
-multibyte_head()
+atf_test_case bmpat
+bmpat_head()
 {
 	atf_set "descr" "Check matching multibyte characters (PR153502)"
 }
-multibyte_body()
+bmpat_body()
 {
 	export LC_CTYPE="C.UTF-8"
 
@@ -29,7 +29,25 @@ multibyte_body()
 	    sed -ne '/.a./p'
 }
 
+atf_test_case icase
+icase_head()
+{
+	atf_set "descr" "Check case-insensitive matching for characters 128-255"
+}
+icase_body()
+{
+	export LC_CTYPE="C.UTF-8"
+
+	a=$(printf '\302\265\n')	# U+00B5
+	b=$(printf '\316\234\n')	# U+039C
+	c=$(printf '\316\274\n')	# U+03BC
+
+	echo $b | atf_check -o "inline:$b\n" sed -ne "/$a/Ip"
+	echo $c | atf_check -o "inline:$c\n" sed -ne "/$a/Ip"
+}
+
 atf_init_test_cases()
 {
-	atf_add_test_case multibyte
+	atf_add_test_case bmpat
+	atf_add_test_case icase
 }
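
For readers following the log message, here is a minimal sketch of how the bitmap limit decides where a character such as U+00B5 is stored. It is not the libc code: struct demo_cset, demo_init(), demo_add(), demo_in_bitmap(), demo_contains(), and the fixed DEMO_MAX_WIDES cap are all hypothetical names made up for illustration, and the mb_cur_max parameter stands in for MB_CUR_MAX so that the example does not depend on the process locale. Only the bit arithmetic and the 256-versus-128 limit are taken from the diff.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_NC_MAX	(CHAR_MAX - CHAR_MIN + 1)	/* same formula as NC_MAX */
#define DEMO_MAX_WIDES	8				/* hypothetical fixed cap */

/* Hypothetical simplified set; the real cset has more fields. */
struct demo_cset {
	unsigned char	bmp[DEMO_NC_MAX / 8];	/* sized for the widest case */
	unsigned int	wides[DEMO_MAX_WIDES];	/* code points outside the bitmap */
	size_t		nwides;
	unsigned int	nc;			/* bitmap limit in effect */
};

/* mb_cur_max stands in for MB_CUR_MAX, so no locale setup is needed. */
static void
demo_init(struct demo_cset *cs, size_t mb_cur_max)
{
	memset(cs, 0, sizeof(*cs));
	cs->nc = (mb_cur_max == 1) ? DEMO_NC_MAX : 128;
}

/* Store ch in the bitmap when it is below the limit, else in wides. */
static void
demo_add(struct demo_cset *cs, unsigned int ch)
{
	if (ch < cs->nc) {
		cs->bmp[ch >> 3] |= (unsigned char)(1 << (ch & 7));
	} else {
		assert(cs->nwides < DEMO_MAX_WIDES);
		cs->wides[cs->nwides++] = ch;
	}
}

/* True only when ch is below the limit and its bit is set. */
static bool
demo_in_bitmap(const struct demo_cset *cs, unsigned int ch)
{
	return (ch < cs->nc && (cs->bmp[ch >> 3] & (1 << (ch & 7))) != 0);
}

/* Exact-match lookup: bitmap first, then a linear scan of wides. */
static bool
demo_contains(const struct demo_cset *cs, unsigned int ch)
{
	size_t i;

	if (ch < cs->nc)
		return (demo_in_bitmap(cs, ch));
	for (i = 0; i < cs->nwides; i++)
		if (cs->wides[i] == ch)
			return (true);
	return (false);
}
```

With mb_cur_max set to 1, adding 0xB5 sets a bit in the bitmap; with a larger value the same code point goes to the wides list, which is the list the patched CHIN1() compares against using towlower() and towupper() when icase is set. The sketch leaves that case folding out on purpose, since its result depends on the locale's ctype maps.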