Date: Tue, 23 Jan 2018 03:53:19 +0300
From: Yuri Pankov <yuripv@icloud.com>
To: freebsd-hackers <freebsd-hackers@freebsd.org>, Kyle Evans <kevans@FreeBSD.org>
Subject: libc/regex: r302824 added invalid check breaking collating ranges
Message-ID: <a0d9abd8-19b8-cdf6-5451-e184fa182b38@icloud.com>

(CCing Kyle as he's working on regex at the moment, not because he broke something.)

Hi,

r302824 added an invalid check which breaks collating ranges:

    -if (table->__collate_load_error) {
    -	(void)REQUIRE((uch)start <= (uch)finish, REG_ERANGE);
    +if (table->__collate_load_error || MB_CUR_MAX > 1) {
    +	(void)REQUIRE(start <= finish, REG_ERANGE);

The "MB_CUR_MAX > 1" condition is wrong: we should be doing a proper comparison according to the current locale's collation, not simply comparing the wchar_t values.

Example -- see Table 1 in http://www.unicode.org/reports/tr10/:

Let's try Swedish collation:

    $ echo 'test' | LC_COLLATE=se_SE.UTF-8 grep '[ö-z]'
    grep: invalid character range
    $ echo 'test' | LC_COLLATE=se_SE.UTF-8 grep '[z-ö]'

OK, the above seems to be correct, as 'ö' > 'z' in Swedish collation, but we just got lucky here: the wchar_t comparison happens to give the same result.

Now the German one:

    $ echo 'test' | LC_COLLATE=de_DE.UTF-8 grep '[ö-z]'
    grep: invalid character range
    $ echo 'test' | LC_COLLATE=de_DE.UTF-8 grep '[z-ö]'

Same result, but according to the table, 'ö' < 'z' in German collation!

I think the fix here would be to drop the "if (table->__collate_load_error || MB_CUR_MAX > 1)" block entirely. We no longer use "table", so there's no point in getting it and checking for an error; wcscoll(), which would eventually be called in p_range_cmp(), does the table handling itself. And we can't use the direct comparison for anything other than the 'C' locale (not sure if it's applicable even there).