Date: Wed, 30 Apr 2014 16:43:20 -0500
From: Pedro Giffuni <pfg@freebsd.org>
To: Jilles Tjoelker <jilles@stack.nl>
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject: Re: svn commit: r265095 - head/lib/libc/locale
Message-ID: <53616E78.3010301@freebsd.org>
In-Reply-To: <20140430211028.GA61757@stack.nl>
References: <201404291525.s3TFPvmt097589@svn.freebsd.org> <20140430211028.GA61757@stack.nl>

On 04/30/14 16:10, Jilles Tjoelker wrote:
> On Tue, Apr 29, 2014 at 03:25:57PM +0000, Pedro F. Giffuni wrote:
>> Author: pfg
>> Date: Tue Apr 29 15:25:57 2014
>> New Revision: 265095
>> URL: http://svnweb.freebsd.org/changeset/base/265095
>> Log:
>>   citrus: Avoid invalid code points.
>>
>>   From the OpenBSD log:
>>   The UTF-8 decoder should not accept byte sequences which decode to unicode
>>   code positions U+D800 to U+DFFF (UTF-16 surrogates), U+FFFE, and U+FFFF.
>>   http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
>>   http://unicode.org/faq/utf_bom.html#utf8-4
>>   Reported by: Stefan Sperling
>>   Obtained from: OpenBSD
>>   MFC after: 5 days
>> Modified:
>>   head/lib/libc/locale/utf8.c
>> Modified: head/lib/libc/locale/utf8.c
>> ==============================================================================
>> --- head/lib/libc/locale/utf8.c Tue Apr 29 15:12:23 2014 (r265094)
>> +++ head/lib/libc/locale/utf8.c Tue Apr 29 15:25:57 2014 (r265095)
>> @@ -203,6 +203,14 @@ _UTF8_mbrtowc(wchar_t * __restrict pwc,
>>                  errno = EILSEQ;
>>                  return ((size_t)-1);
>>          }
>> +        if ((wch >= 0xd800 && wch <= 0xdfff) ||
>> +            wch == 0xfffe || wch == 0xffff) {
>> +                /*
>> +                 * Malformed input; invalid code points.
>> +                 */
>> +                errno = EILSEQ;
>> +                return ((size_t)-1);
>> +        }
>>          if (pwc != NULL)
>>                  *pwc = wch;
>>          us->want = 0;
> Hmm, I think U+FFFE and U+FFFF should be passed through normally.
> According to http://www.unicode.org/faq/private_use.html they are
> "noncharacters" (basically a more private variant of private-use
> characters) and must be mapped through UTFs.
>
> The part that rejects U+D800 to U+DFFF is definitely correct, though.
> http://unicode.org/faq/utf_bom.html#utf8-4 tells to do only that.
>
> The part about U+FFFE and U+FFFF in
> http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 seems out of date.
> Note the last modified date of that page: 2009-05-11.
>
> On another note, everything above U+0010FFFF should perhaps be rejected
> since those codes, which cannot be encoded in UTF-16, were excluded from
> Unicode and ISO 10646.
>

Thank you! I will fix soon the UTF-8 part.

Pedro.
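
The checks discussed in this thread can be sketched as a standalone C
helper. This is only an illustration, not the code committed to utf8.c:
the name utf8_scalar_ok and the use of uint32_t rather than wchar_t are
assumptions made for the example. It rejects the UTF-16 surrogate range
and anything above U+10FFFF, and lets the noncharacters U+FFFE and
U+FFFF through, as suggested in the review above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of libc): report whether a value
 * produced by a UTF-8 decoder is acceptable under the rules suggested
 * in the review.
 */
static bool
utf8_scalar_ok(uint32_t wch)
{
	/* UTF-16 surrogates must never appear in UTF-8. */
	if (wch >= 0xd800 && wch <= 0xdfff)
		return (false);
	/* Values above U+10FFFF are outside Unicode and ISO 10646. */
	if (wch > 0x10ffff)
		return (false);
	/* Noncharacters such as U+FFFE and U+FFFF pass through. */
	return (true);
}
```

In a decoder shaped like the one in the diff, a false result would
correspond to the path that sets errno to EILSEQ and returns (size_t)-1.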