From owner-svn-src-head@FreeBSD.ORG Wed Apr 30 21:10:32 2014
Date: Wed, 30 Apr 2014 23:10:28 +0200
From: Jilles Tjoelker
To: "Pedro F. Giffuni"
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject: Re: svn commit: r265095 - head/lib/libc/locale
Message-ID: <20140430211028.GA61757@stack.nl>
References: <201404291525.s3TFPvmt097589@svn.freebsd.org>
In-Reply-To: <201404291525.s3TFPvmt097589@svn.freebsd.org>

On Tue, Apr 29, 2014 at 03:25:57PM +0000, Pedro F. Giffuni wrote:
> Author: pfg
> Date: Tue Apr 29 15:25:57 2014
> New Revision: 265095
> URL: http://svnweb.freebsd.org/changeset/base/265095
>
> Log:
>   citrus: Avoid invalid code points.
>
>   From the OpenBSD log:
>   The UTF-8 decoder should not accept byte sequences which decode to unicode
>   code positions U+D800 to U+DFFF (UTF-16 surrogates), U+FFFE, and U+FFFF.
>
>   http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
>   http://unicode.org/faq/utf_bom.html#utf8-4
>
>   Reported by:	Stefan Sperling
>   Obtained from:	OpenBSD
>   MFC after:	5 days
>
> Modified:
>   head/lib/libc/locale/utf8.c
>
> Modified: head/lib/libc/locale/utf8.c
> ==============================================================================
> --- head/lib/libc/locale/utf8.c	Tue Apr 29 15:12:23 2014	(r265094)
> +++ head/lib/libc/locale/utf8.c	Tue Apr 29 15:25:57 2014	(r265095)
> @@ -203,6 +203,14 @@ _UTF8_mbrtowc(wchar_t * __restrict pwc,
>  		errno = EILSEQ;
>  		return ((size_t)-1);
>  	}
> +	if ((wch >= 0xd800 && wch <= 0xdfff) ||
> +	    wch == 0xfffe || wch == 0xffff) {
> +		/*
> +		 * Malformed input; invalid code points.
> +		 */
> +		errno = EILSEQ;
> +		return ((size_t)-1);
> +	}
>  	if (pwc != NULL)
>  		*pwc = wch;
>  	us->want = 0;

Hmm, I think U+FFFE and U+FFFF should be passed through normally.
According to http://www.unicode.org/faq/private_use.html they are
"noncharacters" (basically a more private variant of private-use
characters) and must be mapped through UTFs.

The part that rejects U+D800 to U+DFFF is definitely correct, though.
http://unicode.org/faq/utf_bom.html#utf8-4 says to do only that.

The part about U+FFFE and U+FFFF in
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 seems out of date.
Note the last modified date of that page: 2009-05-11.

On another note, everything above U+0010FFFF should perhaps be rejected,
since those codes, which cannot be encoded in UTF-16, were excluded from
Unicode and ISO 10646.

-- 
Jilles Tjoelker
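
For illustration only, the acceptance rule argued for above could be
sketched as a standalone predicate. This is not the code in utf8.c; the
function name utf8_codepoint_ok and the use of uint32_t instead of
wchar_t are assumptions made to keep the example self-contained. It
rejects the surrogate range and anything above U+10FFFF, and lets
noncharacters such as U+FFFE and U+FFFF through.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of utf8.c: decide whether a value
 * produced by a UTF-8 decoder should be accepted.
 */
static bool
utf8_codepoint_ok(uint32_t wch)
{
	/* UTF-16 surrogates must never appear in UTF-8. */
	if (wch >= 0xd800 && wch <= 0xdfff)
		return (false);
	/* Values above U+10FFFF are outside Unicode and ISO 10646. */
	if (wch > 0x10ffff)
		return (false);
	/* Everything else, including U+FFFE and U+FFFF, passes through. */
	return (true);
}
```

In a decoder shaped like the quoted hunk, a false result would
presumably map to setting errno to EILSEQ and returning (size_t)-1.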