From owner-svn-src-head@FreeBSD.ORG  Wed Apr 30 21:10:32 2014
Date: Wed, 30 Apr 2014 23:10:28 +0200
From: Jilles Tjoelker <jilles@stack.nl>
To: "Pedro F. Giffuni" <pfg@FreeBSD.org>
Subject: Re: svn commit: r265095 - head/lib/libc/locale
Message-ID: <20140430211028.GA61757@stack.nl>
References: <201404291525.s3TFPvmt097589@svn.freebsd.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <201404291525.s3TFPvmt097589@svn.freebsd.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org,
 src-committers@freebsd.org

On Tue, Apr 29, 2014 at 03:25:57PM +0000, Pedro F. Giffuni wrote:
> Author: pfg
> Date: Tue Apr 29 15:25:57 2014
> New Revision: 265095
> URL: http://svnweb.freebsd.org/changeset/base/265095

> Log:
>   citrus: Avoid invalid code points.
>   
>   From the OpenBSD log:
>   The UTF-8 decoder should not accept byte sequences which decode to unicode
>   code positions U+D800 to U+DFFF (UTF-16 surrogates), U+FFFE, and U+FFFF.

>   http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
>   http://unicode.org/faq/utf_bom.html#utf8-4

>   Reported by:	Stefan Sperling
>   Obtained from:	OpenBSD
>   MFC after:	5 days

> Modified:
>   head/lib/libc/locale/utf8.c

> Modified: head/lib/libc/locale/utf8.c
> ==============================================================================
> --- head/lib/libc/locale/utf8.c	Tue Apr 29 15:12:23 2014	(r265094)
> +++ head/lib/libc/locale/utf8.c	Tue Apr 29 15:25:57 2014	(r265095)
> @@ -203,6 +203,14 @@ _UTF8_mbrtowc(wchar_t * __restrict pwc, 
>  		errno = EILSEQ;
>  		return ((size_t)-1);
>  	}
> +	if ((wch >= 0xd800 && wch <= 0xdfff) ||
> +	    wch == 0xfffe || wch == 0xffff) {
> +		/*
> +		 * Malformed input; invalid code points.
> +		 */
> +		errno = EILSEQ;
> +		return ((size_t)-1);
> +	}
>  	if (pwc != NULL)
>  		*pwc = wch;
>  	us->want = 0;

Hmm, I think U+FFFE and U+FFFF should be passed through normally.
According to http://www.unicode.org/faq/private_use.html they are
"noncharacters" (basically a more private variant of private-use
characters), and the UTFs must still be able to encode and decode them.

The part that rejects U+D800 to U+DFFF is definitely correct, though.
http://unicode.org/faq/utf_bom.html#utf8-4 says to reject only those.

The part about U+FFFE and U+FFFF in
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 seems out of date.
Note the last modified date of that page: 2009-05-11.
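
As a rough sketch of the narrower check suggested above (only the
surrogate range is refused, while U+FFFE and U+FFFF pass through), the
test could look like the following. The helper name is hypothetical and
not part of utf8.c; in _UTF8_mbrtowc() the condition would presumably
stay inline and set errno to EILSEQ as the committed code does.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from utf8.c: returns true when the
 * decoded value lies in the UTF-16 surrogate range and so should be
 * rejected. The noncharacters U+FFFE and U+FFFF are deliberately
 * not matched here.
 */
bool
is_utf16_surrogate(uint32_t wch)
{
	return (wch >= 0xd800 && wch <= 0xdfff);
}
```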

On another note, everything above U+0010FFFF should perhaps be rejected
since those codes, which cannot be encoded in UTF-16, were excluded from
Unicode and ISO 10646.
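
To illustrate why an upper bound is needed at all: a four-byte UTF-8
sequence carries 21 payload bits, so it can express values up to
0x1FFFFF, well past U+10FFFF. The sketch below is only an illustration
under that assumption. Both function names and the UNICODE_MAX macro are
made up for this example, and the byte patterns are assumed to have been
validated by the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Highest code point in Unicode and ISO 10646 (also the UTF-16 limit). */
#define UNICODE_MAX 0x10ffffU

/*
 * Hypothetical helper: combine the payload bits of a four-byte UTF-8
 * sequence (lead byte 11110xxx followed by three 10xxxxxx bytes).
 */
uint32_t
utf8_combine4(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
	return (((uint32_t)(b0 & 0x07) << 18) |
	    ((uint32_t)(b1 & 0x3f) << 12) |
	    ((uint32_t)(b2 & 0x3f) << 6) |
	    (uint32_t)(b3 & 0x3f));
}

/* Hypothetical check: true when the value is above U+10FFFF. */
bool
beyond_unicode_range(uint32_t wch)
{
	return (wch > UNICODE_MAX);
}
```

For example, the bytes F4 8F BF BF combine to U+10FFFF and pass the
check, while F4 90 80 80 combine to 0x110000 and would be rejected.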

-- 
Jilles Tjoelker