From owner-svn-src-all@FreeBSD.ORG  Tue May 21 22:00:06 2013
Return-Path: <owner-svn-src-all@FreeBSD.ORG>
Delivered-To: svn-src-all@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 64685500;
 Tue, 21 May 2013 22:00:06 +0000 (UTC) (envelope-from jilles@stack.nl)
Received: from mx1.stack.nl (relay02.stack.nl [IPv6:2001:610:1108:5010::104])
 by mx1.freebsd.org (Postfix) with ESMTP id E91D1132;
 Tue, 21 May 2013 22:00:05 +0000 (UTC)
Received: from snail.stack.nl (snail.stack.nl [IPv6:2001:610:1108:5010::131])
 by mx1.stack.nl (Postfix) with ESMTP id C085935930D;
 Wed, 22 May 2013 00:00:03 +0200 (CEST)
Received: by snail.stack.nl (Postfix, from userid 1677)
 id AD00228493; Wed, 22 May 2013 00:00:03 +0200 (CEST)
Date: Wed, 22 May 2013 00:00:03 +0200
From: Jilles Tjoelker <jilles@stack.nl>
To: Ed Schouten <ed@FreeBSD.org>
Subject: Re: svn commit: r250883 - in head: include include/xlocale
 lib/libc/locale sys/sys tools/regression/lib/libc/locale
Message-ID: <20130521220003.GB58299@stack.nl>
References: <201305211959.r4LJxbLx034714@svn.freebsd.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <201305211959.r4LJxbLx034714@svn.freebsd.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org,
 src-committers@freebsd.org
X-BeenThere: svn-src-all@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "SVN commit messages for the entire src tree \(except for &quot;
 user&quot; and &quot; projects&quot; \)" <svn-src-all.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/svn-src-all>,
 <mailto:svn-src-all-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/svn-src-all>
List-Post: <mailto:svn-src-all@freebsd.org>
List-Help: <mailto:svn-src-all-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/svn-src-all>,
 <mailto:svn-src-all-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 21 May 2013 22:00:06 -0000

On Tue, May 21, 2013 at 07:59:37PM +0000, Ed Schouten wrote:
> Author: ed
> Date: Tue May 21 19:59:37 2013
> New Revision: 250883
> URL: http://svnweb.freebsd.org/changeset/base/250883

> Log:
>   Add <uchar.h>.

Looks like an interesting approach to make applications LC_CTYPE locale
aware.

>   The <uchar.h> header, part of C11, adds a small number of utility
>   functions for 16/32-bit "universal" characters, which may or may not be
>   UTF-16/32. As our wchar_t is already ISO 10646, simply add light-weight
>   wrappers around wcrtomb() and mbrtowc().

Our wchar_t is only ISO 10646 for UTF-8 and possibly US-ASCII and
ISO8859-1 (subset) locales. However, we support various other charsets
such as ISO8859-2 and even some non-Unicode multibyte encodings
(although I have no idea how complete the support for the latter is).
Using the new functions with such locales is likely to produce garbage.
Some sort of iconv is necessary to avoid this mojibake and we do not
have it in base.

I encountered the same issue when implementing $'\uXXXX' and
$'\UXXXXXXXX' for /bin/sh. I decided to implement full support only for
UTF-8 locales and restrict the rest to US-ASCII, converting all other
(valid) characters to question marks. I did not see a reason to treat
Western Europe preferentially by adding a simple ISO8859-1 mapping.

For these functions, it is probably best to fail (such as with [EILSEQ])
when a character is encountered for which the conversion is not
implemented.

>   While there, also add (non-yet-standard) _l functions, similar to the
>   ones we already have for the other locale-dependent functions.

This makes sense.

> [snip]
> Added: head/lib/libc/locale/c16rtomb.c
> ==============================================================================
> --- /dev/null	00:00:00 1970	(empty, because file is newly added)
> +++ head/lib/libc/locale/c16rtomb.c	Tue May 21 19:59:37 2013	(r250883)
> [snip]
> +typedef struct {
> +	char16_t	lead_surrogate;
> +	mbstate_t	c32_mbstate;
> +} _Char16State;
> +
> +size_t
> +c16rtomb_l(char * __restrict s, char16_t c16, mbstate_t * __restrict ps,
> +    locale_t locale)
> +{
> +	_Char16State *cs;
> +	char32_t c32;
> +
> +	FIX_LOCALE(locale);
> +	if (ps == NULL)
> +		ps = &locale->c16rtomb;
> +	cs = (_Char16State *)ps;
> +
> +	/* If s is a null pointer, the value of parameter c16 is ignored. */
> +	if (s == NULL) {
> +		c32 = 0;
> +	} else if (cs->lead_surrogate >= 0xd800 &&
> +	    cs->lead_surrogate <= 0xdbff) {
> +		/* We should see a trail surrogate now. */
> +		if (c16 < 0xdc00 || c16 > 0xdfff) {
> +			errno = EILSEQ;
> +			return ((size_t)-1);
> +		}
> +		c32 = 0x10000 + ((cs->lead_surrogate & 0x3ff) << 10 |
> +		    (c16 & 0x3ff));
> +	} else if (c16 >= 0xd800 && c16 <= 0xdbff) {
> +		/* Store lead surrogate for next invocation. */
> +		cs->lead_surrogate = c16;
> +		return (0);
> +	} else {
> +		/* Regular character. */
> +		c32 = c16;
> +	}
> +	cs->lead_surrogate = 0;
> +
> +	return (c32rtomb_l(s, c32, &cs->c32_mbstate, locale));
> +}

You can probably avoid the potential buffer overflow by copying the
trailing 126 bytes of the _Char16State into a new mbstate_t object, and
back after calling c32rtomb_l(). Likewise for mbrtoc16_l(). (The
committed code allows using 120 bytes which should still be more than
enough.)

There may also be a strict-aliasing violation if ps == NULL, as the
mbstate_t objects in __xlocale_C_locale and __xlocale_global_locale have
a declared type. If ps != NULL, it is OK as long as the object is a
separate compilation unit.

-- 
Jilles Tjoelker