From owner-svn-src-all@FreeBSD.ORG Tue May 21 22:00:06 2013 Return-Path: Delivered-To: svn-src-all@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 64685500; Tue, 21 May 2013 22:00:06 +0000 (UTC) (envelope-from jilles@stack.nl) Received: from mx1.stack.nl (relay02.stack.nl [IPv6:2001:610:1108:5010::104]) by mx1.freebsd.org (Postfix) with ESMTP id E91D1132; Tue, 21 May 2013 22:00:05 +0000 (UTC) Received: from snail.stack.nl (snail.stack.nl [IPv6:2001:610:1108:5010::131]) by mx1.stack.nl (Postfix) with ESMTP id C085935930D; Wed, 22 May 2013 00:00:03 +0200 (CEST) Received: by snail.stack.nl (Postfix, from userid 1677) id AD00228493; Wed, 22 May 2013 00:00:03 +0200 (CEST) Date: Wed, 22 May 2013 00:00:03 +0200 From: Jilles Tjoelker To: Ed Schouten Subject: Re: svn commit: r250883 - in head: include include/xlocale lib/libc/locale sys/sys tools/regression/lib/libc/locale Message-ID: <20130521220003.GB58299@stack.nl> References: <201305211959.r4LJxbLx034714@svn.freebsd.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201305211959.r4LJxbLx034714@svn.freebsd.org> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org X-BeenThere: svn-src-all@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: "SVN commit messages for the entire src tree \(except for " user" and " projects" \)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 21 May 2013 22:00:06 -0000 On Tue, May 21, 2013 at 07:59:37PM +0000, Ed Schouten wrote: > Author: ed > Date: Tue May 21 19:59:37 2013 > New Revision: 250883 > URL: http://svnweb.freebsd.org/changeset/base/250883 > Log: > Add . Looks like an interesting approach to make applications LC_CTYPE locale aware. > The header, part of C11, adds a small number of utility > functions for 16/32-bit "universal" characters, which may or may not be > UTF-16/32. As our wchar_t is already ISO 10646, simply add light-weight > wrappers around wcrtomb() and mbrtowc(). Our wchar_t is only ISO 10646 for UTF-8 and possibly US-ASCII and ISO8859-1 (subset) locales. However, we support various other charsets such as ISO8859-2 and even some non-Unicode multibyte encodings (although I have no idea how complete the support for the latter is). Using the new functions with such locales is likely to produce garbage. Some sort of iconv is necessary to avoid this mojibake and we do not have it in base. I encountered the same issue when implementing $'\uXXXX' and $'\UXXXXXXXX' for /bin/sh. I decided to implement full support only for UTF-8 locales and restrict the rest to US-ASCII, converting all other (valid) characters to question marks. I did not see a reason to treat Western Europe preferentially by adding a simple ISO8859-1 mapping. For these functions, it is probably best to fail (such as with [EILSEQ]) when a character is encountered for which the conversion is not implemented. > While there, also add (non-yet-standard) _l functions, similar to the > ones we already have for the other locale-dependent functions. This makes sense. > [snip] > Added: head/lib/libc/locale/c16rtomb.c > ============================================================================== > --- /dev/null 00:00:00 1970 (empty, because file is newly added) > +++ head/lib/libc/locale/c16rtomb.c Tue May 21 19:59:37 2013 (r250883) > [snip] > +typedef struct { > + char16_t lead_surrogate; > + mbstate_t c32_mbstate; > +} _Char16State; > + > +size_t > +c16rtomb_l(char * __restrict s, char16_t c16, mbstate_t * __restrict ps, > + locale_t locale) > +{ > + _Char16State *cs; > + char32_t c32; > + > + FIX_LOCALE(locale); > + if (ps == NULL) > + ps = &locale->c16rtomb; > + cs = (_Char16State *)ps; > + > + /* If s is a null pointer, the value of parameter c16 is ignored. */ > + if (s == NULL) { > + c32 = 0; > + } else if (cs->lead_surrogate >= 0xd800 && > + cs->lead_surrogate <= 0xdbff) { > + /* We should see a trail surrogate now. */ > + if (c16 < 0xdc00 || c16 > 0xdfff) { > + errno = EILSEQ; > + return ((size_t)-1); > + } > + c32 = 0x10000 + ((cs->lead_surrogate & 0x3ff) << 10 | > + (c16 & 0x3ff)); > + } else if (c16 >= 0xd800 && c16 <= 0xdbff) { > + /* Store lead surrogate for next invocation. */ > + cs->lead_surrogate = c16; > + return (0); > + } else { > + /* Regular character. */ > + c32 = c16; > + } > + cs->lead_surrogate = 0; > + > + return (c32rtomb_l(s, c32, &cs->c32_mbstate, locale)); > +} You can probably avoid the potential buffer overflow by copying the trailing 126 bytes of the _Char16State into a new mbstate_t object, and back after calling c32rtomb_l(). Likewise for mbrtoc16_l(). (The committed code allows using 120 bytes which should still be more than enough.) There may also be a strict-aliasing violation if ps == NULL, as the mbstate_t objects in __xlocale_C_locale and __xlocale_global_locale have a declared type. If ps != NULL, it is OK as long as the object is a separate compilation unit. -- Jilles Tjoelker