From owner-freebsd-hackers@FreeBSD.ORG  Wed Nov  5 21:37:15 2008
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Delivered-To: freebsd-hackers@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 4A3D7106567E
	for <freebsd-hackers@freebsd.org>; Wed,  5 Nov 2008 21:37:15 +0000 (UTC)
	(envelope-from kientzle@freebsd.org)
Received: from kientzle.com (kientzle.com [66.166.149.50])
	by mx1.freebsd.org (Postfix) with ESMTP id 1F0CC8FC12
	for <freebsd-hackers@freebsd.org>; Wed,  5 Nov 2008 21:37:15 +0000 (UTC)
	(envelope-from kientzle@freebsd.org)
Received: from [10.123.2.205] (p53.kientzle.com [66.166.149.53])
	by kientzle.com (8.12.9/8.12.9) with ESMTP id mA5LbEtv035181;
	Wed, 5 Nov 2008 13:37:14 -0800 (PST)
	(envelope-from kientzle@freebsd.org)
Message-ID: <49121206.9090804@freebsd.org>
Date: Wed, 05 Nov 2008 13:37:10 -0800
From: Tim Kientzle <kientzle@freebsd.org>
User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.12) Gecko/20060422
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Maksim Yevmenkin <maksim.yevmenkin@gmail.com>
References: <bb4a86c70811041554k6b55854cw711fab508278e398@mail.gmail.com>
In-Reply-To: <bb4a86c70811041554k6b55854cw711fab508278e398@mail.gmail.com>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Cc: freebsd-hackers@freebsd.org
Subject: Re: converting strings from utf8
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
	<freebsd-hackers.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>, 
	<mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>,
	<mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 05 Nov 2008 21:37:15 -0000

Maksim Yevmenkin wrote:
> 
> can i use wcstombs(3) to convert a string presented in utf8 into
> current locale? basically i'm looking for something like iconv from
> ports but included into base system.

This isn't as easy as it should be, unfortunately.
First, UTF-8 is itself a multibyte encoding, so you have
to first convert to wide characters before you can use
wcstombs().  You could in theory use the following:
   * Set the locale to a UTF-8 locale with setlocale()
   * Use mbstowcs() to convert the UTF-8 string into wide characters
   * Set the locale to your preferred locale
   * Use wcstombs() to convert the wide characters into that locale

Besides being ugly, the locale names themselves are not
standardized, so it's hard to do this portably.  For a
lot of applications, the error handling in wcstombs() is
also troublesome; it rejects the entire string if any one
character can't be converted.
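
For reference, the four locale-switching steps could be sketched roughly
as follows. This is an illustration only: the function name
convert_via_locales() and the fixed 256-character buffer are made up for
the example, the caller has to supply a UTF-8 locale name that exists on
the system (something like "en_US.UTF-8", which is exactly the
non-portable part), and it assumes wchar_t values mean the same thing in
both locales, which the C standard does not promise.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert the UTF-8 string src into the encoding
 * of target_locale by switching the global LC_CTYPE locale twice.
 * Returns the number of bytes written (excluding the NUL), or
 * (size_t)-1 on any failure.  Not thread-safe: setlocale() is global.
 */
static size_t
convert_via_locales(const char *utf8_locale, const char *target_locale,
    const char *src, char *dst, size_t dstsize)
{
    wchar_t wbuf[256];
    const size_t wcount = sizeof(wbuf) / sizeof(wbuf[0]);
    char saved[128];
    const char *cur;
    size_t n;

    /* Remember the caller's locale so it can be restored afterwards. */
    cur = setlocale(LC_CTYPE, NULL);
    if (cur == NULL || strlen(cur) >= sizeof(saved))
        return ((size_t)-1);
    strcpy(saved, cur);

    /* Steps 1 and 2: select the UTF-8 locale, then multibyte -> wide. */
    if (setlocale(LC_CTYPE, utf8_locale) == NULL)
        return ((size_t)-1);
    n = mbstowcs(wbuf, src, wcount);
    if (n == (size_t)-1 || n >= wcount) {
        /* Invalid input, or no room left for the terminator. */
        setlocale(LC_CTYPE, saved);
        return ((size_t)-1);
    }

    /* Steps 3 and 4: select the target locale, then wide -> multibyte. */
    if (setlocale(LC_CTYPE, target_locale) == NULL) {
        setlocale(LC_CTYPE, saved);
        return ((size_t)-1);
    }
    n = wcstombs(dst, wbuf, dstsize);
    setlocale(LC_CTYPE, saved);

    /* One unconvertible character fails the whole string. */
    if (n == (size_t)-1 || n >= dstsize)
        return ((size_t)-1);
    return (n);
}
```

Note that a single character with no representation in the target
locale makes wcstombs() return (size_t)-1, so the caller gets nothing
back for the whole string.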

When I had to do this for libarchive, where the code had
to be very portable (which precluded using iconv), I ended
up doing the following:
  * Wrote my own converter from UTF-8 to wide characters
    (fortunately, UTF-8 is pretty simple to decode; this
     is about 20-30 lines of C)
  * Used wctomb() to convert one character at a time from
     wide characters to the current locale.
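
To give an idea of the size of such a decoder, here is a minimal sketch
of a UTF-8 to wide-character converter. It is not the libarchive code;
the names utf8_to_wc() and utf8_to_wcs() are invented for this example.
It assumes wchar_t is wide enough for any code point up to U+10FFFF and
that wide characters hold Unicode code points, neither of which the C
standard guarantees.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Decode one UTF-8 sequence from s (at most n bytes) into *pwc.
 * Returns the number of bytes consumed, or -1 if the sequence is
 * malformed, truncated, overlong, a surrogate, or above U+10FFFF.
 */
static int
utf8_to_wc(wchar_t *pwc, const char *s, size_t n)
{
    /* Smallest code point that legitimately needs each length. */
    static const unsigned long min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
    const unsigned char *p = (const unsigned char *)s;
    unsigned long cp;
    int len, i;

    if (n == 0)
        return (-1);
    if (p[0] < 0x80) {
        cp = p[0]; len = 1;
    } else if ((p[0] & 0xE0) == 0xC0) {
        cp = p[0] & 0x1F; len = 2;
    } else if ((p[0] & 0xF0) == 0xE0) {
        cp = p[0] & 0x0F; len = 3;
    } else if ((p[0] & 0xF8) == 0xF0) {
        cp = p[0] & 0x07; len = 4;
    } else {
        return (-1);    /* stray continuation or invalid lead byte */
    }
    if ((size_t)len > n)
        return (-1);    /* sequence is cut short */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return (-1);    /* not a continuation byte */
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (cp < min_cp[len] || cp > 0x10FFFF ||
        (cp >= 0xD800 && cp <= 0xDFFF))
        return (-1);
    *pwc = (wchar_t)cp;
    return (len);
}

/*
 * Decode a NUL-terminated UTF-8 string into dst, which has room for
 * dstlen wide characters including the terminator.  Returns the number
 * of characters stored, or (size_t)-1 on bad input or lack of space.
 */
static size_t
utf8_to_wcs(wchar_t *dst, size_t dstlen, const char *src)
{
    size_t remaining = strlen(src), count = 0;
    int used;

    if (dstlen == 0)
        return ((size_t)-1);
    while (remaining > 0) {
        if (count + 1 >= dstlen)
            return ((size_t)-1);
        used = utf8_to_wc(&dst[count], src, remaining);
        if (used < 0)
            return ((size_t)-1);
        src += used;
        remaining -= (size_t)used;
        count++;
    }
    dst[count] = L'\0';
    return (count);
}
```

Because the decoder reports errors per sequence, a caller that prefers
to skip or replace bad input rather than fail can call utf8_to_wc()
directly and advance by one byte on error.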

I've found that wctomb() is more portable than a lot of
the other functions: it's in C89, whereas the restartable
routines such as wcrtomb() only arrived with the 1995
amendment and C99.  It also provides better error handling,
since it operates on one character at a time (so you
can, for instance, convert characters that aren't
supported in the current locale into '?' or some kind
of \-escape).
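
The per-character fallback could look something like the sketch below.
Again this is only an illustration, not code taken from libarchive: the
function name wcs_to_locale(), the escape flag, and the \uXXXX escape
format are all choices made for the example. It uses snprintf() for
brevity, which is C99; a strictly C89 version would use sprintf() with
the same buffer.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Convert the wide string src to the current locale one character at
 * a time.  A character that wctomb() rejects becomes '?' when escape
 * is 0, or a \uXXXX-style escape otherwise.  Returns the number of
 * bytes written (excluding the NUL), or (size_t)-1 if dst is too small.
 */
static size_t
wcs_to_locale(char *dst, size_t dstsize, const wchar_t *src, int escape)
{
    char tmp[MB_LEN_MAX + 32];  /* room for one character or one escape */
    size_t used = 0;
    int n;

    (void)wctomb(NULL, L'\0');  /* start from the initial shift state */
    for (; *src != L'\0'; src++) {
        n = wctomb(tmp, *src);
        if (n < 0) {
            /* Not representable here: substitute instead of failing. */
            (void)wctomb(NULL, L'\0');
            if (escape) {
                n = snprintf(tmp, sizeof(tmp), "\\u%04lX",
                    (unsigned long)*src);
            } else {
                tmp[0] = '?';
                n = 1;
            }
        }
        if (used + (size_t)n >= dstsize)
            return ((size_t)-1);    /* keep one byte for the NUL */
        memcpy(dst + used, tmp, (size_t)n);
        used += (size_t)n;
    }
    if (used >= dstsize)
        return ((size_t)-1);
    dst[used] = '\0';
    return (used);
}
```

With the program still in the default "C" locale, a wide string holding
'a', U+20AC and 'b' would come out as "a?b" or "a\u20ACb" depending on
the escape flag, instead of failing outright the way wcstombs() does.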

Feel free to copy any of my code from libarchive if it helps.

Tim