From: Tim Kientzle
Date: Wed, 05 Nov 2008 13:37:10 -0800
To: Maksim Yevmenkin
Cc: freebsd-hackers@freebsd.org
Subject: Re: converting strings from utf8

Maksim Yevmenkin wrote:
>
> can i use wcstombs(3) to convert a string presented in utf8 into
> current locale? basically i'm looking for something like iconv from
> ports but included into base system.

This isn't as easy as it should be, unfortunately. First, UTF-8 is
itself a multibyte encoding, so you have to first convert to wide
characters before you can use wcstombs().
You could in theory use the following:
  * Set locale to UTF-8
  * Use mbstowcs() to convert UTF-8 into wide characters
  * Set locale to your preferred locale
  * Use wcstombs() to convert wide characters to your locale

Besides being ugly, the locale names themselves are not standardized,
so it's hard to do this portably. For a lot of applications, the error
handling in wcstombs() is also troublesome; it rejects the entire
string if any one character can't be converted.

When I had to do this for libarchive, where the code had to be very
portable (which precluded using iconv), I ended up doing the following:
  * Wrote my own converter from UTF-8 to wide characters
    (fortunately, UTF-8 is pretty simple to decode; this is
    about 20-30 lines of C)
  * Used wctomb() to convert one character at a time from
    wide characters to the current locale.

I've found that wctomb() is more portable than a lot of the other
functions (I think it's in C89, whereas a lot of the other standard
conversion routines were introduced in C99) and provides better
error-handling capabilities since it operates on one character at a
time (so you can, for instance, convert characters that aren't
supported in the current locale into '?' or some kind of \-escape).

Feel free to copy any of my code from libarchive if it helps.

Tim
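
For illustration, the two-step approach described above (a hand-written
UTF-8 decoder feeding wctomb() one character at a time) might look
roughly like the sketch below. This is not the libarchive code; the
names utf8_decode() and utf8_to_locale() are hypothetical. It also
assumes that wchar_t values in the current locale are Unicode code
points, which the C standard does not guarantee and which may not hold
in every locale on every system.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Decode one UTF-8 sequence from s (at most n bytes) into *cp.
 * Returns the number of bytes consumed, or -1 if the sequence is
 * malformed.  Hypothetical helper, not libarchive's implementation.
 */
static int
utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
	unsigned long c, min;
	int len, i;

	if (n == 0)
		return (-1);
	if (s[0] < 0x80) {
		*cp = s[0];
		return (1);
	} else if ((s[0] & 0xE0) == 0xC0) {
		len = 2; c = s[0] & 0x1F; min = 0x80;
	} else if ((s[0] & 0xF0) == 0xE0) {
		len = 3; c = s[0] & 0x0F; min = 0x800;
	} else if ((s[0] & 0xF8) == 0xF0) {
		len = 4; c = s[0] & 0x07; min = 0x10000;
	} else
		return (-1);	/* stray continuation or invalid lead byte */
	if ((size_t)len > n)
		return (-1);	/* truncated sequence */
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return (-1);
		c = (c << 6) | (s[i] & 0x3F);
	}
	/* Reject overlong forms, surrogates, and out-of-range values. */
	if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return (-1);
	*cp = c;
	return (len);
}

/*
 * Convert NUL-terminated UTF-8 to the current locale's encoding,
 * writing '?' for anything malformed or not representable.
 * ASSUMPTION: wchar_t holds Unicode code points in this locale.
 * Returns 0 on success, -1 if the output buffer is too small.
 */
static int
utf8_to_locale(const char *in, char *out, size_t outsize)
{
	const unsigned char *p = (const unsigned char *)in;
	size_t left = strlen(in), used = 0;
	char buf[MB_LEN_MAX];
	unsigned long cp;
	int n, m;

	if (outsize == 0)
		return (-1);
	(void)wctomb(NULL, L'\0');	/* reset any shift state */
	while (left > 0) {
		n = utf8_decode(p, left, &cp);
		m = -1;
		if (n < 0)
			n = 1;		/* skip one malformed byte */
		else if (cp <= (unsigned long)WCHAR_MAX)
			m = wctomb(buf, (wchar_t)cp);
		if (m <= 0) {		/* not representable: substitute */
			(void)wctomb(NULL, L'\0');
			buf[0] = '?';
			m = 1;
		}
		if (used + (size_t)m >= outsize)
			return (-1);	/* no room for bytes plus NUL */
		memcpy(out + used, buf, (size_t)m);
		used += (size_t)m;
		p += n;
		left -= (size_t)n;
	}
	out[used] = '\0';
	return (0);
}
```

A caller would normally run setlocale(LC_CTYPE, "") once at startup and
then call utf8_to_locale(src, dst, sizeof(dst)). Because each character
goes through wctomb() separately, one unconvertible character costs
only a '?' rather than the whole string.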