Date: Wed, 8 Dec 2010 22:50:14 -0800
From: Tim Kientzle <tim@kientzle.com>
To: Gennady Proskurin <gprspb@mail.ru>
Cc: freebsd-arch@freebsd.org
Subject: Re: bsdtar and locale
Message-ID: <B6670CF3-3D42-4971-B0BF-A311FA3B8D48@kientzle.com>
In-Reply-To: <20101208204346.GA1762@gpr.nnz-home.ru>
References: <20101208204346.GA1762@gpr.nnz-home.ru>
On Dec 8, 2010, at 12:43 PM, Gennady Proskurin wrote:

> bsdtar ... if you archive some file with utf-8 name
> in "C" locale (env LC_ALL=C tar -c ...), and then extract it in some UTF-8
> locale, its name will be corrupted. Such a behaviour is somewhat documented in
> archive_entry(3) and bsdtar(1) manpages, so this is not a bug, but feature.
>
> I agree, such conversions can be useful in some cases, but should be disabled
> by default (we are unix, filenames are just binary data).
> It is very annoying, it makes you always think about locales while creating
> and extracting archive.

The extended tar format used by bsdtar comes from the POSIX standard:

http://www.opengroup.org/onlinepubs/9699919799/utilities/pax.html

The issue you mention is discussed in the standard:

> Translating filenames and other attributes from a locale's encoding to
> UTF-8 and then back again can lose information, as the resulting
> filename might not be byte-for-byte equivalent to the original. To avoid
> this problem, users can specify the -o hdrcharset=binary option, which
> will cause the resulting archive to use binary format for all names and
> attributes. Such archives are not portable among hosts that use
> different native encodings (e.g., EBCDIC versus ASCII-based encodings),
> but they will allow interchange among the vast majority of POSIX file
> systems in practical use. Also, the -o hdrcharset=binary option will
> cause pax in copy mode to behave more like other standard utilities such
> as cp.

bsdtar does not yet implement an option equivalent to the
-o hdrcharset=binary option, but most of the logic is already implemented
in libarchive. Libarchive's write support for pax format does
automatically switch to hdrcharset=binary for entries if the names
cannot be translated to UTF-8. It should be easy to add a way to
explicitly request this handling for all entries.

Cheers,

Tim
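To make the per-entry fallback described above concrete, here is a rough sketch in Python. It is not libarchive's code. The function name `encode_pax_name` and the label `UTF8_LABEL` are invented for this illustration, and `"BINARY"` is used only as a marker for the hdrcharset=binary case. The sketch assumes the archiver sees each name as raw bytes plus the charset of the current locale.

```python
from typing import Tuple

# Hypothetical labels for this sketch only.
UTF8_LABEL = "UTF-8"
BINARY_LABEL = "BINARY"


def encode_pax_name(raw_name: bytes, locale_charset: str) -> Tuple[bytes, str]:
    """Return (bytes to store in the pax header, charset label) for one name.

    The name is interpreted in the locale's charset and stored as UTF-8.
    If the bytes are not valid in that charset, they are stored untouched
    and the entry is marked as binary, so nothing is lost.
    """
    try:
        text = raw_name.decode(locale_charset)
    except UnicodeDecodeError:
        # Cannot translate: keep the original bytes byte-for-byte.
        return raw_name, BINARY_LABEL
    return text.encode("utf-8"), UTF8_LABEL


# A UTF-8 file name archived under the "C" locale (treated as ASCII here):
utf8_name = "файл.txt".encode("utf-8")
stored, label = encode_pax_name(utf8_name, "ascii")
print(label, stored == utf8_name)

# The same name on a KOI8-R system is translated to UTF-8 for the archive:
koi8_name = "файл.txt".encode("koi8-r")
stored, label = encode_pax_name(koi8_name, "koi8-r")
print(label, stored == utf8_name)
```

An explicit "binary for all entries" option would amount to skipping the `try` branch and always returning the raw bytes with the binary label.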