Date:      Wed, 8 Dec 2010 22:50:14 -0800
From:      Tim Kientzle <tim@kientzle.com>
To:        Gennady Proskurin <gprspb@mail.ru>
Cc:        freebsd-arch@freebsd.org
Subject:   Re: bsdtar and locale
Message-ID:  <B6670CF3-3D42-4971-B0BF-A311FA3B8D48@kientzle.com>
In-Reply-To: <20101208204346.GA1762@gpr.nnz-home.ru>
References:  <20101208204346.GA1762@gpr.nnz-home.ru>

On Dec 8, 2010, at 12:43 PM, Gennady Proskurin wrote:
> bsdtar ... if you archive some file with utf-8 name
> in "C" locale (env LC_ALL=C tar -c ...), and then extract it in some UTF-8
> locale, its name will be corrupted. Such behaviour is somewhat documented in
> archive_entry(3) and bsdtar(1) manpages, so this is not a bug, but a feature.
>
> I agree, such conversions can be useful in some cases, but should be disabled
> by default (we are unix, filenames are just binary data).
> It is very annoying, it makes you always think about locales while creating
> and extracting archives.

The extended tar format used by bsdtar comes from the POSIX standard:

http://www.opengroup.org/onlinepubs/9699919799/utilities/pax.html

The issue you mention is discussed in the standard:

> Translating filenames and other attributes from a locale's encoding to
> UTF-8 and then back again can lose information, as the resulting filename
> might not be byte-for-byte equivalent to the original. To avoid this
> problem, users can specify the -o hdrcharset=binary option, which will
> cause the resulting archive to use binary format for all names and
> attributes. Such archives are not portable among hosts that use different
> native encodings (e.g., EBCDIC versus ASCII-based encodings), but they
> will allow interchange among the vast majority of POSIX file systems in
> practical use. Also, the -o hdrcharset=binary option will cause pax in
> copy mode to behave more like other standard utilities such as cp.
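
As a rough illustration of the round-trip loss the standard describes, here is a small Python sketch. It is not libarchive's code path: the helper names are hypothetical, and Latin-1 merely stands in for "some single-byte charset the writer assumed", which is an assumption made only so the example runs.

```python
# Hypothetical illustration of a lossy charset round trip.
# This is NOT how bsdtar or libarchive is implemented.

def to_archive_utf8(name_bytes: bytes, writer_charset: str) -> bytes:
    """Translate an on-disk name from the writer's charset to UTF-8."""
    return name_bytes.decode(writer_charset).encode("utf-8")

def from_archive_utf8(stored: bytes, reader_charset: str) -> bytes:
    """Translate an archived UTF-8 name into the reader's charset."""
    return stored.decode("utf-8").encode(reader_charset)

# On-disk bytes of a Cyrillic filename that is actually encoded as UTF-8.
original = "\u0444\u0430\u0439\u043b.txt".encode("utf-8")

# Case 1: the writer assumes a different charset than the name really uses
# (Latin-1 is an assumed stand-in), and the reader extracts in UTF-8.
stored = to_archive_utf8(original, "latin-1")
restored = from_archive_utf8(stored, "utf-8")
print(restored == original)   # False: the name is no longer byte-identical

# Case 2: writer and reader agree on the charset, so the bytes survive.
stored_ok = to_archive_utf8(original, "utf-8")
restored_ok = from_archive_utf8(stored_ok, "utf-8")
print(restored_ok == original)  # True
```

Storing the name as raw bytes skips both translation steps, which is why binary names avoid this problem as long as the hosts share an encoding.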

bsdtar does not yet implement an option equivalent to the -o
hdrcharset=binary option, but most of the logic is already implemented
in libarchive.  Libarchive's write support for pax format does
automatically switch to hdrcharset=binary for entries if the names
cannot be translated to UTF-8. It should be easy to add a way to
explicitly request this handling for all entries.
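
The per-entry fallback can be sketched roughly as follows in Python. This is only a model of the idea, not libarchive's C implementation: `encode_pax_name` and its `force_binary` flag are hypothetical names, the returned labels are illustrative rather than the exact header values, and using "ascii" to represent the "C" locale is an assumption.

```python
# Hypothetical model of the per-entry fallback; not libarchive's actual code.

def encode_pax_name(name_bytes: bytes, locale_charset: str,
                    force_binary: bool = False):
    """Return (charset_label, stored_bytes) for one archive entry.

    Try to translate the name to UTF-8. If that fails, or if the caller
    explicitly asks for it, keep the raw bytes and mark the entry binary.
    """
    if not force_binary:
        try:
            text = name_bytes.decode(locale_charset)
            return "utf-8", text.encode("utf-8")
        except UnicodeError:
            pass  # Name cannot be translated: fall through to binary.
    return "binary", name_bytes

raw = "\u0444\u0430\u0439\u043b.txt".encode("utf-8")

# Plain ASCII name in an ASCII-only locale: translation succeeds.
print(encode_pax_name(b"readme.txt", "ascii"))

# Non-ASCII name in an ASCII-only locale: falls back to binary per entry.
print(encode_pax_name(raw, "ascii")[0])

# Explicit request: every entry is stored as binary, even translatable ones.
print(encode_pax_name(b"readme.txt", "ascii", force_binary=True)[0])
```

The `force_binary` flag corresponds to the missing piece described above: a way to request the binary handling for all entries instead of only those that fail translation.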

Cheers,

Tim



