From: Tim Kientzle
Date: Wed, 8 Dec 2010 22:50:14 -0800
To: Gennady Proskurin
Cc: freebsd-arch@freebsd.org
Subject: Re: bsdtar and locale
In-Reply-To: <20101208204346.GA1762@gpr.nnz-home.ru>

On Dec 8, 2010, at 12:43 PM, Gennady Proskurin wrote:

> bsdtar ... if you archive some file with utf-8 name
> in "C" locale (env LC_ALL=C tar -c ...), and then extract it in some UTF-8
> locale, it's name will be corrupted. Such a behaviour is somewhat documented in
> archive_entry(3) and bsdtar(1) manpages, so this is not a bug, but feature.
>
> I agree, such conversions can be usefull in some cases, but should be disabled
> by default (we are unix, filenames are just binary data).
> It is very annoying, it makes you to always think about locales while creating
> and extracting archive.

The extended tar format used by bsdtar comes from the POSIX standard:

http://www.opengroup.org/onlinepubs/9699919799/utilities/pax.html

The issue you mention is discussed in the standard:

> Translating filenames and other attributes from a locale's encoding to
> UTF-8 and then back again can lose information, as the resulting
> filename might not be byte-for-byte equivalent to the original. To avoid
> this problem, users can specify the -o hdrcharset=binary option, which
> will cause the resulting archive to use binary format for all names and
> attributes. Such archives are not portable among hosts that use
> different native encodings (e.g., EBCDIC versus ASCII-based encodings),
> but they will allow interchange among the vast majority of POSIX file
> systems in practical use. Also, the -o hdrcharset=binary option will
> cause pax in copy mode to behave more like other standard utilities such
> as cp.

bsdtar does not yet implement an option equivalent to the
-o hdrcharset=binary option, but most of the logic is already implemented
in libarchive. Libarchive's write support for pax format does
automatically switch to hdrcharset=binary for entries if the names
cannot be translated to UTF-8. It should be easy to add a way to
explicitly request this handling for all entries.

Cheers,

Tim
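
For illustration, the per-entry fallback described above can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not libarchive's code: the function name pax_path_record and the "UTF-8"/"BINARY" labels are hypothetical, and the locale is modelled as nothing more than a codec name.

```python
def pax_path_record(raw_name: bytes, locale_encoding: str):
    """Decide how one entry name would be stored in a pax header.

    Hypothetical helper: returns (charset_label, stored_bytes).
    """
    try:
        # Interpret the on-disk bytes using the current locale's encoding.
        text = raw_name.decode(locale_encoding)
    except UnicodeDecodeError:
        # The name cannot be translated, so keep the bytes untouched
        # and mark this entry as binary.
        return ("BINARY", raw_name)
    # Translation worked: store the name re-encoded as UTF-8.
    return ("UTF-8", text.encode("utf-8"))


# A Cyrillic file name, written with escapes to keep this message ASCII.
name_text = "\u0444\u0430\u0439\u043b.txt"
utf8_name = name_text.encode("utf-8")
koi8_name = name_text.encode("koi8_r")

# "C" locale modelled as ASCII: the UTF-8 bytes cannot be decoded.
print(pax_path_record(utf8_name, "ascii"))
# UTF-8 locale: the bytes pass through unchanged.
print(pax_path_record(utf8_name, "utf-8"))
# KOI8-R locale: the name is translated to UTF-8 for the archive.
print(pax_path_record(koi8_name, "koi8_r"))
```

An explicit option for all entries, as suggested above, would amount to taking the "BINARY" branch unconditionally, so the stored bytes always match what is on disk.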