From: Tim Kientzle
Date: Sat, 15 Mar 2008 01:43:59 +0000 (UTC)
To: src-committers@FreeBSD.org, cvs-src@FreeBSD.org, cvs-all@FreeBSD.org
Subject: cvs commit: src/lib/libarchive archive_read_support_format_tar.c archive_write_set_format_pax.c src/lib/libarchive/test Makefile test_pax_filename_encoding.c test_pax_filename_encoding.tar.gz.uu
Message-Id: <200803150143.m2F1hxt7062719@repoman.freebsd.org>
X-FreeBSD-CVS-Branch: HEAD

kientzle    2008-03-15 01:43:59 UTC

  FreeBSD src repository

  Modified files:
    lib/libarchive       archive_read_support_format_tar.c
                         archive_write_set_format_pax.c
    lib/libarchive/test  Makefile
  Added files:
    lib/libarchive/test  test_pax_filename_encoding.c
                         test_pax_filename_encoding.tar.gz.uu
  Log:
  A subtle point: "pax interchange format" mandates that all strings
  (including pathname, gname, uname) be stored in UTF-8.
  This usually doesn't cause problems on FreeBSD because the "C" locale
  on FreeBSD can convert any byte to Unicode/wchar_t and from there to
  UTF-8. In other locales (including the "C" locale on Linux which is
  really ASCII), you can get into trouble with pathnames that cannot be
  converted to UTF-8.

  Libarchive's pax writer truncated pathnames and other strings at the
  first nonconvertible character. (ouch!) Other archivers have worked
  around this by storing unconvertible pathnames as raw binary, a
  practice which has been sanctioned by the Austin group. However,
  libarchive's pax reader would segfault reading headers that weren't
  proper UTF-8. (ouch!) Since bsdtar defaults to pax format, this
  affects bsdtar rather heavily.

  To correctly support the new "hdrcharset" header that is going into
  SUS and to handle conversion failures in general, libarchive's pax
  reader and writer have been overhauled fairly extensively. They used
  to do most of the pax header processing using wchar_t (Unicode); they
  now do most of it using char so that common logic applies to either
  UTF-8 or "binary" strings. As a bonus, a number of extraneous
  conversions to/from wchar_t have been eliminated, which should speed
  things up just a tad.

  Thanks to: Bjoern Jacke for originally reporting this to me
  Thanks to: Joerg Sonnenberger for noting a bad typo in my first draft of this
  Thanks to: Gunnar Ritter for getting the standard fixed
  MFC after: 5 days

  Revision  Changes    Path
  1.67      +240 -209  src/lib/libarchive/archive_read_support_format_tar.c
  1.43      +126 -50   src/lib/libarchive/archive_write_set_format_pax.c
  1.17      +1 -0      src/lib/libarchive/test/Makefile
  1.1       +161 -0    src/lib/libarchive/test/test_pax_filename_encoding.c (new)
  1.1       +10 -0     src/lib/libarchive/test/test_pax_filename_encoding.tar.gz.uu (new)
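
For readers unfamiliar with the mechanism the log describes, the binary fallback can be sketched roughly as follows. This is an illustration only and is not taken from the libarchive sources: the function names is_valid_utf8, pax_record and pax_path_records are hypothetical, the UTF-8 check is deliberately simplified (it does not reject overlong encodings or surrogates), and it works on char strings throughout rather than going through wchar_t. It assumes the pax extended header record layout "<length> <keyword>=<value>\n", where the length counts the whole record including its own digits, and the "hdrcharset=BINARY" marker for values that are not UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: 1 if s[0..n) looks like UTF-8, else 0.
 * Simplified: checks lead/continuation byte structure only. */
static int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char c = s[i];
        size_t extra;
        if (c < 0x80) extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else return 0;                       /* invalid lead byte */
        if (extra > n - i - 1) return 0;     /* truncated sequence */
        for (size_t k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80) return 0;
        i += extra + 1;
    }
    return 1;
}

/* Hypothetical helper: write one record "<len> <key>=<value>\n".
 * <len> includes its own digits, so iterate until the digit count
 * is stable. Returns the record length, or 0 if it does not fit. */
static size_t pax_record(char *out, size_t outsz, const char *key,
                         const char *value, size_t vlen)
{
    size_t body = 1 + strlen(key) + 1 + vlen + 1; /* ' ' key '=' value '\n' */
    size_t digits = 1, total;
    for (;;) {
        size_t d = 1, t;
        total = body + digits;
        for (t = total; t >= 10; t /= 10) d++;
        if (d == digits) break;
        digits = d;
    }
    if (total + 1 > outsz) return 0;
    int n = snprintf(out, outsz, "%zu %s=", total, key);
    assert(n > 0 && (size_t)n + vlen + 1 == total);
    memcpy(out + n, value, vlen);   /* raw bytes, no conversion */
    out[total - 1] = '\n';
    out[total] = '\0';
    return total;
}

/* Hypothetical writer step: emit the path as-is; if it is not UTF-8,
 * announce that first with hdrcharset=BINARY instead of truncating. */
static size_t pax_path_records(char *out, size_t outsz, const char *path)
{
    size_t plen = strlen(path), used = 0, n;
    if (!is_valid_utf8((const unsigned char *)path, plen)) {
        n = pax_record(out, outsz, "hdrcharset", "BINARY", 6);
        if (n == 0) return 0;
        used += n;
    }
    n = pax_record(out + used, outsz - used, "path", path, plen);
    return n == 0 ? 0 : used + n;
}
```

Calling pax_path_records(buf, sizeof buf, "abc") would produce the single record "12 path=abc\n", while a name containing a stray 0xFF byte would be preceded by "21 hdrcharset=BINARY\n" and stored byte-for-byte. A reader following the same idea would treat the value as an opaque char string whenever that marker is present, rather than attempting a conversion that can fail.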