From: Garrett Wollman
Date: Wed, 6 May 2009 13:43:05 -0400
To: Oliver Fromme
Cc: freebsd-standards@freebsd.org, juli@clockworksquid.com
Subject: Re: Shouldn't cat(1) use the C locale?
Message-ID: <18945.52265.44038.498643@khavrinen.csail.mit.edu>
In-Reply-To: <200905061707.n46H7jqs042942@lurza.secnetix.de>
References: <18945.44648.875780.605560@khavrinen.csail.mit.edu> <200905061707.n46H7jqs042942@lurza.secnetix.de>

Oliver Fromme said:

> Normally cat is agnostic of the encoding of its input data,
> because it is handled like binary data.  But if the -v
> option is used, it has to actually look at the data in
> order to decide what is printable and what is not.
> This has two consequences:  First, it has to know the
> encoding of the input, and second, it has to know what
> is considered "printable".

I think that should be fairly obvious: the input is a stream of bytes,
which may or may not encode characters in any locale.

> The same is true for binary files.  For example, if you have
> a binary with embedded ISO8859 strings that you want to display
> on a UTF8 terminal, then the following works:
>
>    LC_CTYPE=en_US.ISO8859-1 cat -v file | recode iso8859-1..utf8
>
> It correctly displays German Umlauts and some other characters,
> but escapes 8bit characters that are non-printable in the
> ISO8859-1 locale.

Now try the same thing on a binary with UTF-8 strings in it.

(UTF-8 at least gives you a validity constraint on possible multibyte
characters, which arbitrary multibyte encodings do not necessarily
provide.  This mitigates the "reading frame" problem, because the
first byte of an actual UTF-8 character cannot be the n'th byte of any
UTF-8 character.)

-GAWollman
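
The "reading frame" remark in the parenthetical can be illustrated with a
short sketch. It is not taken from cat(1) or from the thread; the function
names are hypothetical, and it relies only on the fact that UTF-8
continuation bytes all have the bit pattern 10xxxxxx while lead bytes never
do. It does not check full UTF-8 validity (overlong forms, surrogates and so
on), only where a character can start.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which all match 0b10xxxxxx."""
    return (byte & 0xC0) == 0x80


def next_boundary(data: bytes, offset: int) -> int:
    """Return the first index >= offset that is not a continuation byte.

    Hypothetical helper: starting anywhere in a UTF-8 byte stream, skip
    forward until a byte that can begin a character is found.
    """
    while offset < len(data) and is_continuation(data[offset]):
        offset += 1
    return offset


# "Grüße" encodes to b'Gr\xc3\xbc\xc3\x9fe'; index 3 is the second byte of "ü".
sample = "Grüße".encode("utf-8")
start = next_boundary(sample, 3)

# Decoding from the recovered boundary succeeds even though we started
# in the middle of a multibyte character.
print(sample[start:].decode("utf-8"))
```

An encoding whose trailing bytes can also appear as lead bytes offers no
such test, so a reader dropped at an arbitrary offset cannot tell whether it
is looking at the start of a character.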