Date: Wed, 6 May 2009 13:43:05 -0400
From: Garrett Wollman <wollman@csail.mit.edu>
To: Oliver Fromme <olli@lurza.secnetix.de>
Cc: freebsd-standards@freebsd.org, juli@clockworksquid.com
Subject: Re: Shouldn't cat(1) use the C locale?
Message-ID: <18945.52265.44038.498643@khavrinen.csail.mit.edu>
In-Reply-To: <200905061707.n46H7jqs042942@lurza.secnetix.de>
References: <18945.44648.875780.605560@khavrinen.csail.mit.edu> <200905061707.n46H7jqs042942@lurza.secnetix.de>
<<On Wed, 6 May 2009 19:07:45 +0200 (CEST), Oliver Fromme <olli@lurza.secnetix.de> said:

> Normally cat is agnostic of the encoding of its input data,
> because it is handled like binary data.  But if the -v
> option is used, it has to actually look at the data in
> order to decide what is printable and what is not.
>
> This has two consequences:  First, it has to know the
> encoding of the input, and second, it has to know what
> is considered "printable".

I think that should be fairly obvious: the input is a stream of
bytes, which may or may not encode characters in any locale.

> The same is true for binary files.  For example, if you have
> a binary with embedded ISO8859 strings that you want to display
> on a UTF8 terminal, then the following works:
>
>     LC_CTYPE=en_US.ISO8859-1 cat -v file | recode iso8859-1..utf8
>
> It correctly displays German Umlauts and some other characters,
> but escapes 8bit characters that are non-printable in the
> ISO8859-1 locale.

Now try the same thing on a binary with UTF-8 strings in it.

(UTF-8 at least gives you a validity constraint on possible multibyte
characters, which arbitrary multibyte encodings do not necessarily
provide.  This mitigates the "reading frame" problem, because the
first byte of an actual UTF-8 character cannot be the n'th byte of any
UTF-8 character.)

-GAWollman
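
For concreteness, the validity constraint described in the parenthetical
above can be shown with a couple of byte-range checks.  The following is
only a minimal C sketch, not code from cat(1) or from this thread; it
assumes UTF-8 as defined in RFC 3629 and shows why a reader that lands
in the middle of a multibyte character can regain the correct reading
frame at the next lead byte.

    #include <stdbool.h>
    #include <stdio.h>

    /*
     * Byte classification per RFC 3629: continuation bytes are
     * 0x80-0xBF; lead bytes are 0x00-0x7F (ASCII) or 0xC2-0xF4.
     * The two ranges are disjoint, so a lead byte can never be
     * mistaken for the interior byte of some other character.
     */
    static bool is_continuation(unsigned char b)
    {
        return (b & 0xc0) == 0x80;              /* 10xxxxxx */
    }

    static bool is_lead(unsigned char b)
    {
        return b < 0x80 ||                      /* one-byte (ASCII) */
               (b >= 0xc2 && b <= 0xf4);        /* starts a 2-4 byte sequence */
    }

    int main(void)
    {
        /* "Gruesse" spelled with u-umlaut and sharp s ("Grüße") in UTF-8:
         * 0xc3 0xbc is the umlaut, 0xc3 0x9f is the sharp s. */
        static const unsigned char buf[] =
            { 'G', 'r', 0xc3, 0xbc, 0xc3, 0x9f, 'e' };

        /* Pretend we started reading inside the first multibyte
         * character, at its continuation byte. */
        size_t i = 3;
        while (i < sizeof(buf) && is_continuation(buf[i])) {
            printf("byte %zu (0x%02x): continuation byte, skipping\n",
                   i, (unsigned)buf[i]);
            i++;
        }
        if (i < sizeof(buf) && is_lead(buf[i]))
            printf("byte %zu (0x%02x): lead byte, reading frame recovered\n",
                   i, (unsigned)buf[i]);
        return 0;
    }

A legacy multibyte encoding such as Shift JIS lacks this property: the
second byte of a character can fall in the same range as a lead byte
(or as plain ASCII), so there is no comparable way to resynchronize,
which is the contrast the message above draws between UTF-8 and
arbitrary multibyte encodings.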