Date:      Wed, 6 May 2009 19:07:45 +0200 (CEST)
From:      Oliver Fromme <olli@lurza.secnetix.de>
To:        freebsd-standards@FreeBSD.ORG, juli@clockworksquid.com
Subject:   Re: Shouldn't cat(1) use the C locale?
Message-ID:  <200905061707.n46H7jqs042942@lurza.secnetix.de>
In-Reply-To: <18945.44648.875780.605560@khavrinen.csail.mit.edu>

Garrett Wollman wrote:
 > This is a Bad Idea.  cat -v ought to work properly when the input does
 > not consist of "characters" at all.

It depends on your definition of "properly".  For me, it
already works properly (using an ISO8859 locale).
It also works properly for people using a US-ASCII (or C)
locale.  It does not seem to work properly for Juli, who
is using a multibyte UTF-8 locale.

Normally cat is agnostic of the encoding of its input data,
because it treats the data as binary.  But if the -v
option is used, it has to actually look at the data in
order to decide what is printable and what is not.
This has two consequences:  First, it has to know the
encoding of the input, and second, it has to know what
is considered "printable".
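
To illustrate the difference, here is a minimal sketch, assuming a
POSIX shell and a cat that supports -v.  The sample bytes and the
variable names are made up for the illustration, and LC_ALL=C is set
so that the result doesn't depend on the caller's locale:

```shell
# Hypothetical sample: 'A', 0x01 (control), 0x7f (DEL), 0x81 (high bit set), newline.
sample() { printf 'A\001\177\201\n'; }

# Without -v, cat copies the bytes through untouched; od only makes them visible here.
plain=$(sample | LC_ALL=C cat | od -An -tx1 | tr -d ' \n')

# With -v, cat has to classify every byte and escape what is not printable.
visible=$(sample | LC_ALL=C cat -v)

echo "plain:   $plain"
echo "visible: $visible"
```

In the C locale the first line shows the unchanged bytes (41017f810a),
and the second shows the escaped form (A^A^?M-^A).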

The problem is that cat has no knowledge of the encoding
of its input data.  Strictly speaking, the locale (LC_CTYPE)
specifies only the properties of the output device.
Furthermore, conversion between different encodings would
be beyond the scope of cat (there are other tools for this).

Therefore the only reasonable thing to do is to assume that
input and output use the same encoding.  So, if you're
working in a UTF-8 locale and use cat to display a file on
the screen, that file should be UTF-8-encoded or UTF-8-compatible
(such as US-ASCII); otherwise it will look wrong, whether or
not you use the -v option.

The same is true for binary files.  For example, if you have
a binary with embedded ISO8859 strings that you want to display
on a UTF-8 terminal, then the following works:

    LC_CTYPE=en_US.ISO8859-1 cat -v file | recode iso8859-1..utf8

It correctly displays German umlauts and some other characters,
but escapes 8-bit characters that are non-printable in the
ISO8859-1 locale.

If you want to filter for US-ASCII characters only, then
it's even easier because UTF-8 is US-ASCII-compatible, so
you don't need to use recode:

    LC_CTYPE=C cat -v file
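
As a small sketch of that filter (again assuming a POSIX shell; the
input string is just an example, and LC_ALL is used here instead of
LC_CTYPE only because LC_ALL would override LC_CTYPE if it happened
to be set in the environment):

```shell
# Example input: "Grüße" encoded as UTF-8 (ü = 0xc3 0xbc, ß = 0xc3 0x9f).
# In the C locale only US-ASCII is printable, so each non-ASCII byte is escaped.
filtered=$(printf 'Gr\303\274\303\237e\n' | LC_ALL=C cat -v)
echo "$filtered"
```

The ASCII letters pass through unchanged, while every byte of the
two multibyte sequences is shown in M- notation.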

If you don't use a multibyte locale, and if your files aren't
multibyte-encoded either, then you don't have any of the above
problems, of course, and cat will work either way.

Best regards
   Oliver

-- 
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606,  Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün-
chen, HRB 125758,  Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD-Dienstleistungen, -Produkte und mehr:  http://www.secnetix.de/bsd

"If Java had true garbage collection, most programs
would delete themselves upon execution."
        -- Robert Sewell


