Date: Wed, 6 May 2009 19:07:45 +0200 (CEST)
From: Oliver Fromme <olli@lurza.secnetix.de>
To: freebsd-standards@FreeBSD.ORG, juli@clockworksquid.com
Subject: Re: Shouldn't cat(1) use the C locale?
Message-ID: <200905061707.n46H7jqs042942@lurza.secnetix.de>
In-Reply-To: <18945.44648.875780.605560@khavrinen.csail.mit.edu>

Garrett Wollman wrote:
 > This is a Bad Idea.  cat -v ought to work properly when the input does
 > not consist of "characters" at all.

It depends on your definition of "properly". For me, it already
works properly (using an ISO8859 locale). It also works properly
for people using a US-ASCII (or C) locale. It does not seem to work
properly for Juli, who is using a multibyte UTF-8 locale.

Normally cat is agnostic of the encoding of its input data, because
the input is handled like binary data. But if the -v option is used,
cat has to actually look at the data in order to decide what is
printable and what is not. This has two consequences: first, it has
to know the encoding of the input, and second, it has to know what is
considered "printable".

The problem is that cat has no knowledge of the encoding of its input
data. Strictly speaking, the locale (LC_CTYPE) specifies only the
properties of the output device. Furthermore, conversion between
different encodings would be beyond the scope of cat (there are other
tools for this). Therefore the only reasonable thing to do is to
assume that input and output use the same encoding.

So, if you're working in a UTF-8 locale and use cat to display a file
on the screen, that file should be UTF-8-encoded or UTF-8-compatible
(such as US-ASCII); otherwise it will look wrong, whether or not you
use the -v option.

The same is true for binary files. For example, if you have a binary
with embedded ISO8859 strings that you want to display on a UTF-8
terminal, then the following works:

   LC_CTYPE=en_US.ISO8859-1 cat -v file | recode iso8859-1..utf8

It correctly displays German umlauts and some other characters, but
escapes 8-bit characters that are non-printable in the ISO8859-1
locale.

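To illustrate why -v cannot avoid assuming an encoding, here is a
minimal sketch of the escaping as it would look in the C locale, where
no byte above 0x7f counts as printable. It assumes the traditional
^X / M-x notation; the function name cat_v_c_locale is made up for this
example, and the code is not taken from the actual cat(1) source.

```python
def cat_v_c_locale(data: bytes) -> str:
    """Escape bytes roughly the way `cat -v` does in the C locale.

    Hypothetical helper for illustration only; it assumes the usual
    ^X / M-x notation and ignores the -e and -t options.
    """
    out = []
    for b in data:
        if b in (0x09, 0x0A):
            # Tab and newline pass through unchanged.
            out.append(chr(b))
            continue
        if b >= 0x80:
            # In the C locale no 8-bit byte is printable:
            # emit "M-" and continue with the low seven bits.
            out.append("M-")
            b &= 0x7F
        if b < 0x20 or b == 0x7F:
            # Control characters become ^X, DEL becomes ^?.
            out.append("^" + ("?" if b == 0x7F else chr(b | 0x40)))
        else:
            out.append(chr(b))
    return "".join(out)


# The same character gives different escapes depending on how the
# file is encoded, because only the bytes are visible to the filter.
print(cat_v_c_locale("ä".encode("latin-1")))  # M-d
print(cat_v_c_locale("ä".encode("utf-8")))    # M-CM-$
```

With an ISO8859-1 or UTF-8 LC_CTYPE, the "is this printable?" decision
would instead depend on the locale's character classification, which
is where the assumption about the input encoding comes in.
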
If you want to filter for US-ASCII characters only, then it's even
easier because UTF-8 is US-ASCII-compatible, so you don't need to use
recode:

   LC_CTYPE=C cat -v file

If you don't use a multibyte locale, and if your files aren't
multibyte-encoded either, then you don't have any of the above
problems, of course, and cat will work either way.

Best regards
   Oliver

-- 
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606, Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht
München, HRB 125758, Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD-Dienstleistungen, -Produkte und mehr: http://www.secnetix.de/bsd

"If Java had true garbage collection, most programs
would delete themselves upon execution."
        -- Robert Sewell