From: Oliver Fromme
To: freebsd-standards@FreeBSD.ORG, juli@clockworksquid.com
Date: Wed, 6 May 2009 19:07:45 +0200 (CEST)
Subject: Re: Shouldn't cat(1) use the C locale?

Garrett Wollman wrote:
 > This is a Bad Idea.  cat -v ought to work properly when the input does
 > not consist of "characters" at all.

It depends on your definition of properly.
For me, it already works properly (using an ISO8859 locale). It also
works properly for people using a US-ASCII (or C) locale. It does not
seem to work properly for Juli, who is using a multibyte UTF-8 locale.

Normally cat is agnostic of the encoding of its input, because it
handles the data as binary. But if the -v option is used, it has to
actually look at the data in order to decide what is printable and
what is not. This has two consequences: first, it has to know the
encoding of the input, and second, it has to know what is considered
"printable".

The problem is that cat has no knowledge of the encoding of its input
data. Strictly speaking, the locale (LC_CTYPE) specifies only the
properties of the output device. Furthermore, conversion between
different encodings is beyond the scope of cat (there are other tools
for that). Therefore the only reasonable thing to do is to assume
that input and output use the same encoding.

So, if you're working in a UTF-8 locale and use cat to display a file
on the screen, that file should be UTF-8-encoded or UTF-8-compatible
(such as US-ASCII); otherwise it will look wrong, whether or not you
use the -v option.

The same is true for binary files. For example, if you have a binary
with embedded ISO8859 strings that you want to display on a UTF-8
terminal, then the following works:

   LC_CTYPE=en_US.ISO8859-1 cat -v file | recode iso8859-1..utf8

It correctly displays German umlauts and some other characters, but
escapes 8-bit characters that are non-printable in the ISO8859-1
locale.

If you want to filter for US-ASCII characters only, it's even easier:
UTF-8 is US-ASCII-compatible, so you don't need recode:

   LC_CTYPE=C cat -v file

If you don't use a multibyte locale, and your files aren't
multibyte-encoded either, then you don't have any of the above
problems, of course, and cat will work either way.

Best regards
   Oliver

-- 
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
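To illustrate the "LC_CTYPE=C cat -v file" case from the message, here is a minimal Python sketch of the escaping that cat -v applies when every byte with the high bit set counts as non-printable. It models the conventional ^X and M-x notation only; it is not the FreeBSD cat source, and the function name cat_v is hypothetical. It also shows why cat cannot guess the input encoding: the same letter produces different escapes depending on how the file was encoded.

```python
def cat_v(data: bytes) -> str:
    """Sketch of `cat -v` output in the C locale (assumed model, not real cat code)."""
    out = []
    for b in data:
        if b in (0x09, 0x0A):
            # Plain -v leaves tab and newline untouched.
            out.append(chr(b))
            continue
        if b >= 0x80:
            # High bit set: not printable in the C locale, so prefix "M-"
            # and continue with the low seven bits.
            out.append("M-")
            b -= 0x80
        if b < 0x20:
            # Control character: caret notation, e.g. 0x01 -> ^A.
            out.append("^" + chr(b + 0x40))
        elif b == 0x7F:
            # DEL is written as ^?.
            out.append("^?")
        else:
            out.append(chr(b))
    return "".join(out)


# The same character, stored in two different encodings:
print(cat_v("é".encode("latin-1")))  # M-i
print(cat_v("é".encode("utf-8")))    # M-CM-)
```

In both cases the result is pure US-ASCII, which is why this variant is safe to show on a UTF-8 terminal without recode.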