From: Oliver Fromme
Date: Wed, 6 May 2009 10:31:38 +0200 (CEST)
To: freebsd-standards@FreeBSD.ORG, juli@clockworksquid.com
Subject: Re: Shouldn't cat(1) use the C locale?
Message-Id: <200905060831.n468VcRE018431@lurza.secnetix.de>

Juli Mallett wrote:
> The cat manpage suggests that the infamous, non-standard -v extension
> is ASCII-oriented but cat(1) these days uses isprint and pals and
> calls setlocale(LC_CTYPE, ""), which for those of us with dodgy
> environments (mine includes LC_ALL=en_US.UTF-8), means that "cat -v"
> behaves radically-differently to the manual page describes.
>
> Does anyone see any reason for our extensions, etc., to work with
> LC_CTYPE != C?  It doesn't make a lot of sense to me.  I'd like to
> change it if there's not a good reason to keep it broken this way,
> like:
>
> -	setlocale(LC_CTYPE, "");
> +	setlocale(LC_CTYPE, "C");
>
> Thoughts, etc.?

This is a difficult matter.  I guess when you ask n people,
you will get n different opinions.  Well, here's mine ...

I think this is a bug in the manual page.  When cat(1) uses
the current locale, that's perfectly correct behaviour in a
world that is clearly moving away from ASCII, towards Unicode.
"Fixing" it by always using the C (ASCII) locale would be a
step backwards.  Instead it is better to work on bringing all
of the tools into compliance with multibyte character encodings
in general, and with UTF-8 in particular, which seems to be the
most important Unicode encoding these days (and probably
UTF-16, too).

So I think the manual page should be fixed so that it says
the -v option handles non-printing characters in the current
locale, and cat needs to be fixed to handle multibyte
characters correctly if the -v option is used with a UTF-8
locale.

By the way, your patch would probably be a POLA (principle
of least astonishment) violation.
I currently have LC_CTYPE=de_DE.ISO8859-15 on most of my
machines (because FreeBSD's UTF-8 support is too incomplete
at the moment), and I occasionally use "cat -v" to look for
non-printable characters in that locale.  In fact I have a
zsh function:  "diff -u =(cat $1) =(cat -v $1)"
Your patch would break that.

I'm already somewhat annoyed that locale support was broken
in strings(1).  Some time ago, it used the current locale, so
I could use it on German texts with my LC_CTYPE setting.
At some point, they probably introduced a patch similar to
yours and instead provided the -e option, which does not
work as expected ("-e S" is completely useless because it
prints characters that are non-printable in ISO8859 locales).
Since then I have been forced to use cat -v for that purpose.
Now you're proposing to break that, too.

I hope that explains a little bit why I'm against that
change.  ;-)

Best regards
   Oliver

PS:  If you set LC_* to a UTF-8 locale, but your environment
(i.e. tools and data) is not UTF-8-compliant, breakage is
expected.  If you still want to keep that LC_* setting, a
workaround would be to make aliases such as
cat='LC_CTYPE=C cat' for the tools that seem to be broken.

I also recommend *not* setting LC_ALL, but setting LANG
instead.  The difference is that you can override LANG for
individual categories, as in the above example
("LC_CTYPE=C cat").  You cannot override LC_ALL, because
LC_ALL overrides everything else.  See the environ(7) manual
page for details.

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Commercial register: Munich registry court, HRA 74606, Management:
secnetix Verwaltungsgesellsch. mbH, Commercial register: Munich registry
court, HRB 125758, Managing directors: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD services, products and more:  http://www.secnetix.de/bsd

"Perl will consistently give you what you want,
unless what you want is consistency."
        -- Larry Wall
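
To make the point of disagreement concrete, here is a minimal sketch in C of how "cat -v"-style output depends on the active LC_CTYPE.  It is only loosely modelled on the conventional -v notation (^X for control characters, ^? for DEL, M- for bytes with the high bit set) and is not taken from cat's source; the helper names vis_byte and vis_buf are made up for this illustration.  Passing "" as the locale picks up the environment, as cat does today; passing "C" corresponds to the proposed patch.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from cat's source): write the "cat -v"
 * style rendering of one byte to out and return the number of bytes
 * written (at most 4, e.g. "M-^D").  Classification relies on
 * isprint()/iscntrl(), so the result follows the active LC_CTYPE.
 */
static size_t
vis_byte(unsigned char c, char *out)
{
	size_t n = 0;

	if (c == '\n' || c == '\t') {	/* -v leaves newline and tab alone */
		out[n++] = (char)c;
		return n;
	}
	if (c >= 0x80 && !isprint(c)) {	/* high bit set and not printable */
		out[n++] = 'M';
		out[n++] = '-';
		c &= 0x7f;		/* continue with the low seven bits */
	}
	if (iscntrl(c)) {
		out[n++] = '^';
		out[n++] = (c == 0x7f) ? '?' : (char)(c | 0x40);
	} else {
		out[n++] = (char)c;
	}
	return n;
}

/*
 * Hypothetical helper: render len bytes from in into out under the
 * given LC_CTYPE locale.  out must hold at least 4 * len + 1 bytes.
 * Returns 0 on success, or -1 if the locale is not available.
 * Side effect: changes the process-wide LC_CTYPE setting.
 */
static int
vis_buf(const char *locale, const unsigned char *in, size_t len, char *out)
{
	size_t i, n = 0;

	if (setlocale(LC_CTYPE, locale) == NULL)
		return -1;
	for (i = 0; i < len; i++)
		n += vis_byte(in[i], out + n);
	out[n] = '\0';
	return 0;
}
```

With the "C" locale, byte 0xE4 (the letter ä in ISO8859-15) is not printable and comes out as "M-d".  With a locale such as de_DE.ISO8859-15, assuming it is installed, isprint() should accept that byte and it would be passed through unchanged, which is the behaviour the zsh function above relies on.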