From: Oliver Fromme
To: freebsd-stable@FreeBSD.ORG
Date: Tue, 7 Feb 2006 16:47:59 +0100 (CET)
Subject: Re: tr(1) buggy with de_DE.ISO8859-1(5) locale?
Message-Id: <200602071547.k17FlxwV017166@lurza.secnetix.de>
In-Reply-To: <43E7FDAA.3010409@gmx.de>

Martin Krzysiak wrote:
 > Oliver Fromme wrote:
 > > It's not a bug.  It's perfectly POSIX-compatible.
 >
 > I think this behavior is "undefined" in POSIX,

That's correct.
Which means that FreeBSD's tr(1) is POSIX-compatible.
And any script which assumes that "tr a-z A-Z" works in
any locale is _not_ POSIX-compatible.

Specifically, SUSv3 (a.k.a. POSIX-2001) says:

   LC_COLLATE
      Determine the locale for the behavior of range
      expressions and equivalence classes.

And it also specifically mentions the following as an
example that must be used for case conversions:

   tr -s '[:upper:]' '[:lower:]'

 > It's not only upper-lowercase conversion that is weird.
 > Try "echo wxyz | tr w-z a-d".  Ranges are broken generally
 > in ISO-locales, in my opinion.

Ranges are not broken; they just work as defined by the
locale.  It's an error to assume that "a-d" always means
the four letters a, b, c, d.  That's only true in the
US-ASCII locale (a.k.a. "C" or POSIX locale).

When you're browsing in an index of German words, you _do_
want them to be ordered correctly, don't you?  That is,
you expect words starting with a-umlaut ("ä") to be
ordered along with "a", not after "z" or anywhere else.
Therefore, the collation definitions are correct, not
broken.

 > > By the way:  Do not set LANG or LC_ALL, especially for
 > > the root user, and especially when compiling things.
 >
 > One thing I like about FreeBSD is that I have my German
 > environment.

What do you mean by "German environment"?  I also have
a German environment, but I only set LC_CTYPE, not LC_ALL,
LANG or LC_COLLATE.

 > But you are right.  The only locale that is
 > expected to work correctly is "C".

I think that all locales work correctly, as far as I can
tell.  At least the German ones that I use work correctly.
The only problem is that script authors who use tr(1)
make invalid assumptions about the behaviour of ranges.

 > How many times did you use tr(1) to convert your texts
 > to upper/lower case?  Do you expect that it works correctly?

I don't have LC_COLLATE set (or LANG or LC_ALL), so I expect
that "tr a-z A-Z" works in the usual way when used for
English texts.
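
The difference between character classes and ranges can be sketched
in plain sh.  This is only an illustration, not taken from any script
in this thread; the function names to_upper_class and to_upper_range
are made up for the example.  The range variant pins the locale to
"C" so that a-z means exactly the 26 ASCII letters.

```shell
#!/bin/sh
# Hypothetical helper names; they are not part of tr(1) or any standard.

# Case conversion with character classes.  This form follows the
# current locale instead of assuming a particular collation order.
to_upper_class() {
    tr '[:lower:]' '[:upper:]'
}

# Range-based conversion, forced into the "C" locale for this one
# command so that the range a-z cannot be reordered by LC_COLLATE.
to_upper_range() {
    LC_ALL=C tr 'a-z' 'A-Z'
}

printf 'hello world\n' | to_upper_class      # prints HELLO WORLD
printf 'wxyz\n' | LC_ALL=C tr 'w-z' 'a-d'    # prints abcd
```

The second variant is only suitable for plain English (ASCII) input.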
I never need to convert German texts from lower case to
upper case.  But if I had to do that, the following way
that you mentioned would work fine for me, too (except
that I would have to convert sharp-s ("ß") to "SS" manually):

 > I would prefer to use it like: "tr a-zäöü A-ZÄÖÜ",

When writing scripts, I either use the correct tr syntax
with [:lower:] [:upper:], or -- if I know that locale
support is not required -- put "unset LC_ALL LC_COLLATE LANG"
at the beginning.

Note that tr(1) is not appropriate for performing non-English
case conversions in general.  For example, it never handles
the German sharp-s ("ß") correctly, no matter how you set
your locale, and no matter what syntax you use with tr.
This is a limitation which cannot be easily solved,
unfortunately.  And German is easy ...  There are languages
with more complicated rules.  For example, in Turkish, the
letter "I" is not the upper-case of "i".

 > For people who are interested in a simple workaround.
 > Don't use de_DE.ISO8859-1(5). Instead use de_DE.UTF-8.
 > tr(1)'s ranges work like expected there.

tr's ranges _always_ work as expected, given how locales
work (especially LC_COLLATE).  Using UTF-8 encoding doesn't
guarantee that 'a-z' works for case conversions either.
The _only_ reliable way is to use character classes, as
mentioned several times.

Best regards
   Oliver

-- 
Oliver Fromme,  secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Services with a focus on FreeBSD: http://www.secnetix.de/bsd
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'
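
The second option mentioned above for scripts (dropping the locale
variables at the start) together with the manual sharp-s conversion
might look roughly like the following sketch.  It assumes UTF-8
input, where sharp-s is the byte pair 0xC3 0x9F; the helper name
upper_ascii_ss is hypothetical.  Umlauts are deliberately left alone,
so this is not a general German case converter.

```shell
#!/bin/sh
# A script that does not need locale support drops the locale
# variables first, so that ranges such as a-z mean plain ASCII.
unset LC_ALL LC_COLLATE LANG

# Assumption: input is UTF-8, where sharp-s is the bytes 0xC3 0x9F.
# In ISO8859-1 it would be the single byte 0xDF (octal 337) instead.
sharp_s=$(printf '\303\237')

# Hypothetical helper: expand sharp-s to "SS" with sed, then
# upper-case the ASCII letters with tr.  tr alone cannot do the
# first step because it maps one character to exactly one character.
upper_ascii_ss() {
    sed "s/${sharp_s}/SS/g" | tr 'a-z' 'A-Z'
}

printf 'stra\303\237e\n' | upper_ascii_ss    # prints STRASSE
```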