Date: Tue, 7 Feb 2006 16:47:59 +0100 (CET)
From: Oliver Fromme <olli@lurza.secnetix.de>
To: freebsd-stable@FreeBSD.ORG
Subject: Re: tr(1) buggy with de_DE.ISO8859-1(5) locale?
Message-ID: <200602071547.k17FlxwV017166@lurza.secnetix.de>
In-Reply-To: <43E7FDAA.3010409@gmx.de>

Martin Krzysiak <cinek@gmx.de> wrote:
> Oliver Fromme wrote:
> > It's not a bug.  It's perfectly POSIX-compatible.
>
> I think this behavior is "undefined" in POSIX,

That's correct.  Which means that FreeBSD's tr(1) is
POSIX-compatible.  And any script which assumes that
"tr a-z A-Z" works in any locale is _not_ POSIX-compatible.

Specifically, SUSv3 (a.k.a. POSIX-2001) says:

   LC_COLLATE
      Determine the locale for the behavior of range
      expressions and equivalence classes.

And it also specifically mentions the following as an
example that must be used for case conversions:

   tr -s '[:upper:]' '[:lower:]'

> It's not only upper-lowercase conversion that is weird.
> Try "echo wxyz | tr w-z a-d". Ranges are broken generally
> in ISO-locales, in my opinion.

Ranges are not broken, they just work as defined by the
locale.  It's an error to assume that "a-d" always means
the four letters a, b, c, d.  That's only true in the
US-ASCII locale (a.k.a. "C" or POSIX locale).

When you're browsing in an index of German words, you _do_
want them to be ordered correctly, don't you?  That is,
you expect words starting with a-umlaut ("ä") to be ordered
along with "a", not after "z" or anywhere else.  Therefore,
the collation definitions are correct, not broken.

> > By the way:  Do not set LANG or LC_ALL, especially for
> > the root user, and especially when compiling things.
>
> One thing I like about FreeBSD is that I have my German
> environment.

What do you mean by "German environment"?  I also have a
German environment, but I only set LC_CTYPE, not LC_ALL,
LANG or LC_COLLATE.

> But you are right. The only locale that is
> expected to work correctly is "C".

I think that all locales work correctly, as far as I can
tell.  At least the German ones that I use work correctly.
The only problem is that script authors who use tr(1) make
invalid assumptions about the behaviour of ranges.

> How many times did you use tr(1) to convert your texts
> to upper/lower case? Do you expect that it works correctly?

I don't have LC_COLLATE set (or LANG or LC_ALL), so I
expect that "tr a-z A-Z" works in the usual way when used
for English texts.

I never need to convert German texts from lower case to
upper case.  But if I had to do that, the following way
that you mentioned would work fine for me, too (except
that I have to convert sharp-s ("ß") to "SS" manually):

> I would prefer to use it like: "tr a-zäöü A-ZÄÖÜ",

When writing scripts, I either use the correct tr syntax
with [:lower:] [:upper:], or, if I know that locale support
is not required, put "unset LC_ALL LC_COLLATE LANG" at the
beginning.

Note that tr(1) is not appropriate for non-English case
conversions in general.  For example, it never handles the
German sharp-s ("ß") correctly, no matter how you set your
locale, and no matter what syntax you use with tr.  This
is a limitation which cannot be easily solved,
unfortunately.

And German is easy ...  There are languages with more
complicated rules.  For example, in Turkish, the letter
"I" is not the upper-case of "i".

> For people who are interested in a simple workaround.
> Don't use de_DE.ISO8859-1(5). Instead use de_DE.UTF-8.
> tr(1)'s ranges work like expected there.

tr's ranges _always_ work as expected, given how locales
work (especially LC_COLLATE).  Using UTF-8 encoding doesn't
guarantee that 'a-z' works for case conversions either.

The _only_ reliable way is to use character classes, as
mentioned several times.

Best regards
   Oliver

-- 
Oliver Fromme,  secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Services with a focus on FreeBSD: http://www.secnetix.de/bsd
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'
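
P.S.  The two scripting approaches mentioned above could look roughly
like this.  It is only a minimal sketch: the function names to_upper
and to_upper_ascii are made up for illustration, and the calls at the
end use plain ASCII input only, so they say nothing about sharp-s or
the Turkish "I".

```sh
#!/bin/sh

# Approach 1: character classes.  The set of lower-case and upper-case
# letters is taken from the current locale, so no assumption is made
# about what a range such as a-z contains.
to_upper() {
    printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]'
}

# Approach 2: ranges, but only after removing the variables that could
# select a non-default collation.  The body runs in a subshell, so the
# caller's environment is left untouched.
# (Assumption: locale support is not needed for this input.)
to_upper_ascii() (
    unset LC_ALL LC_COLLATE LANG
    printf '%s\n' "$1" | tr 'a-z' 'A-Z'
)

to_upper "hello world"
to_upper_ascii "hello world"
```

For ASCII-only input both calls should print the same upper-case
text.  The difference only shows up once a locale with a non-ASCII
collation order is active, which is exactly the case in which the
first form is the safe one.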