From owner-freebsd-stable@FreeBSD.ORG  Tue Feb  7 15:48:07 2006
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
X-Original-To: freebsd-stable@FreeBSD.ORG
Delivered-To: freebsd-stable@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id C8B4D16A420
	for <freebsd-stable@FreeBSD.ORG>; Tue,  7 Feb 2006 15:48:07 +0000 (GMT)
	(envelope-from olli@lurza.secnetix.de)
Received: from lurza.secnetix.de (lurza.secnetix.de [83.120.8.8])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 3F32843D45
	for <freebsd-stable@FreeBSD.ORG>; Tue,  7 Feb 2006 15:48:06 +0000 (GMT)
	(envelope-from olli@lurza.secnetix.de)
Received: from lurza.secnetix.de (ybkven@localhost [127.0.0.1])
	by lurza.secnetix.de (8.13.4/8.13.4) with ESMTP id k17Fm0uZ017167
	for <freebsd-stable@FreeBSD.ORG>; Tue, 7 Feb 2006 16:48:05 +0100 (CET)
	(envelope-from oliver.fromme@secnetix.de)
Received: (from olli@localhost)
	by lurza.secnetix.de (8.13.4/8.13.1/Submit) id k17FlxwV017166;
	Tue, 7 Feb 2006 16:47:59 +0100 (CET) (envelope-from olli)
Date: Tue, 7 Feb 2006 16:47:59 +0100 (CET)
Message-Id: <200602071547.k17FlxwV017166@lurza.secnetix.de>
From: Oliver Fromme <olli@lurza.secnetix.de>
To: freebsd-stable@FreeBSD.ORG
In-Reply-To: <43E7FDAA.3010409@gmx.de>
X-Newsgroups: list.freebsd-stable
User-Agent: tin/1.8.0-20051224 ("Ronay") (UNIX) (FreeBSD/4.11-STABLE (i386))
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Subject: Re: tr(1) buggy with de_DE.ISO8859-1(5) locale?

Martin Krzysiak <cinek@gmx.de> wrote:
 > Oliver Fromme wrote:
 > > It's not a bug.  It's perfectly POSIX-compatible.
 > 
 > I think this behavior is "undefined" in POSIX,

That's correct.  Which means that FreeBSD's tr(1) is
POSIX-compatible.  And any script which assumes that
"tr a-z A-Z" works in any locale is _not_ POSIX-
compatible.

Specifically, SUSv3 (a.k.a. POSIX-2001) says:

    LC_COLLATE
        Determine the locale for the behavior of range
        expressions and equivalence classes.

And it specifically gives the following example as the
way to do case conversions:

    tr -s '[:upper:]' '[:lower:]'
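
A minimal sketch of the same idea in the other direction (lower
to upper case).  The helper name to_upper is only for illustration
and is not taken from the standard:

```shell
#!/bin/sh
# Hypothetical helper: convert stdin to upper case using character
# classes instead of ranges, so the result does not depend on how
# LC_COLLATE orders the letters.
to_upper() {
    tr '[:lower:]' '[:upper:]'
}

result=$(printf 'hello world\n' | to_upper)
echo "$result"
```

For plain ASCII input such as the above, this should print
"HELLO WORLD" regardless of the collation settings.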

 > It's not only upper-lowercase conversion that is weird.
 > Try "echo wxyz | tr w-z a-d". Ranges are broken generally
 > in ISO-locales, in my opinion.

Ranges are not broken, they just work as defined by the
locale.  It's an error to assume that "a-d" always means
the four letters a, b, c, d.  That's only true in the
US-ASCII locale (a.k.a. "C" or POSIX locale).
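
To illustrate, here is a small sketch that forces the "C" locale
for a single command only.  The variable name mapped is just a
placeholder:

```shell
#!/bin/sh
# LC_ALL=C applies to this one tr invocation only; in the "C" locale,
# ranges follow plain ASCII code order, so w-z is exactly w, x, y, z.
mapped=$(echo wxyz | LC_ALL=C tr w-z a-d)
echo "$mapped"
```

With the locale pinned like this, the output is "abcd".  What the
same command prints under another LC_COLLATE depends on that
locale's collation definition.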

When you're browsing in an index of German words, you
_do_ want them to be ordered correctly, don't you?
That is, you expect words starting with a-umlaut ("ä")
to be ordered along with "a", not after "z" or anywhere
else.  Therefore, the collation definitions are correct,
not broken.
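
The effect of collation is easy to see with sort(1).  The
following sketch only shows the "C" locale side, because that is
the one locale that is available everywhere; the word list is
made up:

```shell
#!/bin/sh
# In the "C" locale, sort(1) compares raw byte values, so every
# upper-case ASCII letter sorts before every lower-case one.
first=$(printf 'apple\nZebra\nbanana\n' | LC_ALL=C sort | head -n 1)
echo "$first"
```

This prints "Zebra", which is hardly the order you would want in
a dictionary.  A language-specific LC_COLLATE (if installed) is
there to give the order a human reader expects.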

 > > By the way:  Do not set LANG or LC_ALL, especially for
 > > the root user, and especially when compiling things.
 > 
 > One thing I like about FreeBSD is that I have my German
 > environment.

What do you mean by "German environment"?  I also have a
German environment, but I only set LC_CTYPE, not LC_ALL,
LANG or LC_COLLATE.

 > But you are right. The only locale that is
 > expected to work correctly is "C".

As far as I can tell, all locales work correctly.  At least
the German ones that I use do.

The only problem is that script authors who use tr(1)
make invalid assumptions about the behaviour of ranges.

 > How many times did you use tr(1) to convert your texts
 > to upper/lower case? Do you expect that it works correctly?

I don't have LC_COLLATE set (or LANG or LC_ALL), so I
expect that "tr a-z A-Z" works in the usual way when
used for English texts.

I never need to convert German texts from lower case to
upper case.  But if I had to do that, the following way
that you mentioned would work fine for me, too (except
that I have to convert sharp-s ("ß") to "SS" manually):

 > I would prefer to use it like: "tr a-zäöü A-ZÄÖÜ",

When writing scripts, I either use the correct tr syntax
with [:lower:] [:upper:], or -- if I know that locale
support is not required -- put "unset LC_ALL LC_COLLATE
LANG" at the beginning.
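
A sketch of what the second variant might look like at the top
of a script (the input string and the variable name upper are
only examples):

```shell
#!/bin/sh
# Drop the locale settings that influence collation, so that the
# range expressions below behave as in the "C" locale.
# LC_CTYPE is deliberately left untouched.
unset LC_ALL LC_COLLATE LANG

upper=$(printf 'hello\n' | tr a-z A-Z)
echo "$upper"
```

This only makes sense for scripts that handle plain ASCII data;
it prints "HELLO" here.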

Note that tr(1) is not appropriate to perform non-English
case conversions in general.  For example, it never
handles the German sharp-s ("ß") correctly, no matter how
you set your locale, and no matter what syntax you use
with tr.  This is a limitation which cannot be easily
solved, unfortunately.  And German is easy ...  There are
languages with more complicated rules.  For example, in
Turkish, the letter "I" is not the upper-case of "i".

 > For people who are interested in a simple workaround.
 > Don't use de_DE.ISO8859-1(5). Instead use de_DE.UTF-8.
 > tr(1)'s ranges work as expected there.

tr's ranges _always_ work as expected, given how locales
work (especially LC_COLLATE).  Using UTF-8 encoding
doesn't guarantee that 'a-z' works for case conversions
either.  The _only_ reliable way is to use character
classes, as mentioned several times.

Best regards
   Oliver

-- 
Oliver Fromme,  secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Dienstleistungen mit Schwerpunkt FreeBSD: http://www.secnetix.de/bsd
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'