Date:      Tue, 7 Jun 2011 00:41:05 +0200
From:      Jilles Tjoelker <jilles@stack.nl>
To:        freebsd-hackers@freebsd.org, freebsd-i18n@freebsd.org
Subject:   tr A-Z a-z in locales other than C
Message-ID:  <20110606224105.GA92410@stack.nl>

A few years ago, when locale support was added to the tr utility,
character ranges (except ones containing one or two octal escapes) were
changed to use the collation order instead of the character code order.
At the time, this matched other implementations of tr and seemed to be
generally accepted.

However, this behaviour is not intuitive, it is not portable because it
depends heavily on the collation order, and it is very hard to find a
practical use for it. Perhaps there is a use case in EBCDIC locales that
only contain the 2*26 basic Latin letters, but that is rather exotic.

The command tr A-Z a-z may do something unexpected even if there is a
1:1 mapping between upper and lower case, since it also assumes that 'z'
is the last letter.

This is not a POSIX issue as POSIX leaves character ranges in tr
unspecified for locales other than the POSIX locale (except for ranges
containing octal escapes).

If there is no reason to keep using the collation order, I would like to
change tr's character ranges back to character codes. GNU tr does this
and many ports wrongly take advantage of it, so following it will reduce
the need to patch ports.
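
As a rough illustration of what "character codes" means for a pair of
ranges, here is a minimal sketch. It is not taken from tr's source;
translate_by_code() is a hypothetical helper, and the A-Z/a-z usage
assumes an ASCII-compatible encoding.

```c
#include <wchar.h>

/*
 * Hypothetical helper, not part of tr(1): translate wc the way
 * "tr from_lo-from_hi to_lo-..." would if both ranges are expanded
 * purely by character code. The N-th code in the source range maps
 * to the N-th code in the target range; collation is never consulted.
 */
wchar_t
translate_by_code(wchar_t wc, wchar_t from_lo, wchar_t from_hi,
    wchar_t to_lo)
{
	if (from_hi < from_lo)
		return (wc);	/* reversed range: leave input alone */
	if (wc < from_lo || wc > from_hi)
		return (wc);	/* outside the source range */
	return (to_lo + (wc - from_lo));
}
```

With this definition, translate_by_code(L'Q', L'A', L'Z', L'a') gives
L'q' in any locale, because only the numeric codes are compared.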

The patch below demonstrates the new behaviour. The code could be
simplified further as the flags for octal escapes are no longer needed.

The man page may need further changes as well. In particular, the
command
  tr "[:upper:]" "[:lower:]"
in a user's locale is a good choice for text specified by the user, but
a poor choice for doing case-insensitive comparisons of constant
strings. In Turkish locales, the upper case version of 'i' is a capital
I with dot and the lower case version of 'I' is a lower case i without
dot. In such cases,
  LC_ALL=C tr "[:upper:]" "[:lower:]"
may be a better option (A-Z a-z could be used at the cost of breaking
EBCDIC support).
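
The same concern applies when comparing constant strings from C code.
The following is only a sketch under the assumption of an
ASCII-compatible encoding (so, like A-Z a-z, it would break on EBCDIC);
ascii_tolower() and ascii_strcasecmp() are hypothetical names, not
existing library functions.

```c
/*
 * Hypothetical helper: fold only the 26 ASCII upper case letters.
 * Unlike tolower(3), the result does not depend on LC_CTYPE, so 'I'
 * always becomes 'i', even in a Turkish locale.
 */
int
ascii_tolower(int c)
{
	return ((c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c);
}

/*
 * Hypothetical helper: compare two strings, ignoring the case of
 * ASCII letters only. Returns 0 if equal, otherwise the difference
 * of the first pair of folded bytes that differ.
 */
int
ascii_strcasecmp(const char *a, const char *b)
{
	const unsigned char *ua = (const unsigned char *)a;
	const unsigned char *ub = (const unsigned char *)b;

	while (*ua != '\0' && ascii_tolower(*ua) == ascii_tolower(*ub)) {
		ua++;
		ub++;
	}
	return (ascii_tolower(*ua) - ascii_tolower(*ub));
}
```

For example, ascii_strcasecmp("TITLE", "title") returns 0 regardless of
the current locale.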

There is a related issue with ranges in regular expressions, glob and
fnmatch (likewise unspecified by POSIX outside the POSIX locale), but
this is less likely to cause problems.


Index: usr.bin/tr/tr.1
===================================================================
--- usr.bin/tr/tr.1	(revision 222648)
+++ usr.bin/tr/tr.1	(working copy)
@@ -31,7 +31,7 @@
 .\"     @(#)tr.1	8.1 (Berkeley) 6/6/93
 .\" $FreeBSD$
 .\"
-.Dd October 13, 2006
+.Dd June 6, 2011
 .Dt TR 1
 .Os
 .Sh NAME
@@ -158,12 +158,7 @@
 .Pp
 A backslash followed by any other character maps to that character.
 .It c-c
-For non-octal range endpoints
-represents the range of characters between the range endpoints, inclusive,
-in ascending order,
-as defined by the collation sequence.
-If either or both of the range endpoints are octal sequences, it
-represents the range of specific coded values between the
+A range represents the range of specific coded values between the
 range endpoints, inclusive.
 .Pp
 .Bf Em
@@ -309,20 +304,18 @@
 .Pp
 .Dl "tr \*q[=e=]\*q \*qe\*q"
 .Sh COMPATIBILITY
-Previous
-.Fx
-implementations of
-.Nm
-did not order characters in range expressions according to the current
-locale's collation order, making it possible to convert unaccented Latin
+Some implementations of
+.Nm ,
+including the ones in previous versions of
+.Fx ,
+order characters in range expressions according to the current
+locale's collation order, making it impossible to convert unaccented Latin
 characters (esp.\& as found in English text) from upper to lower case using
 the traditional
 .Ux
 idiom of
 .Dq Li "tr A-Z a-z" .
-Since
-.Nm
-now obeys the locale's collation order, this idiom may not produce
+In such implementations, this idiom may not produce
 correct results when there is not a 1:1 mapping between lower and
 upper case, or when the order of characters within the two cases differs.
 As noted in the
Index: usr.bin/tr/str.c
===================================================================
--- usr.bin/tr/str.c	(revision 222648)
+++ usr.bin/tr/str.c	(working copy)
@@ -260,37 +260,13 @@
 		stopval = wc;
 		s->str += clen;
 	}
-	/*
-	 * XXX Characters are not ordered according to collating sequence in
-	 * multibyte locales.
-	 */
-	if (octal || was_octal || MB_CUR_MAX > 1) {
-		if (stopval < s->lastch) {
-			s->str = savestart;
-			return (0);
-		}
-		s->cnt = stopval - s->lastch + 1;
-		s->state = RANGE;
-		--s->lastch;
-		return (1);
-	}
-	if (charcoll((const void *)&stopval, (const void *)&(s->lastch)) < 0) {
+	if (stopval < s->lastch) {
 		s->str = savestart;
 		return (0);
 	}
-	if ((s->set = p = malloc((NCHARS_SB + 1) * sizeof(int))) == NULL)
-		err(1, "genrange() malloc");
-	for (cnt = 0; cnt < NCHARS_SB; cnt++)
-		if (charcoll((const void *)&cnt, (const void *)&(s->lastch)) >= 0 &&
-		    charcoll((const void *)&cnt, (const void *)&stopval) <= 0)
-			*p++ = cnt;
-	*p = OOBCH;
-	n = p - s->set;
-
-	s->cnt = 0;
-	s->state = SET;
-	if (n > 1)
-		mergesort(s->set, n, sizeof(*(s->set)), charcoll);
+	s->cnt = stopval - s->lastch + 1;
+	s->state = RANGE;
+	--s->lastch;
 	return (1);
 }
 

-- 
Jilles Tjoelker


