Date: Tue, 7 Jun 2011 00:41:05 +0200
From: Jilles Tjoelker
To: freebsd-hackers@freebsd.org, freebsd-i18n@freebsd.org
Subject: tr A-Z a-z in locales other than C
Message-ID: <20110606224105.GA92410@stack.nl>

A few years ago, when locale support was added to the tr utility,
character ranges (except ones containing one or two octal escapes) were
changed to use the collation order instead of the character code order.
At the time, this matched other implementations of tr and was apparently
somewhat generally accepted.

However, this behaviour is not intuitive, not portable as it deeply
depends on the collation order and it is very hard to find a useful use
for it. Perhaps there is a use case in EBCDIC locales that only contain
the 2*26 basic Latin letters, but that is rather exotic.
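
To make the two interpretations of a range concrete, here is a minimal
sketch. It is not taken from tr; the helper names range_by_code and
in_range_by_collation are made up for this illustration, and it only
handles single-byte characters. The first helper expands a range by
character code, the second asks whether a character falls inside a range
when the endpoints are compared with strcoll(3).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of tr: expand lo-hi by character code.
 * Returns the number of characters stored in out, or 0 if the range is
 * reversed or does not fit in cap.
 */
size_t
range_by_code(unsigned char lo, unsigned char hi, unsigned char *out,
    size_t cap)
{
	size_t i, n;

	assert(out != NULL);
	if (hi < lo)
		return (0);
	n = (size_t)(hi - lo) + 1;
	if (n > cap)
		return (0);
	for (i = 0; i < n; i++)
		out[i] = (unsigned char)(lo + i);
	return (n);
}

/*
 * Hypothetical helper, not part of tr: is c inside lo-hi when the
 * endpoints are compared with strcoll(3)?  The answer depends on the
 * LC_COLLATE setting in effect (after setlocale(3)).
 */
int
in_range_by_collation(unsigned char lo, unsigned char hi, unsigned char c)
{
	char a[2] = { (char)lo, '\0' };
	char b[2] = { (char)hi, '\0' };
	char x[2] = { (char)c, '\0' };

	return (strcoll(a, x) <= 0 && strcoll(x, b) <= 0);
}
```

In the C locale (the default until a program calls setlocale(3)) the two
agree: on an ASCII system range_by_code('A', 'Z', ...) stores the 26
upper case letters and in_range_by_collation('A', 'Z', 'b') is false. In
a locale whose collation order interleaves the two cases, the second
helper may well report 'b' as part of A-Z.
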
The command
  tr A-Z a-z
may do something unexpected even if there is a 1:1 mapping between
upper and lower case, since it also assumes that 'z' is the last letter.

This is not a POSIX issue as POSIX leaves character ranges in tr
unspecified for locales other than the POSIX locale (except for ranges
containing octal escapes).

If there is no reason to keep using the collation order, I would like to
change tr's character ranges back to character codes. GNU tr does this
and many ports wrongly take advantage of it, so following it will reduce
the need to patch ports.

The patch below demonstrates the new behaviour. The code could be
simplified more as the flags for octal escapes are no longer needed.

The man page may need some additional change as well. In particular, the
command
  tr "[:upper:]" "[:lower:]"
in a user's locale is a good choice for text specified by the user, but
a poor choice for doing case-insensitive comparisons of constant
strings, because in Turkish locales the upper case version of 'i' is a
capital I with dot and the lower case version of 'I' is a lower case i
without dot. In such cases,
  LC_ALL=C tr "[:upper:]" "[:lower:]"
may be a better option (A-Z a-z could be used at the cost of breaking
EBCDIC support).

There is a related issue with ranges in regular expressions, glob and
fnmatch (likewise unspecified by POSIX outside the POSIX locale), but
this is less likely to cause problems.

Index: usr.bin/tr/tr.1
===================================================================
--- usr.bin/tr/tr.1	(revision 222648)
+++ usr.bin/tr/tr.1	(working copy)
@@ -31,7 +31,7 @@
 .\" @(#)tr.1 8.1 (Berkeley) 6/6/93
 .\" $FreeBSD$
 .\"
-.Dd October 13, 2006
+.Dd June 6, 2011
 .Dt TR 1
 .Os
 .Sh NAME
@@ -158,12 +158,7 @@
 .Pp
 A backslash followed by any other character maps to that character.
 .It c-c
-For non-octal range endpoints
-represents the range of characters between the range endpoints, inclusive,
-in ascending order,
-as defined by the collation sequence.
-If either or both of the range endpoints are octal sequences, it
-represents the range of specific coded values between the
+A range represents the range of specific coded values between the
 range endpoints, inclusive.
 .Pp
 .Bf Em
@@ -309,20 +304,18 @@
 .Pp
 .Dl "tr \*q[=e=]\*q \*qe\*q"
 .Sh COMPATIBILITY
-Previous
-.Fx
-implementations of
-.Nm
-did not order characters in range expressions according to the current
-locale's collation order, making it possible to convert unaccented Latin
+Some implementations of
+.Nm ,
+including the ones in previous versions of
+.Fx ,
+order characters in range expressions according to the current
+locale's collation order, making it impossible to convert unaccented Latin
 characters (esp.\& as found in English text) from upper to lower case using
 the traditional
 .Ux
 idiom of
 .Dq Li "tr A-Z a-z" .
-Since
-.Nm
-now obeys the locale's collation order, this idiom may not produce
+In such implementations, this idiom may not produce
 correct results when there is not a 1:1 mapping between lower and
 upper case, or when the order of characters within the two cases differs.
 As noted in the
Index: usr.bin/tr/str.c
===================================================================
--- usr.bin/tr/str.c	(revision 222648)
+++ usr.bin/tr/str.c	(working copy)
@@ -260,37 +260,13 @@
 		stopval = wc;
 		s->str += clen;
 	}
-	/*
-	 * XXX Characters are not ordered according to collating sequence in
-	 * multibyte locales.
-	 */
-	if (octal || was_octal || MB_CUR_MAX > 1) {
-		if (stopval < s->lastch) {
-			s->str = savestart;
-			return (0);
-		}
-		s->cnt = stopval - s->lastch + 1;
-		s->state = RANGE;
-		--s->lastch;
-		return (1);
-	}
-	if (charcoll((const void *)&stopval, (const void *)&(s->lastch)) < 0) {
+	if (stopval < s->lastch) {
 		s->str = savestart;
 		return (0);
 	}
-	if ((s->set = p = malloc((NCHARS_SB + 1) * sizeof(int))) == NULL)
-		err(1, "genrange() malloc");
-	for (cnt = 0; cnt < NCHARS_SB; cnt++)
-		if (charcoll((const void *)&cnt, (const void *)&(s->lastch)) >= 0 &&
-		    charcoll((const void *)&cnt, (const void *)&stopval) <= 0)
-			*p++ = cnt;
-	*p = OOBCH;
-	n = p - s->set;
-
-	s->cnt = 0;
-	s->state = SET;
-	if (n > 1)
-		mergesort(s->set, n, sizeof(*(s->set)), charcoll);
+	s->cnt = stopval - s->lastch + 1;
+	s->state = RANGE;
+	--s->lastch;
 	return (1);
 }
 

-- 
Jilles Tjoelker