Date:      Wed,  1 Dec 2004 14:40:14 +0100
From:      Alexander Leidinger <Alexander@Leidinger.net>
To:        current@freebsd.org
Cc:        tode@bpanet.de
Subject:   Bug in our ru_RU.KOI8-R locale (with patch)?
Message-ID:  <1101908414.41adc9be50c73@netchild.homeip.net>


[-- Attachment #1 --]
Hi,

I got a report that our ru_RU.KOI8-R locale seems to be broken. Attached
are a test program (test.pl, tested with perl 5.8.2) and some test input
(test.txt) which are supposed to show the problem. I can't read any
language written in Cyrillic, so I can't confirm whether the attached
patch is the right fix.

If you run the test program you should see something like this:
---snip---
Match small (RegEx with i flag): 0
Match small (RegEx without i flag): 8
Match for normal (RegEx with i flag): 17
Match for normal (RegEx without i flag): 9

Case - Check for 'пушкин'
lc() => пушкин
uc() => ПУШКИН
lcfirst() => пушкин
ucfirst() => Пушкин

Case - Check for 'Пушкин'
lc() => пушкин
uc() => ПУШКИН
lcfirst() => пушкин
ucfirst() => Пушкин
---snip---

I'm told the "Case - Check" parts are correct with the patch, but not
without it (lc() -> lower case the entire string; uc() -> upper case the
entire string; lcfirst() -> lower case the first character; ...). Can
someone please confirm this?
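
One way to cross-check the byte ranges in the patch without relying on
the locale database is to ask an independent conversion table where the
letters live. The following is a minimal sketch, assuming the Encode
module that ships with perl 5.8 and its 'koi8-r' and 'cp1251' tables;
hex_bytes is a helper name made up for this illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                 # the Cyrillic literals below are UTF-8 source text
use Encode qw(encode);

# hex_bytes is a hypothetical helper: it returns the bytes of $text in
# $encoding as a string such as "d0 d5 db".
sub hex_bytes {
    my ($encoding, $text) = @_;
    my $bytes = encode($encoding, $text);
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

my @samples = (
    [ 'lower word', 'пушкин' ],
    [ 'upper word', 'ПУШКИН' ],
    [ 'lower yo',   'ё' ],
    [ 'upper yo',   'Ё' ],
);

for my $encoding ('koi8-r', 'cp1251') {
    for my $sample (@samples) {
        my ($label, $text) = @{$sample};
        printf "%-7s %-10s %s\n", $encoding, $label, hex_bytes($encoding, $text);
    }
}
# koi8-r  lower word d0 d5 db cb c9 ce
# koi8-r  upper word f0 f5 fb eb e9 ee
# koi8-r  lower yo   a3
# koi8-r  upper yo   b3
# cp1251  lower word ef f3 f8 ea e8 ed
# cp1251  upper word cf d3 d8 ca c8 cd
# cp1251  lower yo   b8
# cp1251  upper yo   a8
```

In these tables KOI8-R has the lower case letters in 0xc0 - 0xdf and the
upper case ones in 0xe0 - 0xff, while CP1251 has the two ranges the other
way round. It may therefore be worth checking which encoding test.txt
was really saved in before judging the patch.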

If this is correct we've solved only a part of the problem. The other
part seems to be related to LC_COLLATE. "Match small" with the i flag
(case insensitive matching) shouldn't print 0 when "Match normal" with
the i flag doesn't print 0. Any ideas how to solve this?
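
One possible explanation that does not involve LC_COLLATE at all is the
g flag in the test program: m//g in scalar context remembers pos() on the
matched scalar, and the next m//g on the same scalar resumes from there.
That would be consistent with the counts above (17 and 0 with the i flag,
9 and 8 without it). The following is a minimal sketch of that effect
with plain ASCII, so no locale is needed; the sub names are made up for
this illustration, and it is offered as a hypothesis, not as a confirmed
diagnosis.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Two successive m//g matches in scalar context on the same scalar,
# the way the test program applies them to each line.
sub matches_with_g {
    my ($line) = @_;
    my $first  = ($line =~ m/test/ig) ? 1 : 0;  # leaves pos($line) after the match
    my $second = ($line =~ m/TEST/ig) ? 1 : 0;  # resumes at pos($line), finds nothing
    return ($first, $second);
}

# The same two patterns without /g: every match starts at offset 0.
sub matches_without_g {
    my ($line) = @_;
    my $first  = ($line =~ m/test/i) ? 1 : 0;
    my $second = ($line =~ m/TEST/i) ? 1 : 0;
    return ($first, $second);
}

printf "with /g:    %d %d\n", matches_with_g('Test');     # with /g:    1 0
printf "without /g: %d %d\n", matches_without_g('Test');  # without /g: 1 1
```

If this is what happens, dropping the g flag from the four matches in
test.pl should make the "Match small" and "Match normal" counts with the
i flag agree.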

If the patch isn't correct we still have a bug somewhere (please CC
perl@freebsd.org then). Why isn't perl able to do a case insensitive
match in the ru_RU.KOI8-R locale?

BTW: this affects 4.x (where the problem was noticed), 5.x and -current
(where I've tested the patch).

Bye,
Alexander.

-- 
http://www.Leidinger.net/     Alexander @ Leidinger.net: PGP ID = B0063FE7
http://www.FreeBSD.org/        netchild @ FreeBSD.org  : PGP ID = 72077137

[-- Attachment #2 --]
#!/usr/bin/env perl

use strict;
use warnings;
use locale;

my $file		= 'test.txt';
my $pushkin_small	= 'пушкин';
my $pushkin_normal	= 'Пушкин';

my $data		= LoadFile($file);

my $count_normal_i	= 0;
my $count_small_i	= 0;
my $count_normal	= 0;
my $count_small		= 0;

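# Note: m//g in scalar context stores pos($line) after a successful match,
# and the next m//g on the same $line resumes from that position.
foreach my $line (@{$data}) {
	$count_normal_i++ if ($line =~ m/$pushkin_normal/isg);
	$count_small_i++ if ($line =~ m/$pushkin_small/isg);
	$count_normal++ if ($line =~ m/$pushkin_normal/sg);
	$count_small++ if ($line =~ m/$pushkin_small/sg);
}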

print "Match small (RegEx with i flag): $count_small_i\n";
print "Match small (RegEx without i flag): $count_small\n";

print "Match for normal (RegEx with i flag): $count_normal_i\n";
print "Match for normal (RegEx without i flag): $count_normal\n\n";
TestCase($pushkin_small);
TestCase($pushkin_normal);

exit(0);


sub TestCase {
	my $string	= shift(@_);
	print "Case - Check for \'$string\'\n";
	print "lc() => ".lc($string)."\n";
	print "uc() => ".uc($string)."\n";
	print "lcfirst() => ".lcfirst($string)."\n";
	print "ucfirst() => ".ucfirst($string)."\n";
	
	print "\n";

	return 1;
}


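sub LoadFile {
	my $file	= shift(@_);
	# Three-argument open with a lexical handle; fail loudly if the
	# input file is missing instead of silently returning no lines.
	open(my $fh, '<', $file) or die "Cannot open $file: $!\n";
	my @value	= <$fh>;
	close($fh);
	chomp(@value);
	return \@value;
}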


[-- Attachment #3 --]
пушкин
Пушкин
Test
Test
TEST
tEST
пушкин
Пушкин
Test
Test
TEST
tEST
пушкин
пушкин
пушкин
пушкин
Пушкин
Пушкин
Пушкин
Пушкин
Пушкин
пушкин
Пушкин
Пушкин
пушкин

COUNT lower 8 upper 9


[-- Attachment #4 --]
--- /usr/src/share/mklocale/ru_RU.KOI8-R.src	Fri Nov 30 06:05:53 2001
+++ ru_RU.KOI8-R.src	Wed Dec  1 13:38:59 2004
@@ -13,27 +13,27 @@
 CONTROL		0x00 - 0x1f 0x7f
 DIGIT		'0' - '9'
 GRAPH		0x21 - 0x7e 0x80 - 0x99	0x9b - 0xff
-LOWER		'a' - 'z' 0xa3 0xc0 - 0xdf
+LOWER		'a' - 'z' 0xb3 0xe0 - 0xff
 PUNCT		0x21 - 0x2f 0x3a - 0x40 0x5b - 0x60 0x7b - 0x7e
 SPACE		0x09 - 0x0d 0x20 0x9a
-UPPER		'A' - 'Z' 0xb3 0xe0 - 0xff
+UPPER		'A' - 'Z' 0xa3 0xc0 - 0xdf
 XDIGIT          '0' - '9' 'a' - 'f' 'A' - 'F'
 BLANK		' ' '\t' 0x9a
 PRINT		0x20 - 0x7e 0x80 - 0xff
 
 MAPLOWER       	<'A' - 'Z' : 'a'>
 MAPLOWER       	<'a' - 'z' : 'a'>
-MAPLOWER	<0xb3  0xa3>
-MAPLOWER        <0xa3  0xa3>
-MAPLOWER	<0xe0 - 0xff : 0xc0>
-MAPLOWER	<0xc0 - 0xdf : 0xc0>
+MAPLOWER	<0xb3  0xb3>
+MAPLOWER        <0xa3  0xb3>
+MAPLOWER	<0xe0 - 0xff : 0xe0>
+MAPLOWER	<0xc0 - 0xdf : 0xe0>
 
 MAPUPPER       	<'A' - 'Z' : 'A'>
 MAPUPPER       	<'a' - 'z' : 'A'>
-MAPUPPER        <0xb3  0xb3>
-MAPUPPER	<0xa3  0xb3>
-MAPUPPER	<0xe0 - 0xff : 0xe0>
-MAPUPPER	<0xc0 - 0xdf : 0xe0>
+MAPUPPER        <0xb3  0xa3>
+MAPUPPER	<0xa3  0xa3>
+MAPUPPER	<0xe0 - 0xff : 0xc0>
+MAPUPPER	<0xc0 - 0xdf : 0xc0>
 
 TODIGIT       	<'0' - '9' : 0>
 TODIGIT       	<'A' - 'F' : 10>
