From owner-freebsd-i18n@FreeBSD.ORG  Sun Sep 16 16:23:49 2007
Date: Mon, 17 Sep 2007 01:22:14 +0900
From: Hye-Shik Chang <perky@FreeBSD.ORG>
To: Andrey Chernov <ache@nagual.pp.ru>, Petr Hroudny <petr.hroudny@gmail.com>, 
	freebsd-gnats-submit@FreeBSD.ORG, jkoshy@FreeBSD.ORG, i18n@FreeBSD.ORG
Message-ID: <20070916162214.GA49139@FreeBSD.org>
References: <200709150908.l8F981jj075109@www.freebsd.org>
	<20070916085432.GA8884@nagual.pp.ru>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20070916085432.GA8884@nagual.pp.ru>
User-Agent: Mutt/1.4.2.3i
X-Accept-Language: ko, en
Cc: 
Subject: Re: gnu/116363: isspace broken for UTF-8 locales

On Sun, Sep 16, 2007 at 12:54:33PM +0400, Andrey Chernov wrote:
> On Sat, Sep 15, 2007 at 09:08:01AM +0000, Petr Hroudny wrote:
> > 
> > >Number:         116363
> > >Category:       gnu
> > >Synopsis:       isspace broken for UTF-8 locales
> > >Confidential:   no
> > >Severity:       non-critical
> > >Priority:       medium
> > >Responsible:    freebsd-bugs
> > >State:          open
> > >Quarter:        
> > >Keywords:       
> > >Date-Required:
> > >Class:          sw-bug
> > >Submitter-Id:   current-users
> > >Arrival-Date:   Sat Sep 15 09:10:02 GMT 2007
> > >Closed-Date:
> > >Last-Modified:
> > >Originator:     Petr Hroudny
> > >Release:        6-stable, 7-current
> > >Organization:
> > >Environment:
> > >Description:
> > In UTF-8 locales, isspace(0xA0) returns 1 which is wrong.
> > 
> > In UTF-8, 0xA0 could only be the second or third byte of multibyte character, but never a space.
> > 
> > As a consequence, operations like str.upper() and/or str.split() are broken, when
> > UTF-8 character with 0xA0 byte is encountered.

If you are talking about Python's str.split(), the problem comes from
our libc bug (or feature), which has been described many times before,
and Python already includes a workaround for it.
http://mail.python.org/pipermail/python-checkins/2004-August/042343.html
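
To illustrate the failure mode from the report, here is a minimal
sketch.  It is not the actual libc or Python code: naive_isspace() and
naive_split() are hypothetical names, and the only assumption taken
from the PR is that the byte 0xA0 is classified as a space.

```python
# Hypothetical model of a byte-wise isspace() that, as reported in the PR,
# answers "space" for the byte 0xA0 in a UTF-8 locale.
ASCII_SPACES = set(b" \t\n\v\f\r")


def naive_isspace(byte):
    """Return True for ASCII whitespace and, wrongly, for byte 0xA0."""
    return byte in ASCII_SPACES or byte == 0xA0


def naive_split(data):
    """Split a byte string wherever naive_isspace() is true."""
    words, current = [], bytearray()
    for byte in data:
        if naive_isspace(byte):
            if current:
                words.append(bytes(current))
                current = bytearray()
        else:
            current.append(byte)
    if current:
        words.append(bytes(current))
    return words


# U+00E0 (a with grave) is encoded in UTF-8 as the two bytes C3 A0.
text = "voil\u00e0 tout".encode("utf-8")
pieces = naive_split(text)
```

Here the first piece ends with a lone C3 lead byte, so it is no longer
valid UTF-8: the character has been cut in half at its 0xA0 byte.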

> It seems that our UTF-8.src is completely wrong, it is just plain Unicode 
> and not UTF-8 which multibyte values should start from
> C2-DF
> E0-EF
> F0-F4
> only (as stated in http://en.wikipedia.org/wiki/UTF-8 f.e.)
> Can anybody write replacement for it?

In fact, UTF-8.src defines values not for UTF-8 byte sequences but for
Unicode code points.  Using the Unicode code point as wchar_t's
internal representation has many benefits.  I think it would be better
to make isspace() and the other ctype functions aware of the
"encoding".  IIRC, tjr@ provided the workaround at the URL mentioned
above and said in 2004 that it would have a chance of being fixed in
6 or 7.
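
A rough sketch of what "encoding-aware" could mean, under my own
assumptions: the function names are hypothetical, and Python's
str.isspace() merely stands in for a wide-character test such as
iswspace().  The idea is to keep the code point table as it is, but to
let the byte-level test accept only ASCII bytes when the locale's
encoding is UTF-8.

```python
ASCII_SPACES = set(b" \t\n\v\f\r")


def utf8_byte_role(byte):
    """Classify a single byte by the role it can play in UTF-8."""
    if byte < 0x80:
        return "ascii"
    if byte <= 0xBF:
        return "continuation"   # 80..BF: never the first byte
    if 0xC2 <= byte <= 0xDF:
        return "lead2"          # starts a 2-byte sequence
    if 0xE0 <= byte <= 0xEF:
        return "lead3"          # starts a 3-byte sequence
    if 0xF0 <= byte <= 0xF4:
        return "lead4"          # starts a 4-byte sequence
    return "invalid"            # C0, C1, F5..FF never occur in UTF-8


def isspace_byte_utf8(byte):
    """Byte-level test: in UTF-8 only ASCII bytes can be a space."""
    return byte in ASCII_SPACES


def isspace_codepoint(cp):
    """Code point test, standing in for a wide-character classifier."""
    return chr(cp).isspace()
```

With this split, the code point U+00A0 (no-break space, encoded as
C2 A0) can still be classified as a space at the wide-character level,
while the byte 0xA0 on its own is only a continuation byte and is
never reported as a space.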

Hye-Shik