Date: Sun, 16 Sep 2007 09:10:06 GMT
From: Andrey Chernov <ache@nagual.pp.ru>
To: freebsd-bugs@FreeBSD.org
Subject: Re: gnu/116363: isspace broken for UTF-8 locales
Message-ID: <200709160910.l8G9A6ts050905@freefall.freebsd.org>
The following reply was made to PR gnu/116363; it has been noted by GNATS.
From: Andrey Chernov <ache@nagual.pp.ru>
To: Petr Hroudny <petr.hroudny@gmail.com>
Cc: freebsd-gnats-submit@FreeBSD.ORG, jkoshy@FreeBSD.ORG, perky@FreeBSD.ORG,
i18n@FreeBSD.ORG
Subject: Re: gnu/116363: isspace broken for UTF-8 locales
Date: Sun, 16 Sep 2007 12:54:33 +0400
On Sat, Sep 15, 2007 at 09:08:01AM +0000, Petr Hroudny wrote:
>
> >Number: 116363
> >Category: gnu
> >Synopsis: isspace broken for UTF-8 locales
> >Confidential: no
> >Severity: non-critical
> >Priority: medium
> >Responsible: freebsd-bugs
> >State: open
> >Quarter:
> >Keywords:
> >Date-Required:
> >Class: sw-bug
> >Submitter-Id: current-users
> >Arrival-Date: Sat Sep 15 09:10:02 GMT 2007
> >Closed-Date:
> >Last-Modified:
> >Originator: Petr Hroudny
> >Release: 6-stable, 7-current
> >Organization:
> >Environment:
> >Description:
> In UTF-8 locales, isspace(0xA0) returns 1 which is wrong.
>
> In UTF-8, 0xA0 could only be the second or third byte of multibyte character, but never a space.
>
> As a consequence, operations like str.upper() and/or str.split() are broken, when
> UTF-8 character with 0xA0 byte is encountered.
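It seems that our UTF-8.src is completely wrong: it describes plain Unicode
code points rather than UTF-8, whose multibyte sequences may start only
with a lead byte in one of these ranges:
C2-DF (2-byte sequences)
E0-EF (3-byte sequences)
F0-F4 (4-byte sequences)
(as stated in http://en.wikipedia.org/wiki/UTF-8, for example).
Can anybody write a replacement for it?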
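To illustrate the lead-byte ranges above, here is a minimal sketch in C. Both
function names are hypothetical and are not part of libc or of UTF-8.src; the
sketch only classifies a single byte by the role it can play in well-formed
UTF-8, and assumes that a lone byte can be whitespace only when it is a
complete ASCII character.

```c
/*
 * Hypothetical helper, for illustration only (not a libc function).
 * Returns the sequence length (1-4) if b can start a well-formed UTF-8
 * sequence, 0 if b is a continuation byte (80-BF), and -1 if b never
 * appears in well-formed UTF-8 (C0, C1, F5-FF).
 */
int
utf8_lead_length(unsigned char b)
{
	if (b <= 0x7F)
		return (1);	/* ASCII, a complete character by itself */
	if (b <= 0xBF)
		return (0);	/* continuation byte, never a first byte */
	if (b <= 0xC1)
		return (-1);	/* C0, C1: would only form overlong encodings */
	if (b <= 0xDF)
		return (2);	/* C2-DF */
	if (b <= 0xEF)
		return (3);	/* E0-EF */
	if (b <= 0xF4)
		return (4);	/* F0-F4 */
	return (-1);		/* F5-FF: beyond U+10FFFF */
}

/*
 * Hypothetical byte-level check: only the ASCII whitespace characters
 * (space, \t, \n, \v, \f, \r) qualify. Any byte >= 0x80 is at most a
 * fragment of a multibyte character, so it is rejected.
 */
int
utf8_byte_may_be_space(unsigned char b)
{
	return (utf8_lead_length(b) == 1 &&
	    (b == ' ' || (b >= '\t' && b <= '\r')));
}
```

With this classification, utf8_lead_length(0xA0) returns 0: the byte is a
continuation byte, so a byte-oriented check should not report it as a space.
The code point U+00A0 itself is encoded in UTF-8 as the two bytes C2 A0.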
--
http://ache.pp.ru/
