From owner-freebsd-bugs@FreeBSD.ORG Sun Sep 16 16:40:07 2007
Date: Sun, 16 Sep 2007 16:40:07 GMT
Message-Id: <200709161640.l8GGe7iQ077745@freefall.freebsd.org>
To: freebsd-bugs@FreeBSD.org
From: Andrey Chernov
Subject: Re: gnu/116363: isspace broken for UTF-8 locales

The following reply was made to PR gnu/116363; it has been noted by GNATS.

From: Andrey Chernov
To: Hye-Shik Chang
Cc: Petr Hroudny, freebsd-gnats-submit@FreeBSD.org, jkoshy@FreeBSD.org,
    i18n@FreeBSD.org
Subject: Re: gnu/116363: isspace broken for UTF-8 locales
Date: Sun, 16 Sep 2007 20:34:07 +0400

On Mon, Sep 17, 2007 at 01:22:14AM +0900, Hye-Shik Chang wrote:
> In fact, UTF-8.src defines values for not UTF-8 but Unicode codepoints.
> Using the Unicode codepoint as wchar_t's internal representation gives
> much benefit. I think we would be better to make isspace() and
> other ctypes functions aware of "encoding". IIRC, tjr@ provided the
> workaround as in the URL mentioned above and said that it would get
> a chance to be fixed in 6 or 7 on 2004.

Currently wchar_t represents the given encoding everywhere, including
wc<->mbr conversions. To make it UCS-4-only instead, we would need to
rewrite the whole locale system from scratch, and I see no benefit in
doing that. No simple workaround exists.

In any case, there is no excuse for making what is really a UCS-4.src
mimic UTF-8.src. Providing a proper UTF-8.src is much less painful than
rewriting the whole locale system, and I am almost halfway through
converting the UCS-4 source to it.

--
http://ache.pp.ru/