RE: PATCH: wchar_t is already defined in libstd++

From: "Johny Mattsson (EPA)"
To: "'freebsd-current@freebsd.org'"
Subject: RE: PATCH: wchar_t is already defined in libstd++
Date: Wed, 19 Jun 2002 10:58:58 +1000
Message-ID: <4B6BC00CD15FD2119E5F0008C7A419A514C8BB48@eaubrnt018.epa.ericsson.se>

Hi Terry and all,

I usually just lurk on the list, but since I'm a C++ aficionado, I wanted to question your statement snipped below.

If we settle on wchar_t being 16 bits, then we will still be forced to do UTF-7/8/16 to properly handle an arbitrary Unicode (or ISO/IEC 10646) string, since we must deal with that charming thing known as "surrogate pairs" (see section 3.7 of the Unicode standard v3.0). This again breaks the "one wchar_t == one character" assumption.
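
To make the surrogate-pair point concrete, here is a minimal sketch of UTF-16 encoding. The helper name to_utf16 is hypothetical and not taken from any library discussed in this thread; it only illustrates that a code point above U+FFFF needs two 16-bit units.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, for illustration only: encode one Unicode code point
// (U+0000..U+10FFFF, excluding the surrogate range) as UTF-16 code units.
std::vector<std::uint16_t> to_utf16(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::vector<std::uint16_t> out;
    if (cp < 0x10000) {
        // Fits in a single 16-bit unit.
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Subtract 0x10000, leaving 20 bits split across two units.
        std::uint32_t v = cp - 0x10000;
        out.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));   // high surrogate
        out.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return out;
}
```

With a 16-bit wchar_t, any string containing such a code point has more wchar_t elements than characters, so indexing by element no longer means indexing by character.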
When forced to deal with Unicode, I much prefer working with 32 bits, since that guarantees a fixed length for each character. Admittedly, it is space inefficient to the Nth degree, but speedwise it is better.

As for interoperability with Windows, it is clearly stated that wchar_t is intended for internal usage only, and that the various encoding schemes should be used when storing strings outside of a process. In practice this means that just about every Unicode-capable application reads and writes UTF-8 or UTF-7, so interoperability should not become an issue. If it really had been expected to be an issue, I'm sure the C++ standard would have mandated a specific width for wchar_t, which as far as I am aware it doesn't. The draft copy I pulled out via Google says the following:

    Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (_lib.locale_). Type wchar_t shall have the same size, signedness, and alignment requirements (_intro.memory_) as one of the other integral types, called its underlying type.

So, in light of this, what would be the most appropriate choice? I haven't yet had a chance to explore what locales we support, but I would lean toward saying wchar_t == 32 bits, since this is future proof. If we are later forced to go from 16 -> 32 bits because we support more of the Asian locales, I foresee this causing _major_ breakage.

If anyone actually has a copy of the C++ standard and would be kind enough to paste the section regarding the size of wchar_t, that would be most helpful for this discussion, I believe.
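
As an illustration of "fixed width internally, an encoding scheme externally", the following is a minimal sketch that turns one 32-bit code point into UTF-8 bytes for storage outside the process. The name to_utf8 is hypothetical, and the sketch assumes the internal representation is a plain 32-bit code point; it says nothing about what FreeBSD's libc actually does.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper, for illustration only: encode one 32-bit code point
// (U+0000..U+10FFFF, excluding the surrogate range) as UTF-8 bytes.
std::string to_utf8(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The byte sequence produced this way does not depend on the width of wchar_t on either side, which is why the on-disk or on-wire format need not match the in-memory one.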
Regards,
/Johny
--
Johny Mattsson            | Email: Johny.Mattsson@ericsson.com.au
Ericsson Support Engineer | Phone: +61 (0)3 9301 1372
NCSA NetScreen Certified  | Mobile: +61 (0)404 003 713

> -----Original Message-----
> From: Terry Lambert [SMTP:tlambert2@mindspring.com]
> Sent: Tuesday, June 18, 2002 9:47 PM
> To: Thomas David Rivers
> Cc: mb@imp.ch; current@FreeBSD.ORG; wollman@lcs.mit.edu
> Subject: Re: PATCH: wchar_t is already defined in libstd++
>
> o A desire for raw storage of Unicode, rather than UTF-8 or
>   UTF-7 encoding. This last one is:
>
>     o UTF encoding breaks fixed field storage, which has
>       always been a measure of the number of characters
>       you can put in a field.

To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-current" in the body of the message