Date:      Wed, 19 Jun 2002 10:58:58 +1000
From:      "Johny Mattsson (EPA)" <Johny.Mattsson@ericsson.com.au>
To:        "'freebsd-current@freebsd.org'" <freebsd-current@freebsd.org>
Subject:   RE: PATCH: wchar_t is already defined in libstd++
Message-ID:  <4B6BC00CD15FD2119E5F0008C7A419A514C8BB48@eaubrnt018.epa.ericsson.se>

Hi Terry and all,

I usually just lurk on the list, but since I'm a C++ aficionado, I wanted
to question the statement of yours quoted below.

If we settle on wchar_t being 16 bits, then we will still be forced to use
a variable-length encoding (UTF-7/8/16) to properly handle an arbitrary
Unicode (or ISO/IEC 10646) string, since we must deal with that charming
thing known as "surrogate pairs" (see section 3.7 of the Unicode standard
v3.0). This again breaks the "one wchar_t == one character" assumption.
When forced to deal with Unicode, I much prefer working with 32 bits, since
that guarantees a fixed length for each character. Admittedly, it is space
inefficient to the Nth degree, but speedwise it is better: finding the Nth
character is a plain array index rather than a scan from the start.
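
To make the surrogate pair issue concrete, here is a minimal C++11 sketch of
how a single code point is turned into 16-bit units. The function name
encode_utf16 is my own invention for illustration and is not part of any
standard library; the arithmetic follows the UTF-16 definition.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// encode_utf16 is a hypothetical helper, not a library function.
std::u16string encode_utf16(char32_t cp) {
    // Valid input: at most U+10FFFF and not itself a surrogate value.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::u16string out;
    if (cp < 0x10000) {
        // Fits in a single 16-bit unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Needs a surrogate pair: split the remaining 20 bits in half.
        char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```

For example, encode_utf16(0x41) yields one unit, while encode_utf16(0x10400)
yields the two units 0xD801 0xDC00, so a 16-bit wchar_t string cannot be
indexed by character.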

As for interoperability with Windows, it is clearly stated that wchar_t is
intended for internal usage only, and that the various encoding schemes
should be used when storing strings outside of a process. In practice this
means that just about every Unicode-capable application reads and writes
UTF-8 or UTF-7, so interoperability should not become an issue. If it had
really been expected to be an issue, I'm sure the C++ standard would have
mandated a specific width for wchar_t, which as far as I am aware it
doesn't. The draft copy I pulled out via Google says the following:

    Type wchar_t is a distinct type whose values can represent distinct
    codes for all members of the largest extended character set specified
    among the supported locales (_lib.locale_). Type wchar_t shall have
    the same size, signedness, and alignment requirements (_intro.memory_)
    as one of the other integral types, called its underlying type.

So, in light of this, what would be the most appropriate choice? I
haven't yet had a chance to explore what locales we support, but I would
lean toward saying wchar_t == 32 bits, since this is future-proof. If we
are later forced to go from 16 -> 32 bits because we support more of the
Asian locales, I foresee this causing _major_ breakage.
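
For anyone wanting to check what a given compiler currently does, a small
C++ sketch along these lines reports the width of wchar_t and whether one
wchar_t can hold every code point up to U+10FFFF without surrogates. The
helper names wchar_bits and wchar_covers_unicode are hypothetical, chosen
only for this example.

```cpp
#include <climits>
#include <limits>

// Width of wchar_t in bits on this implementation.
// (Hypothetical helper for illustration.)
unsigned wchar_bits() {
    return static_cast<unsigned>(sizeof(wchar_t) * CHAR_BIT);
}

// True if a single wchar_t can hold any code point up to U+10FFFF,
// i.e. no surrogate pairs are needed. (Hypothetical helper.)
bool wchar_covers_unicode() {
    return static_cast<unsigned long>(std::numeric_limits<wchar_t>::max())
           >= 0x10FFFFul;
}
```

With a 16-bit wchar_t the second function returns false; with a 32-bit
wchar_t, signed or unsigned, it returns true.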

If anyone actually has a copy of the C++ standard and would be kind enough
to paste the section regarding the size of wchar_t, I believe that would be
most helpful for this discussion.

Regards,
/Johny
--
Johny Mattsson                	| Email: Johny.Mattsson@ericsson.com.au
Ericsson Support Engineer	| Phone: +61 (0)3 9301 1372
NCSA NetScreen Certified	| Mobile: +61 (0)404 003 713


> -----Original Message-----
> From:	Terry Lambert [SMTP:tlambert2@mindspring.com]
> Sent:	Tuesday, June 18, 2002 9:47 PM
> To:	Thomas David Rivers
> Cc:	mb@imp.ch; current@FreeBSD.ORG; wollman@lcs.mit.edu
> Subject:	Re: PATCH: wchar_t is already defined in libstd++
> 
> 
> o	A desire for raw storage of Unicode, rather than UTF-8 or
> 	UTF-7 encoding.  This last one is:
> 
> 	o	UTF encoding breaks fixed field storage, which has
> 		always been a measure of the number of characters
> 		you can put in a field.
> 
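
The fixed field storage point quoted above can be illustrated with a short
C++11 sketch of how many bytes UTF-8 needs per code point. It assumes the
modern four-byte limit of UTF-8 (RFC 3629); the names utf8_length and
utf8_field_bytes are hypothetical and exist only for this example.

```cpp
// Number of bytes UTF-8 needs for one code point, assuming the
// RFC 3629 range (at most U+10FFFF). Hypothetical helper.
unsigned utf8_length(char32_t cp) {
    if (cp < 0x80)    return 1;  // ASCII
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;                    // beyond the 16-bit range
}

// Worst-case byte size of a field meant to hold n_chars characters,
// under the same four-byte assumption. Hypothetical helper.
unsigned long utf8_field_bytes(unsigned long n_chars) {
    return n_chars * 4;
}
```

A field sized for 10 characters therefore takes 10 bytes for ASCII text but
may need up to 40 bytes in UTF-8, which is why a byte count no longer
measures the number of characters that fit.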

