Date: Wed, 19 Jun 2002 10:58:58 +1000
From: "Johny Mattsson (EPA)" <Johny.Mattsson@ericsson.com.au>
To: "'freebsd-current@freebsd.org'" <freebsd-current@freebsd.org>
Subject: RE: PATCH: wchar_t is already defined in libstd++
Message-ID: <4B6BC00CD15FD2119E5F0008C7A419A514C8BB48@eaubrnt018.epa.ericsson.se>
[-- Attachment #1 --]

Hi Terry and all,

I usually just lurk on the list, but since I'm a C++ aficionado, I wanted to question your statement, snipped below.

If we settle on wchar_t being 16 bits, then we will still be forced to do UTF-7/8/16 to properly handle a random Unicode (or ISO/IEC 10646) string, since we must deal with that charming thing known as "surrogate pairs" (see section 3.7 of the Unicode standard v3.0). This again breaks the "one wchar_t == one character" assumption. When being forced to deal with Unicode, I much prefer working with 32 bits, since that guarantees that I get a fixed length for each character. Admittedly, it is space inefficient to the Nth degree, but speedwise it is better.

As for interoperability with Windows, it is clearly stated that wchar_t is intended for internal usage only, and the various encoding schemes should be used when storing strings outside of a process. In reality this means that just about every Unicode capable application reads and writes in UTF-8 or UTF-7. This means that interoperability should not become an issue. If it really was expected to be an issue, I'm sure the C++ standard would have mandated a specific width for wchar_t, which as far as I am aware it didn't. The draft copy I pulled out via google says the following:

    Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (_lib.locale_). Type wchar_t shall have the same size, signedness, and alignment requirements (_intro.memory_) as one of the other integral types, called its underlying type.

So, in the light of this, what would be the most appropriate choice? I haven't yet had a chance to explore what locales we support, but I would lean toward saying wchar_t == 32 bits, since this is future proof. If we are later forced to go from 16 -> 32 due to us supporting more of the Asian locales, I foresee this causing _major_ breakage.

If anyone actually has a copy of the C++ standard and would be kind enough to paste the section regarding the size of wchar_t, I believe that would be most helpful for this discussion.

Regards,
/Johny
--
Johny Mattsson | Email: Johny.Mattsson@ericsson.com.au
Ericsson Support Engineer | Phone: +61 (0)3 9301 1372
NCSA NetScreen Certified | Mobile: +61 (0)404 003 713

> -----Original Message-----
> From: Terry Lambert [SMTP:tlambert2@mindspring.com]
> Sent: Tuesday, June 18, 2002 9:47 PM
> To: Thomas David Rivers
> Cc: mb@imp.ch; current@FreeBSD.ORG; wollman@lcs.mit.edu
> Subject: Re: PATCH: wchar_t is already defined in libstd++
>
>
> o A desire for raw storage of Unicode, rather than UTF-8 or
>   UTF-7 encoding. This last one is:
>
> o UTF encoding breaks fixed field storage, which has
>   always been a measure of the number of characters
>   you can put in a field.
>
Want to link to this message? Use this
URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4B6BC00CD15FD2119E5F0008C7A419A514C8BB48>
