Date: Tue, 18 Jun 2002 18:28:28 -0700
From: Terry Lambert <tlambert2@mindspring.com>
To: "Peter S. Housel" <housel@acm.org>
Cc: current@FreeBSD.ORG, Thomas David Rivers <rivers@dignus.com>
Subject: Re: PATCH: wchar_t is already defined in libstd++
Message-ID: <3D0FDE3C.681A3207@mindspring.com>
References: <200206181119.g5IBJX954922@lakes.dignus.com> <3D0F1D98.31B49358@mindspring.com> <004401c216ec$844088f0$6621010a@housel7352>

"Peter S. Housel" wrote:
> > o	Complete disdain for ISO-10646 being 32 bits, when 16
> >	of them are never anything but 0, and were put there just
> >	so that people could grep -v other people's languages out
> >	of documents
> >
> > o	I'll believe Hieroglyphics and Linear B when I see the
> >	fonts and the programs that use them.  Dead languages
> >	pretty much justify purpose-built linguistics software
> >	anyway.
>
> If you were a MathML user, or had a Chinese name using an obscure character,
> you would probably feel differently.

Why?

Have the Chinese sent representatives to an international
standards body to get code pages other than 0 filled in with
these characters?  Have the MathML users?

Basically, it's not necessary to have bits to represent these
code points until they are parts of a standard character set.
The entire point of Unicode was to provide round-trip capability
between character sets.

For MathML, you can actually unify the code points with Zapf
or other characters that don't exist simultaneously in any
character sets.  Alternately, you could use a "private use"
area.

> > o	A desire for raw storage of Unicode, rather than UTF-8 or
> >	UTF-7 encoding.  This last one is:
>
> You still need at least 21 bits to have "raw storage of Unicode".  With
> anything less, either UTF-16 surrogates or UTF-8 multi-byte encodings have
> to be used.  With a 16-bit wchar_t, even if I personally don't have any text
> that uses characters beyond the BMP, I still have to write my code to
> account for surrogates.

Unicode 3.2.0 is not an ISO/IEC standard.  It's a political thing.

You might have an argument for ISO-10646-2:2001; however "Klingon"
is not a script I'm really worried about.  8-).

> > o	People might accept doubling data size for the benefit
> >	of internationalization.  They aren't going to accept
> >	a random multiplier between 1 and 5.
>
> I suspect UTF-16 doesn't compress very well using standard tools, and it is
> subject to byte-order difficulties.  (That goes double for UTF-32, of
> course.)  wchar_t probably shouldn't be directly used for storage.

Anything larger than a byte has byte order problems; that was
one of the original rationales for UTF-8 encoding.

-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message
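
For readers following the surrogate and encoding-size argument quoted above, here is a minimal C sketch of what "accounting for surrogates" with 16-bit units involves, and of how many bytes UTF-8 needs per code point. It is not from the thread: the helper names utf16_combine and utf8_length are hypothetical, and the sketch assumes the Unicode range U+0000..U+10FFFF, in which UTF-8 uses at most 4 bytes per code point.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: combine a UTF-16 high/low surrogate pair into a
 * single code point beyond the BMP.  Returns 0xFFFFFFFF if the two units
 * do not form a valid pair.
 */
static uint32_t
utf16_combine(uint16_t hi, uint16_t lo)
{
	if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
		return 0xFFFFFFFFu;
	/* 10 bits from each unit, offset by 0x10000: up to 21 bits total. */
	return 0x10000u +
	    (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}

/*
 * Hypothetical helper: number of bytes UTF-8 needs for one code point,
 * assuming the range U+0000..U+10FFFF.  Returns 0 for surrogate values
 * and for anything out of range.
 */
static size_t
utf8_length(uint32_t cp)
{
	if (cp < 0x80)
		return 1;
	if (cp < 0x800)
		return 2;
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;
	if (cp < 0x10000)
		return 3;
	if (cp <= 0x10FFFF)
		return 4;
	return 0;
}
```

As an illustration, utf16_combine(0xD835, 0xDC9C) yields U+1D49C, a mathematical alphanumeric symbol outside the BMP, which takes two 16-bit units in UTF-16 and four bytes in UTF-8, while a plain ASCII letter takes one byte in UTF-8.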