From owner-freebsd-current Tue Jun 18 18:29:53 2002
Message-ID: <3D0FDE3C.681A3207@mindspring.com>
Date: Tue, 18 Jun 2002 18:28:28 -0700
From: Terry Lambert
To: "Peter S. Housel"
Cc: current@FreeBSD.ORG, Thomas David Rivers
Subject: Re: PATCH: wchar_t is already defined in libstd++
References: <200206181119.g5IBJX954922@lakes.dignus.com> <3D0F1D98.31B49358@mindspring.com> <004401c216ec$844088f0$6621010a@housel7352>

"Peter S. Housel" wrote:
> > o    Complete disdain for ISO-10646 being 32 bits, when 16
> >      of them are never anything but 0, and were put there just
> >      so that people could grep -v other people's languages out
> >      of documents
> >
> > o    I'll believe Hieroglyphics and Linear B when I see the
> >      fonts and the programs that use them.  Dead languages
> >      pretty much justify purpose-built linguistics software
> >      anyway.
>
> If you were a MathML user, or had a Chinese name using an obscure character,
> you would probably feel differently.

Why?

Have the Chinese sent representatives to an international
standards body to get code pages other than 0 filled in with
these characters?  Have the MathML users?

Basically, it's not necessary to have bits to represent these
code points until they are part of a standard character set.
The entire point of Unicode was to provide round-trip capability
between character sets.

For MathML, you can actually unify the code points with Zapf or
other characters that don't exist simultaneously in any one
character set.  Alternately, you could use a "private use" area.

> > o    A desire for raw storage of Unicode, rather than UTF-8 or
> >      UTF-7 encoding.  This last one is:
>
> You still need at least 21 bits to have "raw storage of Unicode".  With
> anything less, either UTF-16 surrogates or UTF-8 multi-byte encodings have
> to be used.  With a 16-bit wchar_t, even if I personally don't have any text
> that uses characters beyond the BMP, I still have to write my code to
> account for surrogates.

Unicode 3.2.0 is not an ISO/IEC standard.  It's a political thing.

You might have an argument for ISO-10646-2:2001; however, "Klingon"
is not a script I'm really worried about.  8-).

> > o    People might accept doubling data size for the benefit
> >      of internationalization.  They aren't going to accept
> >      a random multiplier between 1 and 5.
>
> I suspect UTF-16 doesn't compress very well using standard tools, and it is
> subject to byte-order difficulties.  (That goes double for UTF-32, of
> course.)  wchar_t probably shouldn't be directly used for storage.

Anything larger than a byte has byte order problems; that was
one of the original rationales for UTF-8 encoding.

-- Terry
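
For readers following the surrogate and byte-order points above, here
is a minimal sketch in C of what "accounting for surrogates" and
"UTF-8 multi-byte encoding" involve.  It is not taken from the patch
under discussion; the function names utf16_combine() and utf8_encode()
are hypothetical, and the encoder assumes the present-day limit of
U+10FFFF (at most 4 bytes per code point), which is narrower than the
longer sequences the original UTF-8 definition permitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Combine a UTF-16 surrogate pair into a single code point.
 * Returns U+FFFD (REPLACEMENT CHARACTER) if the pair is malformed.
 * (Hypothetical helper, for illustration only.)
 */
uint32_t
utf16_combine(uint16_t hi, uint16_t lo)
{
	if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
		return (0xFFFD);
	return (0x10000 +
	    (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00)));
}

/*
 * Encode one code point as UTF-8 into out[0..3].
 * Returns the number of bytes written (1 to 4), or 0 if cp is a
 * surrogate or lies above U+10FFFF.  The output is a byte sequence,
 * so it is the same on big-endian and little-endian hosts.
 */
size_t
utf8_encode(uint32_t cp, unsigned char out[4])
{
	assert(out != NULL);

	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return (1);
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return (2);
	}
	if (cp < 0x10000) {
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return (0);	/* lone surrogate, not a character */
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return (3);
	}
	if (cp <= 0x10FFFF) {
		out[0] = (unsigned char)(0xF0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return (4);
	}
	return (0);
}
```

As a usage example, the surrogate pair 0xD835 0xDC00 combines to
U+1D400 (a mathematical alphanumeric symbol outside the BMP), and
utf8_encode() then writes it as the four bytes F0 9D 90 80.  An ASCII
letter such as U+0041 stays one byte, which is the variable size
multiplier the thread is arguing about.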