Date:      Tue, 18 Jun 2002 18:28:28 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        "Peter S. Housel" <housel@acm.org>
Cc:        current@FreeBSD.ORG, Thomas David Rivers <rivers@dignus.com>
Subject:   Re: PATCH: wchar_t is already defined in libstd++
Message-ID:  <3D0FDE3C.681A3207@mindspring.com>
References:  <200206181119.g5IBJX954922@lakes.dignus.com> <3D0F1D98.31B49358@mindspring.com> <004401c216ec$844088f0$6621010a@housel7352>

"Peter S. Housel" wrote:
> > o Complete disdain for ISO-10646 being 32 bits, when 16
> > of them are never anything but 0, and were put there just
> > so that people could grep -v other people's languages out
> > of documents
> >
> > o I'll believe Hieroglyphics and Linear B when I see the
> > fonts and the programs that use them.  Dead languages
> > pretty much justify purpose-built linguistics software
> > anyway.
> 
> If you were a MathML user, or had a Chinese name using an obscure character,
> you would probably feel differently.

Why?  Have the Chinese sent representatives to an international
standards body to get code pages other than 0 filled in with
these characters?  Have the MathML users?

Basically, it's not necessary to have bits to represent these
code points until they are parts of a standard character set.
The entire point of Unicode was to provide round-trip capability
between character sets.

For MathML, you can actually unify the code points with Zapf or
other characters that don't exist simultaneously in any character
set.  Alternately, you could use a "private use" area.


> > o A desire for raw storage of Unicode, rather than UTF-8 or
> > UTF-7 encoding.  This last one is:
> 
> You still need at least 21 bits to have "raw storage of Unicode".  With
> anything less, either UTF-16 surrogates or UTF-8 multi-byte encodings have
> to be used.  With a 16-bit wchar_t, even if I personally don't have any text
> that uses characters beyond the BMP, I still have to write my code to
> account for surrogates.

Unicode 3.2.0 is not an ISO/IEC standard.  It's a political thing.

You might have an argument for ISO-10646-2:2001; however "Klingon"
is not a script I'm really worried about.  8-).
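
To make concrete what "account for surrogates" means in the quoted
paragraph, here is a minimal sketch in C, assuming text held as an
array of 16-bit units.  The function name utf16_decode and its
return convention are made up for illustration; they are not from
any FreeBSD or libstdc++ interface.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Hypothetical helper: decode one code point from UTF-16 held in
 * 16-bit units.  Returns the number of units consumed (1 or 2), or
 * 0 if the input is empty, truncated, or has an unpaired surrogate.
 */
size_t utf16_decode(const uint16_t *s, size_t n, uint32_t *out)
{
    if (n == 0)
        return 0;

    uint16_t hi = s[0];
    if (hi < 0xD800 || hi > 0xDFFF) {
        /* Ordinary BMP character: one unit, value used as-is. */
        *out = hi;
        return 1;
    }

    /* A low surrogate may not come first; a high one needs a partner. */
    if (hi > 0xDBFF || n < 2)
        return 0;

    uint16_t lo = s[1];
    if (lo < 0xDC00 || lo > 0xDFFF)
        return 0;

    /* 10 bits from each half, offset past the BMP. */
    *out = 0x10000 + (((uint32_t)(hi - 0xD800) << 10) |
                      (uint32_t)(lo - 0xDC00));
    return 2;
}
```

With the pair 0xD835 0xDC00 this yields U+1D400, a mathematical
alphanumeric symbol outside the BMP; any code that indexes or
truncates a 16-bit array has to avoid splitting such a pair.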


> > o People might accept doubling data size for the benefit
> > of internationalization.  They aren't going to accept
> > a random multiplier between 1 and 5.
> 
> I suspect UTF-16 doesn't compress very well using standard tools, and it is
> subject to byte-order difficulties.  (That goes double for UTF-32, of
> course.)  wchar_t probably shouldn't be directly used for storage.

Anything larger than a byte has byte order problems; that was one
of the original rationales for UTF-8 encoding.
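
As a rough illustration of why UTF-8 sidesteps byte order: the
encoding is defined one byte at a time, so the output below is the
same on any host, whereas writing a 16-bit or 32-bit unit straight
from memory is not.  This is only a sketch; utf8_encode is a
hypothetical name, and the limit of U+10FFFF (four bytes) is an
assumption taken from the 21-bit figure quoted above.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Hypothetical helper: encode one code point as UTF-8 into buf,
 * which must hold at least 4 bytes.  Returns the number of bytes
 * written, or 0 for a surrogate value or anything above U+10FFFF.
 */
size_t utf8_encode(uint32_t cp, unsigned char *buf)
{
    if (cp < 0x80) {                       /* 7 bits: plain ASCII */
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 11 bits: two bytes */
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 16 bits: three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 21 bits: four bytes */
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, U+20AC always comes out as the bytes E2 82 AC, while
the same value stored as a raw 16-bit unit is AC 20 on a
little-endian machine and 20 AC on a big-endian one.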

-- Terry
