Date:      Tue, 15 May 2001 03:39:52 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Valentin Nechayev <netch@iv.nn.kiev.ua>
Cc:        Alfred Perlstein <bright@wintelcom.net>, Erik Trulsson <ertr1013@student.uu.se>, hackers@FreeBSD.ORG
Subject:   Re: wint_t
Message-ID:  <3B010778.287FAF5E@mindspring.com>
References:  <20010514164401.A61243@dragon.nuxi.com> <20010515023221.A41666@student.uu.se> <20010514174502.J2009@fw.wintelcom.net> <20010515093610.A1835@iv.nn.kiev.ua>

Valentin Nechayev wrote:
> Modern Unicode allows character codes greater than 65534.
> wchar_t(65536) is an Egyptian glyph ;)  The maximum allowed,
> AFAIR, is 2**31-1.  So at least a 32 bit integer type is
> required, if you don't want to adapt the system to the
> requirements of the former millennium.

This argument came up on comp.lang.internat, when we were
first discussing the creation of the Unicode standard,
back when most of the people pushing it were Apple,
Taligent, and Adobe (it's no mistake that the "private
use" areas are discontiguous: active rendering engines
like Display PostScript and other Adobe technology work
for ligatured languages like Hebrew, Arabic, Tamil, and
Devanagari, but fixed cell prerendered fonts, like those
used in X servers, have to jump through serious hoops to
render those languages' scripts properly).

This isn't really a realistic application for anything but
linguistic scholarship, since it's a dead language; I'm
sure the Trekkies will insist we add "Klingon" at some
point.  Probably the Egyptian addition was intentional,
to push the code space past 16 bits, just to screw with
Microsoft's head.  I doubt it will make it into the
ratified standard.

The issue with linguistic scholarship is that it's not
possible to deal with it without having a multilingual
application.

Internationalization, in general, is the process of
taking code and making it possible to localize it into
a particular -- monolingual -- locale.

You need special software to deal with multilingual
text; the vast majority of software doesn't have to
do that (about the only place you will see it is in
applications used by translators).

In other words, you don't end up being able to display
characters that weren't in a particular round-trip
character set standard beforehand.

What are the ISO/ECMA round-trip standards for Hieroglyphics,
for us to round-trip into and out of for rendering purposes?

8-).

> But wint_t must be no narrower than wchar_t.  <curses.h>
> and <ncurses.h> define wchar_t as unsigned long.  System
> headers define wchar_t as int (through _BSD_WCHAR_T_ and
> _BSD_CT_RUNE_T_).  This difference in size and signedness
> is at least annoying.  I suppose wchar_t should be
> __uint32_t, and wint_t __int32_t, but this may break
> binary compatibility.
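
To make the mismatch concrete, here is a sketch of the
collision being described (the guard macro is the one quoted
above; exact definitions vary by header version):

    /*
     * System headers, via the old BSD <machine/ansi.h>
     * convention, do the equivalent of:
     */
    #ifdef  _BSD_WCHAR_T_
    typedef _BSD_WCHAR_T_   wchar_t;    /* _BSD_WCHAR_T_ is int */
    #undef  _BSD_WCHAR_T_
    #endif

    /*
     * while the curses headers historically do the equivalent of:
     *
     *      typedef unsigned long wchar_t;
     *
     * which differs in signedness everywhere, and in size on any
     * LP64 platform, so objects compiled against one header
     * disagree with objects compiled against the other.
     */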

It's widely acknowledged that the reason for the 32 bit
code page size in ISO 10646, where only the 0x0000????
plane was ever populated, was to appease the Japanese,
who disliked the code point unification of the CJK
unification, primarily because it used Chinese dictionary
ordering: Chinese dictionaries do stroke+radical
classification of characters, and so were capable of
ordering all of the characters in the unified set by
common and repeatable rules.

Effectively, this left Japanese dictionary ordering in a
"less favorable" position, in that one could not just use
the ordinal value of the character to get the collation
sequence.  The humorous part is that there are two common
Japanese dictionary orderings, plus a separate ordering
used by NTT for the telephone directory (Germany likewise
has two collation orders, dictionary and telephone
directory), so no matter how you approach it, it is not
possible to appease all Japanese users directly: you must
use an external collation sequence.  Actually, the
Japanese wanted JIS-208 + JIS-212 ordering, so they were
not appeased by the change.  The primary opponent to
Unicode in Japan, if I had to pick one, is Masataka Ohta.
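
That is the practical meaning of "external collation
sequence": you cannot sort by comparing code points, you have
to indirect through a weight table.  A sketch (the table here
is hypothetical, not any real locale's):

    /*
     * Sketch only: ordinal order is not dictionary order, so
     * sorting must go through an external collation table.
     */
    extern const unsigned short collate_weight[65536];

    int
    wch_collate(unsigned short a, unsigned short b)
    {
        /* NOT "return (a - b);" -- code point order is wrong */
        return ((int)collate_weight[a] - (int)collate_weight[b]);
    }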

The natural wchar_t size is 16 bits, and will remain that
size until code pages other than 0x0000???? are adopted
by ISO.

We are also living in a "Windows world"; that is to say
that Windows software pretty much dictates Unicode data
storage -- and Windows uses a 16-bit wchar_t.

If nothing else, our VFAT32 and Windows NT Unicode name
spaces for filenames depend on 16 bit wchar_t lengths.

For a purely UNIX argument, consider that you would need
directory blocks in excess of 1k to store a 256 character
filename in "native Unicode", if you insisted on 32 bit
characters.
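
The arithmetic is simple:

    256 chars * 2 bytes = 512 bytes     (16 bit wchar_t)
    256 chars * 4 bytes = 1024 bytes    (32 bit wchar_t)

and that 1024 bytes is the name alone, before the inode number
or any other directory entry metadata.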

We already have abominations like the UTF-8, UTF-7, and
UTF-5 encodings (mostly so the American ASCII (ISO 646)
bigots don't have to fix their legacy systems properly),
which play hell with fixed length field encodings in
records intended to contain textual data.
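
To see why, consider how many bytes one character occupies
under UTF-8 (this follows RFC 2279, the then-current
definition, which covered the full 31 bit range):

    /* Bytes needed to encode one character in UTF-8: */
    int
    utf8_len(unsigned long c)
    {
        if (c < 0x80)       return (1);     /* ASCII unchanged */
        if (c < 0x800)      return (2);
        if (c < 0x10000)    return (3);
        if (c < 0x200000)   return (4);
        if (c < 0x4000000)  return (5);
        return (6);                         /* up to 2**31 - 1 */
    }

A "fixed length" text field no longer holds a fixed number of
characters.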

Doing that type of multibyte storage encoding breaks an
entire class of algorithms (e.g. variable length records
break "numrecs = filesize/sizeof(struct record);" type
calculations).  That's really a poor trade.
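
A sketch of that class of code (the record layout is made up
for illustration):

    /*
     * With fixed width characters, the record count falls
     * directly out of the file size:
     */
    struct record {
        unsigned short  name[32];   /* always 64 bytes */
        unsigned long   serial;
    };

    numrecs = filesize / sizeof(struct record);

    /*
     * Store the name as UTF-8 instead, and "32 characters" is
     * anywhere from 32 to 96 or more bytes, so there is nothing
     * constant to divide by anymore.
     */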


Probably the funniest thing is that users don't care,
as long as the code works.  Windows is sold in Japan,
and it brings Unicode in with it.  And users couldn't
care less that it isn't JIS-208 + JIS-212, as long as
their email works -- Ohta-san's outspoken opposition
notwithstanding.

If they start having to double the amount of RAM in
their systems to get anything done, though, they will
start to care very quickly; and that's what a 32 bit
encoding means for any text processing tools.

I maintain that the correct size for wchar_t is 16 bits,
until someone can point to a character set that needs
more than that, and which has been ratified by a standards
body.

-- Terry
