Date: Tue, 15 May 2001 03:39:52 -0700
From: Terry Lambert <tlambert2@mindspring.com>
To: Valentin Nechayev <netch@iv.nn.kiev.ua>
Cc: Alfred Perlstein <bright@wintelcom.net>, Erik Trulsson <ertr1013@student.uu.se>, hackers@FreeBSD.ORG
Subject: Re: wint_t
Message-ID: <3B010778.287FAF5E@mindspring.com>
References: <20010514164401.A61243@dragon.nuxi.com> <20010515023221.A41666@student.uu.se> <20010514174502.J2009@fw.wintelcom.net> <20010515093610.A1835@iv.nn.kiev.ua>

Valentin Nechayev wrote:
> Modern Unicode allows character codes more than 65534.
> wchar_t(65536) is Egyptian glyph;) Maximum allowed AFAIR is
> 2**31-1. So at least 32 bits integer type required if you
> don't want adapt system to former millennium requires.

This argument came up on comp.lang.internat, when we were first
discussing the creation of the Unicode standard, back when most of
the people pushing it were Apple, Taligent, and Adobe. (It's no
mistake that the "private use" areas are discontiguous, such that
active rendering engines like Display PostScript and other Adobe
technology work for things like Hebrew, Arabic, Tamil, Devanagari,
and other ligatured scripts, but fixed cell prerendered fonts, like
those used in X servers, have to jump through serious hoops to
render those scripts properly.)

Egyptian hieroglyphics aren't really a realistic application for
anything but linguistic scholarship, since Egyptian is a dead
language; I'm sure the Trekkies will insist we add "Klingon" at some
point. Probably, the Egyptian addition was intentional, to push it
over 16 bits, just to screw with Microsoft's head. I doubt it will
make it into a ratified standard.

The issue with linguistic scholarship is that it's not possible to
deal with it without having a multilingual application.
Internationalization, in general, is the process of taking code and
making it possible to localize it into a particular -- monolingual --
locale. You need special software to deal with multilingual text;
the vast majority of software doesn't have to do that (about the
only place you will see it is in an application used by translators).

In other words, you don't end up being able to display characters
that weren't in a particular round-trip character set standard
beforehand. What are the ISO/ECMA round trip standards for
hieroglyphics, for us to round-trip into and out of for rendering
purposes? 8-).

> But wint_t must be no narrower than wchar_t. <curses.h>
> and <ncurses.h> define wchar_t as unsigned long. System
> headers define wchar_t as int (thru _BSD_WCHAR_T_ and
> _BSD_CT_RUNE_T_). This difference in size and signness
> is at least annoying. I suppose wchar_t should be
> __uint32_t, and wint_t - __int32_t, but this may break
> binary compatibility.

It's widely acknowledged that the reason for the 32 bit code page
size, where only the 0x0000???? page was ever populated in ISO 10646,
was to appease the Japanese, who disliked the CJK code point
unification, primarily because it used Chinese dictionary ordering.
Chinese dictionaries classify characters by stroke+radical, and so
were capable of ordering all of the characters in the unified set by
common and repeatable rules. Effectively, this left Japanese
dictionary ordering in a "less favorable" position, in that one
could not just use the ordinal value of the character to get the
collation sequence.

The humorous part of this is that there are two common Japanese
dictionary orderings, and a separate ordering used by NTT for the
telephone directory (Germany likewise has two collation orders, one
for dictionaries and one for telephone directories), so no matter
how you approach it, it is not possible to appease all Japanese
users directly: you must use an external collation sequence.

Actually, the Japanese wanted JIS-208 + JIS-212 ordering, so they
were not appeased by the change. The primary opponent to Unicode in
Japan, if I had to pick one, is Matsumata Ohta.

The natural wchar_t size is 16 bits, and will remain that size until
code pages other than 0x0000???? are adopted by ISO.

We are also living in a "Windows world"; that is to say that Windows
software pretty much dictates Unicode data storage -- and Windows
uses a 16-bit wchar_t. If nothing else, our VFAT32 and Windows NT
Unicode name spaces for filenames depend on 16 bit wchar_t lengths.

For a purely UNIX argument, consider that you would need directory
blocks in excess of 1k to store a 256 character filename in "native
Unicode", if you insisted on 32 bit characters.

We already have abominations like the UTF-8, UTF-7, and UTF-5
encodings (mostly so that the American ASCII (ISO 646) bigots don't
have to fix their legacy systems correctly), which play hell with
fixed length field encodings in records intended to contain textual
data. Doing that type of multibyte storage encoding breaks an entire
class of algorithms (e.g. variable length records break
"numrecs = filesize/sizeof(struct record);" type calculations).
That's really a poor trade.

Probably the funniest thing is that users don't care, as long as the
code works. Windows is sold in Japan, and it brings Unicode in with
it. And users couldn't care less that it isn't JIS-208 + JIS-212, as
long as their email works, Ohta-san's outspoken opposition
notwithstanding.

If they start having to double the amount of RAM in their systems to
get anything done, though, they will start to care very quickly; and
that's what a 32 bit encoding means for any text processing tools.

I maintain that the correct size for wchar_t is 16 bits, until
someone can point to a character set that needs more than that, and
which has been ratified by a standards body.

-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message
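
The arithmetic behind the two storage points in the message (the size of a 256 character filename, and the "numrecs = filesize/sizeof(struct record)" calculation) can be sketched roughly as follows. This is only an illustration in Python under stated assumptions: the 68-byte record layout, the sample names, and the helpers `name_bytes`, `pack_record`, and `count_records` are hypothetical and do not come from any FreeBSD header or on-disk format.

```python
import struct

NAME_CHARS = 256  # filename length in characters, as in the message


def name_bytes(chars, bytes_per_char):
    """Bytes needed to hold `chars` characters at a fixed width."""
    return chars * bytes_per_char


# Hypothetical fixed-length record: a 4-byte id followed by a 64-byte
# name field (room for 32 characters stored as 16-bit code units).
RECORD = struct.Struct("<I64s")


def pack_record(rec_id, name):
    """Pack one record; struct pads the name field with NUL bytes."""
    raw = name.encode("utf-16-le")
    if len(raw) > 64:
        raise ValueError("name does not fit in the fixed field")
    return RECORD.pack(rec_id, raw)


def count_records(file_size):
    """The filesize / sizeof(struct record) style calculation."""
    return file_size // RECORD.size


# Three sample names: ASCII, three CJK characters, ASCII.
names = ["a", "\u65e5\u672c\u8a9e", "filename"]

# Fixed-width storage: every record has the same size, so the record
# count falls out of a single division.
fixed_file = b"".join(pack_record(i, n) for i, n in enumerate(names))

# Variable-width storage: UTF-8 lengths differ per name, so no single
# divisor recovers the record count from the file size.
utf8_lengths = [len(n.encode("utf-8")) for n in names]

print(name_bytes(NAME_CHARS, 2), name_bytes(NAME_CHARS, 4))
print(count_records(len(fixed_file)), utf8_lengths)
```

At 4 bytes per character the name alone comes to exactly 1024 bytes, so any per-entry bookkeeping fields would presumably be what pushes a directory entry past 1k. At 2 bytes per character the same name takes 512 bytes.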