Date: Thu, 11 Jun 1998 22:36:57 +0000 (GMT)
From: Terry Lambert <tlambert@primenet.com>
To: itojun@iijlab.net (Jun-ichiro itojun Itoh)
Cc: joy@urc.ac.ru, kline@tao.thought.org, tlambert@primenet.com, hackers@FreeBSD.ORG
Subject: Re: internationalization
Message-ID: <199806112236.PAA28653@usr09.primenet.com>
In-Reply-To: <11417.897551055@coconut.itojun.org> from "Jun-ichiro itojun Itoh" at Jun 11, 98 04:44:15 pm

> >> Yes, iso-2022 families are quite important for supporting
> >> asian languages. Unicode is, for us Japanese, quite incomplete and
> >> unexpandable.
> >Do you mean Unicode does not cover all the CJK characters?
>
> Unicode maps different Chinese/Japanese/Korean letters into the same
> codepoint. The actual appearance (gryph) will be determined by
> the selection of font. (so, there will be font just for Chinese,
> font just for Japanese, and font just for Korean).

This is an oversimplification. There will be a font for each
round-trip character set. Character sets for which standards
existed that codified code points in different languages were
not unified. For example, English and Japanese.

This is only a problem in the case of trying to use two locales
simultaneously. This never happens, unless you are a linguistic
scholar or translator.

For linguistic scholars and translators, the issue is resolved by
using a markup language. The cost of using a markup language is
paid by the people needing more than one locale at the same time.

As opposed to all of us having to pay for it, the tiny number of
people engaged in scholarship and translation have to pay for it.
This is better because the people who benefit are made to pay for
the benefit, instead of everyone shouldering the burden for a few
unique applications.

> Therefore, it may be sufficient for supporting single asian language
> (for example Japanization), it is not sufficient for
> multilingualization (C/J/K support at the same time). With Unicode,
> you will never be able to write a plaintext with C/J/K letters mixed.
> For example, I frequently write such a plaintext, for list of plates
> for chinese restaurant, with description in Japanese attached.
> Such a plaintext cannot be generated with Unicode.

It can be generated with marked up Unicode, however.

Unicode is a character set, not a font. For reasons previously
detailed, Unicode can *never* be a font. And was never intended
as one.
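
To make "marked up Unicode" concrete, here is a minimal sketch of the
idea: the text stays plain Unicode, and a language tag carried alongside
each run picks the font. The tag names, the FONTS table and
assign_fonts() are all hypothetical, made up for illustration; this is
not SGML and not any existing library's interface.

```python
# Sketch: the same unified code points, rendered with different fonts
# depending on a language tag supplied by markup, not by the characters.

# Hypothetical font table: one font name per language tag.
FONTS = {"zh": "song-zh", "ja": "mincho-ja", "ko": "batang-ko"}


def assign_fonts(runs):
    """Map (language tag, text) runs to (font name, text) runs."""
    return [(FONTS[lang], text) for lang, text in runs]


# Two runs holding the identical code points U+8C46 U+8150,
# one tagged as Chinese and one tagged as Japanese.
menu = [("zh", "\u8c46\u8150"), ("ja", "\u8c46\u8150")]

for font, text in assign_fonts(menu):
    print(font, [hex(ord(ch)) for ch in text])
```

The character data in the two runs is identical; only the tag differs,
so only the people who need mixed text carry the cost of the tags.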

I defy you to show me a locale that supports both Japanese and
Chinese file names simultaneously. You won't be able to do it
because there is no character set standard that includes both
all of the Japanese and all of the Chinese code points.

> >What is "unexpandable"?
>
> Unicode people stressed Unicode because of the "fixed bitwidth"
> nature of Unicode. Therefore, basically they will not be able to
> support more than 2^16 letters.
> Recently Unicode introduced "surrogate pair" which makes Unicode
> a variable bitwidth character set. This breaks the key feature of
> Unicode, and it shows that Unicode is not expandable as nature.
> (Correct me if I'm wrong about "surrogate pair"...)

I believe you are. The real issue is not Unicode, which is code
page 0 of ISO 10646, but ISO 10646 itself, which supports 2^32
letters; 2^16 letters in each of 2^16 code pages. The only code
page defined right now is code page 0/16, which is defined to be
Unicode.

> iso-2022 is well designed to accomodate new character sets to appear
> later. Even with the most simplest subset it can accomodate bunch of
> character sets.

ISO 2022 is a font family markup standard, where font families are
made identical to round-trip character sets. ISO 2022 is an
*inferior* markup language, compared to SGML.

> Handling bare iso-2022 string is some hard to implement because it
> is variable length (yes I agree). If we can provide a good library
> for iso-2022, then there's no reason for us to migrate to Unicode.

Except that 85% of the computer systems in the world and 90% of the
computers in the Western world are going to be running Unicode by
the year 2010 because of Microsoft Windows and JAVA. And we would
like to be able to interoperate with them without paying a very
high conversion overhead when we do it.

> >> Yes, for Japanese, Chinese and Korean iso-2022 based model (euc-xx
> >> falls into the category) is really important. However, I
> >Why not to support both ISO 2022 and Unicode?
> >Yes, it is more difficult to implement. But otherwise we can lose
> >compatibility with other systems.
>
> Of course my library support both of them. If you say
> setrunelocale("UTF2"), the internal and external representation
> will be come Unicode. If you say setrunelocale("ja_JP.iso-2022-jp")
> it will be come Japanese iso-2022-jp encoding.

This is certainly a step in the right direction; however, I would
still desperately encourage the use of 16 bit wchar_t for internal
data representation in programs operating in a single locale.

The entire ISO 8859-X using world has 8 bit characters. Going to
UTF2 is asking them to attribute FS's where possible, and where not
possible, double the storage requirements for data.

Going to 32 bits, especially given that ISO 10646, the largest
character set standard you can point at, only defines code page
0/16, is madness. The Western world will simply refuse to bear
the overhead of 4 times the dataspace requirements to benefit the
few people making Chinese restaurant menus for use in Japan, and
who refuse to use a markup language to do it.

There are Western advocates, specifically those using US ASCII and
7 bit NRCS (National Replacement Character Sets) who advocate UTF-7
and UTF-8 encoding so that they don't have to change their existing
data files to have their code support Japanese or Chinese.

There's no real unified computing infrastructure in Japan (it being
broken into vendor specific hardware markets), and that makes it a
lot of expense for very little potential market. It's going to be
hard enough to convince the US idiots that trading more RAM for
lower processing overhead is a good idea.

The use of wchar_t as a font index is ill considered. The font is
not the same as the character set, nor should it be. The index
should be based on the relative offset into the font, and use a
base+offset to deal with multiple fonts in a single rendering space.
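
On the storage argument above (8 bit ISO 8859-X data versus 16 and
32 bit representations), a rough way to put numbers on it is to count
bytes for the same text in each form. This is only a sketch: the sample
strings are made up, Python's "utf-16-be" and "utf-32-be" codecs stand
in for 16 and 32 bit wide characters, and "utf-8" stands in for UTF2 on
the assumption that UTF2 is the FSS-UTF encoding now called UTF-8.

```python
# Byte counts for the same text under several encodings.
# The sample strings are illustrative, not taken from any real data set.
SAMPLES = {
    "western": "na\u00efve caf\u00e9",  # every character fits in ISO 8859-1
    "japanese": "\u65e5\u672c\u8a9e",   # three kanji, outside ISO 8859-1
}

ENCODINGS = ("latin-1", "utf-8", "utf-16-be", "utf-32-be")


def sizes(text):
    """Return {encoding: byte count}, or None where the text cannot be encoded."""
    result = {}
    for name in ENCODINGS:
        try:
            result[name] = len(text.encode(name))
        except UnicodeEncodeError:
            result[name] = None  # not representable in this encoding
    return result


for label, text in SAMPLES.items():
    print(label, len(text), "characters:", sizes(text))
```

For the Western sample the 16 bit form is twice the 8 bit size and the
32 bit form is four times it; for the three kanji the 16 bit form is
smaller than the UTF-8 form.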

					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message