Date: Thu, 11 Jun 1998 15:00:16 +0600
From: Konstantin Chuguev <joy@urc.ac.ru>
To: Jun-ichiro itojun Itoh <itojun@iijlab.net>
Cc: Gary Kline <kline@tao.thought.org>, Terry Lambert <tlambert@primenet.com>, hackers@FreeBSD.ORG
Subject: Re: internationalization
Message-ID: <357F9CA0.F8F1DD61@urc.ac.ru>
References: <11417.897551055@coconut.itojun.org>

Jun-ichiro itojun Itoh wrote:
>
> > Do you mean Unicode does not cover all the CJK characters?
>
> Unicode maps different Chinese/Japanese/Korean letters into the same
> codepoint. The actual appearance (glyph) will be determined by
> the selection of font. (so, there will be font just for Chinese,
> font just for Japanese, and font just for Korean).
>
> Therefore, it may be sufficient for supporting single asian language
> (for example Japanization), it is not sufficient for
> multilingualization (C/J/K support at the same time). With Unicode,
> you will never be able to write a plaintext with C/J/K letters mixed.
> For example, I frequently write such a plaintext, for list of plates
> for chinese restaurant, with description in Japanese attached.
> Such a plaintext cannot be generated with Unicode.

I see. I suppose it was done to save space in the code table. And now,
without external information about the language of the text, no one can
properly render the hieroglyphs. And I see that ISO 2022 solves this
problem for plain text.

But although text/plain is very suitable for e-mail messages, for
example, it is very difficult to index or search such documents without
additional information (at least about the language used), as different
languages have different rules for sorting their letters/glyphs.
Searching in multilingual documents is even more painful. How can it be
done with ISO 2022? I still think a flat character set table has many
advantages in this case. Plus, as I said before, there is the large
database of each character's properties in Unicode.

I don't want to say we should stop using ISO 2022. I just want to say we
shouldn't stop (or rather, should start) using Unicode. That is, we
should use both of them, as both have their advantages and
disadvantages.

> Handling bare iso-2022 string is somewhat hard to implement because it
> is variable length (yes I agree). If we can provide a good library
> for iso-2022, then there's no reason for us to migrate to Unicode.
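
The variable-length point in the quoted paragraph above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the sample string and the variable names are made up here and do not come from the thread, and Python's built-in `iso2022_jp` codec stands in for whatever library would actually be used. It shows that the same text needs escape sequences in ISO-2022-JP, while decoding it gives one code point per character that could be stored in a fixed-width table.

```python
# Assumed sample text (not from the thread): Latin "A",
# HIRAGANA LETTER A (U+3042), Latin "B".
text = "A\u3042B"

# ISO-2022-JP is stateful: ESC $ B switches to JIS X 0208 and
# ESC ( B switches back to ASCII, so the byte count per character
# depends on what surrounds it.
encoded = text.encode("iso2022_jp")
print(len(text), "characters ->", len(encoded), "bytes:", encoded)

# Decoding yields one code point per character; these fit a
# fixed-width (for example 32-bit) internal representation.
code_points = [ord(ch) for ch in encoded.decode("iso2022_jp")]
print([f"U+{cp:04X}" for cp in code_points])

# The same text in a fixed-width encoding: always 4 bytes per character.
fixed_width = text.encode("utf_32_be")
print(len(fixed_width), "bytes in UTF-32")
```

Indexing into `fixed_width` is a simple multiplication by four, whereas finding the n-th character in `encoded` means scanning from the start and tracking the escape state.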

I think handling ISO 2022 texts for database purposes can require
conversion of characters into some internal fixed-width table, where all
existing characters have a unique code. Then we get just a kind of
superset of Unicode.

As for those Chinese/Japanese/Korean hieroglyphs which now look
different but have a common historical root: I agree that they should
have different character codes, if only because the Latin, Cyrillic and
Greek letters "A" are coded differently, although they have the same
historical root as well.

We cannot perfectly describe any glyph's meaning without historical,
language and some other context. If any glyph has ambiguity in its
usage, this ambiguity has to be reflected in a database for automatic
processing. One way is to code every variant of every glyph for every
language in the world uniquely. Another is to save space but develop
additional algorithms for distinguishing variants from the context
provided. The truth is somewhere in the middle.

I am not an expert in Unicode, just a very interested person. Probably
we should consult the i18n teams of different authorities.

> > > Yes, for Japanese, Chinese and Korean iso-2022 based model (euc-xx
> > > falls into the category) is really important. However, I
> > Why not to support both ISO 2022 and Unicode? Yes, it is more difficult
> > to implement. But otherwise we can lose compatibility with other systems.
>
> Of course my library supports both of them. If you say
> setrunelocale("UTF2"), the internal and external representation
> will become Unicode. If you say setrunelocale("ja_JP.iso-2022-jp")
> it will become Japanese iso-2022-jp encoding.
>
> I'll try to release my library with sample application sooner.
> I think I can give you the tarball at New Orleans :-)

Great. What about conversion? Having an internationalized OS still
requires that the user be able to communicate with other,
non-internationalized parties that use 8-bit or other character sets.

--
Konstantin V. Chuguev.
System administrator of Southern Ural Regional Center of FREEnet,
Chelyabinsk, Russia.
http://www.urc.ac.ru/~joy/
mailto:joy@urc.ac.ru

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message
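
Two points from the message above, the separately coded Latin, Cyrillic and Greek "A" and the question about conversion to 8-bit character sets, can be sketched in Python. This is a hedged illustration only: `to_8bit` is a hypothetical helper invented here, KOI8-R is chosen merely as an example of an 8-bit set, and replacing unrepresentable characters with "?" is one possible policy among several.

```python
import unicodedata

# Latin, Cyrillic and Greek capital "A": similar shape, three code points.
mixed = "A\u0410\u0391"
for ch in mixed:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")


def to_8bit(text, charset):
    """Hypothetical helper: convert text to an 8-bit charset.

    Characters the target set cannot represent are replaced with '?',
    so the conversion is lossy by design.
    """
    return text.encode(charset, errors="replace")


# KOI8-R has Latin and Cyrillic letters but no Greek ones,
# so the Greek alpha is lost in the conversion.
koi8 = to_8bit(mixed, "koi8_r")
print(koi8)

# Converting back shows what the non-internationalized party would see.
round_trip = koi8.decode("koi8_r")
print(round_trip)
```

A real converter would probably want a smarter fallback than "?", such as transliteration, but that choice depends on the receiving side and is outside what the thread specifies.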