From: Gary Kline
Message-Id: <199806112338.QAA12958@tao.thought.org>
Subject: Re: internationalization
In-Reply-To: <357F9CA0.F8F1DD61@urc.ac.ru> from Konstantin Chuguev at "Jun 11, 98 03:00:16 pm"
To: joy@urc.ac.ru (Konstantin Chuguev)
Date: Thu, 11 Jun 1998 16:38:45 -0700 (PDT)
Cc: itojun@iijlab.net, tlambert@primenet.com, hackers@FreeBSD.ORG

According to Konstantin Chuguev:
> Jun-ichiro itojun Itoh wrote:
> >
> > > Do you mean Unicode does not cover all the CJK characters?
> >
> > Unicode maps different Chinese/Japanese/Korean letters into the same
> > codepoint. The actual appearance (glyph) will be determined by
> > the selection of font. (So there will be a font just for Chinese,
> > a font just for Japanese, and a font just for Korean.)
> >
> > Therefore, while it may be sufficient for supporting a single Asian
> > language (for example, Japanization), it is not sufficient for
> > multilingualization (C/J/K support at the same time). With Unicode,
> > you will never be able to write a plaintext with C/J/K letters mixed.
> > For example, I frequently write such a plaintext: a list of plates
> > for a Chinese restaurant, with descriptions in Japanese attached.
> > Such a plaintext cannot be generated with Unicode.
> >
> I see. I suppose it was done to save space in the code table.
> And now, without external information about the language of the text,
> no one can properly render hieroglyphs.
> And I see ISO 2022 solves this problem for plain text.
>
> But, although text/plain is very suitable for email messages, for
> example, it is very difficult to index/search such documents without
> additional information (at least about the language used), as different
> languages have different rules for sorting their letters/glyphs.
> Searching in multilingual documents is even more painful.
> How can this be done with ISO 2022?

This is an issue for me, too. Not immediately, but in several
months when I've finished the utility-messaging. Using iso-2022,
will I be able to collate the character sets? Or is this even
relevant?

> I still think a flat character set table has many advantages in this
> case.
> Plus, as I said before, there is the large database of each character's
> characteristics in Unicode.
>
> I don't want to say we should stop using ISO 2022. I just want to say
> we shouldn't stop (should start) using Unicode. I.e. to use both
> of them, as both have their advantages and disadvantages.
>

Yes! If we could use both of these major representations,
that would serve well. At least (or particularly) in the wchar_t
languages. Use ISO for messages, for text editors, and wherever
else. Use Unicode where it works better. ...It seems to me
somewhat like having to _choose_ between hex and decimal.
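
A minimal sketch of what using both representations side by side can look like. It assumes Python's built-in `iso2022_jp` and `utf-8` codecs as stand-ins for the C libraries discussed in this thread; Python is not part of the original discussion, and the names `text` and `code_points` are hypothetical. It shows an escape-sequence based ISO-2022-JP byte stream and a UTF-8 byte stream decoding to the same flat list of code points.

```python
# Illustrative sketch only: Python's built-in codecs stand in for the C
# libraries discussed in this thread. All names here are hypothetical.

def code_points(data, encoding):
    """Decode bytes in the given encoding into a flat list of Unicode code points."""
    return [ord(ch) for ch in data.decode(encoding)]

# "ABC" followed by the three kanji meaning "Japanese language"
text = "ABC\u65e5\u672c\u8a9e"

iso = text.encode("iso2022_jp")  # stateful: escape sequences switch character sets
utf8 = text.encode("utf-8")      # stateless encoding of Unicode code points

# The ISO-2022-JP bytes carry ESC $ B (enter JIS X 0208) and ESC ( B (back to ASCII).
print(iso)

# Both byte streams decode to the same fixed-width internal representation.
print([hex(cp) for cp in code_points(iso, "iso2022_jp")])
print(code_points(iso, "iso2022_jp") == code_points(utf8, "utf-8"))
```

The external encoding varies while the internal form stays a flat table, which is one way to read the "use both" suggestion. The sketch does not address the Han unification concern raised above: after decoding, nothing records whether a given ideograph came from a Chinese, Japanese, or Korean character set.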
> > Handling a bare iso-2022 string is somewhat hard to implement because
> > it is variable length (yes, I agree). If we can provide a good library
> > for iso-2022, then there's no reason for us to migrate to Unicode.
> >
> I think handling ISO 2022 texts for database purposes can require
> conversion of characters into some internal fixed-width table,
> where all existing characters have a unique code.
> Then we get a kind of superset of Unicode.
>
> For those Chinese/Japanese/Korean hieroglyphs which now look different
> but have a common historical root: I agree that they should have
> different character codes, at least because the Latin, Cyrillic and
> Greek letters "A" are coded differently, although they have the same
> historical root as well.
>
> We cannot perfectly describe any glyph's meaning without historical,
> language and some other contexts. If any glyph has ambiguity in its
> usage, this ambiguity has to be reflected in a database for
> automatic processing.
> One way is to code every glyph's variant for every language in the world
> uniquely. Another is to save space but develop additional algorithms
> for distinguishing variants from the context provided. The truth is
> somewhere in the middle.
>
> I am not an expert in Unicode, just a very interested person.
> Probably we should consult with the i18n teams of different authorities.
>
> > > > Yes, for Japanese, Chinese and Korean iso-2022 based model (euc-xx
> > > > falls into the category) is really important. However, I
> > > Why not support both ISO 2022 and Unicode? Yes, it is more difficult
> > > to implement. But otherwise we can lose compatibility with other systems.
> >
> > Of course my library supports both of them. If you say
> > setrunelocale("UTF2"), the internal and external representation
> > will become Unicode. If you say setrunelocale("ja_JP.iso-2022-jp")
> > it will become the Japanese iso-2022-jp encoding.
> >
> > I'll try to release my library with a sample application soon.
> > I think I can give you the tarball at New Orleans :-)
> >
> Great.
> What about conversion?
>
> Having an internationalized OS still requires the ability of the user
> to communicate with other, non-internationalized parties with 8-bit
> or other character sets.
> Is MIME a possible solution here?

A friend of mine currently studying in Japan sends me
mail (in English!), but my mailer/MUA can't understand
it. And I'm using MIME. So there are bugs.

If someone sends me mail in 8859-1 from a 2022-jp platform,
his kernel (or an optional driver) should probably do the
conversion.

gary

> --
> Konstantin V. Chuguev.         System administrator of Southern
> http://www.urc.ac.ru/~joy/     Ural Regional Center of FREEnet,
> mailto:joy@urc.ac.ru           Chelyabinsk, Russia.
>

--
Gary D. Kline         kline@tao.thought.org          Public service uNix