From: Gary Kline <kline@tao.thought.org>
Message-Id: <199806112338.QAA12958@tao.thought.org>
Subject: Re: internationalization
In-Reply-To: <357F9CA0.F8F1DD61@urc.ac.ru> from Konstantin Chuguev at "Jun 11, 98 03:00:16 pm"
To: joy@urc.ac.ru (Konstantin Chuguev)
Date: Thu, 11 Jun 1998 16:38:45 -0700 (PDT)
Cc: itojun@iijlab.net, tlambert@primenet.com, hackers@FreeBSD.ORG
Organization: <> thought.org: public access uNix in service... <>

According to Konstantin Chuguev:
> Jun-ichiro itojun Itoh wrote:
> > 
> > > Do you mean Unicode does not cover all the CJK characters?
> > 
> >         Unicode maps different Chinese/Japanese/Korean letters into the same
> >         codepoint.  The actual appearance (glyph) will be determined by
> >         the selection of font (so there will be a font just for Chinese,
> >         a font just for Japanese, and a font just for Korean).
> > 
> >         Therefore, although it may be sufficient for supporting a single
> >         Asian language (for example, Japanization), it is not sufficient for
> >         multilingualization (C/J/K support at the same time).  With Unicode,
> >         you will never be able to write a plaintext with C/J/K letters mixed.
> >         For example, I frequently write such a plaintext: a list of dishes
> >         for a Chinese restaurant, with a description in Japanese attached.
> >         Such a plaintext cannot be generated with Unicode.
> > 
> I see. I suppose that was done to save space in the code table.
> And now, without external information about the language of the text,
> no one can properly render the ideographs.
> And I see that ISO 2022 solves this problem for plain text.
> 
> But although text/plain is very suitable for email messages, for
> example, it is very difficult to index/search such documents without
> additional information (at least about the language used), as different
> languages have different rules for sorting their letters/glyphs.
> Searching in multilingual documents is even more painful.
> How can that be done with ISO 2022?


		This is an issue for me, too.  Not immediately, but
		in several months when I've finished the utility-messaging.

		Using iso-2022, will I be able to collate the 
		character sets?  Or is this even relevant? 
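		One way to picture the collation question, as a minimal
		sketch only: it assumes Python's standard iso2022_jp codec
		(not anything discussed in this thread) and made-up sample
		strings.  Raw ISO-2022-JP bytes carry escape sequences, so
		a bytewise comparison orders strings by their escapes;
		decoding to a flat code-point form first gives a stable
		ordering, though still not a language-aware one.

```python
# Hypothetical sample words: "abc", U+65E5 U+672C, U+6771 U+4EAC.
WORDS = ["\u65e5\u672c", "abc", "\u6771\u4eac"]


def sort_by_bytes(words, encoding="iso2022_jp"):
    """Sort the encoded byte strings directly, escape sequences included."""
    return sorted(w.encode(encoding) for w in words)


def sort_by_code_point(words, encoding="iso2022_jp"):
    """Round-trip through the encoding, then sort by Unicode code point."""
    decoded = [w.encode(encoding).decode(encoding) for w in words]
    return sorted(decoded)


if __name__ == "__main__":
    print(sort_by_bytes(WORDS))
    print(sort_by_code_point(WORDS))
```

		Neither ordering is a real per-language collation; that
		still needs language-specific rules layered on top of
		whichever internal form is chosen.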


> I still think a flat character set table has many advantages in this
> case.
> Plus, as I said before, Unicode has a large database of each
> character's characteristics.
> 
> I don't want to say we should stop using ISO 2022. I just want to say
> we shouldn't stop (should start) using Unicode, i.e. use both
> of them, as both have their advantages and disadvantages.
> 

		Yes!  If we could use both of these major 
		representations, that would serve well.  At
		least (or particularly) in the wchar_t languages.
		Use ISO for messages, for text-editors, and
		wherever else.  Use Unicode where it works
		better.   ...It seems to me somewhat like having
		to _choose_ between hex and decimal.
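		A small sketch of the "use both" idea, assuming Python's
		standard codecs and a made-up three-character sample: the
		same text can travel as ISO-2022-JP or as UTF-8 and still
		decode to one flat list of code points, roughly what a
		wchar_t array would hold.

```python
def to_code_points(data, encoding):
    """Decode bytes in the given encoding into a list of integer code points."""
    return [ord(ch) for ch in data.decode(encoding)]


TEXT = "A\u65e5B"                    # hypothetical sample: Latin, Kanji, Latin
AS_JIS = TEXT.encode("iso2022_jp")   # stateful, carries escape sequences
AS_UTF8 = TEXT.encode("utf-8")       # stateless, variable width

if __name__ == "__main__":
    print(AS_JIS, AS_UTF8)
    print(to_code_points(AS_JIS, "iso2022_jp"))
    print(to_code_points(AS_UTF8, "utf-8"))
```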


> >         Handling a bare iso-2022 string is somewhat hard to implement because it
> >         is variable length (yes, I agree).  If we can provide a good library
> >         for iso-2022, then there's no reason for us to migrate to Unicode.
> > 
> I think handling ISO 2022 texts for database purposes can require
> conversion of characters into some internal fixed width table,
> where all existing characters have a unique code.
> Then we just get a kind of superset of Unicode.
> 
> As for those Chinese/Japanese/Korean ideographs that now look
> different but have a common historical root: I agree that they should
> have different character codes, if only because the Latin, Cyrillic
> and Greek letters "A" are coded differently, although they have the
> same historical root as well.
> 
> We cannot perfectly describe any glyph's meaning without historical,
> language and some other contexts. If any glyph has ambiguity in its
> usage, this ambiguity has to be reflected in a database for
> automatic processing.
> One way is to code every glyph's variant for every language in the world
> uniquely. Another is to save space but develop additional algorithms
> for distinguishing variants from the context provided. The truth is
> somewhere in the middle.
> 
> I am not an expert in Unicode, just a very interested person.
> Probably we should consult with the i18n teams of different authorities.
> 
> > > >         Yes, for Japanese, Chinese and Korean iso-2022 based model (euc-xx
> > > >         falls into the category) is really important.  However, I
> > > Why not support both ISO 2022 and Unicode? Yes, it is more difficult
> > > to implement. But otherwise we can lose compatibility with other systems.
> > 
> >         Of course my library supports both of them.  If you say
> >         setrunelocale("UTF2"), the internal and external representation
> >         will become Unicode.  If you say setrunelocale("ja_JP.iso-2022-jp"),
> >         it will become the Japanese iso-2022-jp encoding.
> > 
> >         I'll try to release my library with a sample application soon.
> >         I think I can give you the tarball at New Orleans :-)
> > 
> Great.
> What about conversion?
> 
> Having an internationalized OS still requires the ability of the user
> to communicate with other, non-internationalized parties with 8-bit
> or other character sets.
> 


		Is MIME a possible solution here?  

		A friend of mine currently studying in Japan sends
		me mail (in English!), but my mailer/MUA can't
		understand it.  And I'm using MIME.  So there are
		bugs.  
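		For what MIME is meant to provide here, a hedged sketch
		using Python's standard email package (an assumption for
		illustration, not the mailer in question): the charset
		label in Content-Type tells the reader how to decode the
		body.  The sample message is made up.

```python
from email import message_from_bytes


def decode_body(raw_message):
    """Return (charset, text) for a single-part text message."""
    msg = message_from_bytes(raw_message)
    charset = msg.get_content_charset() or "us-ascii"
    # Bytes of the body after undoing any content-transfer-encoding.
    payload = msg.get_payload(decode=True)
    return charset, payload.decode(charset, errors="replace")


# Hypothetical message: English text plus two Kanji, labelled iso-2022-jp.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=iso-2022-jp\r\n"
    b"Content-Transfer-Encoding: 7bit\r\n"
    b"\r\n"
    + "Hello \u65e5\u672c".encode("iso2022_jp")
    + b"\r\n"
)

if __name__ == "__main__":
    print(decode_body(SAMPLE))
```

		A mailer that ignores or lacks the labelled charset would
		show the escape sequences as garbage even when the text is
		mostly English.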

		If someone sends me mail in 8859-1 from a 2022-jp
		platform, his kernel (or an optional driver)
		should probably do the conversion.
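		Wherever that conversion ends up living, its core can be
		sketched in a few lines.  This assumes Python's standard
		codecs; the helper name transcode is hypothetical.  Note
		that characters with no 8859-1 equivalent cannot survive
		the trip and are replaced with "?" here.

```python
def transcode(data, src, dst, errors="replace"):
    """Re-encode bytes from src to dst; unmappable characters follow `errors`."""
    return data.decode(src).encode(dst, errors=errors)


if __name__ == "__main__":
    # Plain English text passes through unchanged.
    print(transcode(b"hello", "iso2022_jp", "latin-1"))
    # A Kanji (U+65E5) has no ISO 8859-1 code, so it is replaced.
    print(transcode("\u65e5".encode("iso2022_jp"), "iso2022_jp", "latin-1"))
```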

		gary


> --
> 	Konstantin V. Chuguev.		System administrator of Southern
> 	http://www.urc.ac.ru/~joy/	Ural Regional Center of FREEnet,
> 	mailto:joy@urc.ac.ru		Chelyabinsk, Russia.
> 


-- 
   Gary D. Kline         kline@tao.thought.org          Public service uNix

