FreeBSD Mail Archives

Date:      Thu, 11 Jun 1998 15:00:16 +0600
From:      Konstantin Chuguev <joy@urc.ac.ru>
To:        Jun-ichiro itojun Itoh <itojun@iijlab.net>
Cc:        Gary Kline <kline@tao.thought.org>, Terry Lambert <tlambert@primenet.com>, hackers@FreeBSD.ORG
Subject:   Re: internationalization
Message-ID:  <357F9CA0.F8F1DD61@urc.ac.ru>
References:  <11417.897551055@coconut.itojun.org>


Jun-ichiro itojun Itoh wrote:
> 
> > Do you mean Unicode does not cover all the CJK characters?
> 
>         Unicode maps different Chinese/Japanese/Korean letters into the same
>         codepoint.  The actual appearance (glyph) will be determined by
>         the selection of font (so there will be a font just for Chinese,
>         a font just for Japanese, and a font just for Korean).
> 
>         Therefore, while it may be sufficient for supporting a single Asian
>         language (for example, Japanization), it is not sufficient for
>         multilingualization (C/J/K support at the same time).  With Unicode,
>         you will never be able to write a plaintext with C/J/K letters mixed.
>         For example, I frequently write such a plaintext: a list of dishes
>         at a Chinese restaurant, with a description in Japanese attached.
>         Such a plaintext cannot be generated with Unicode.
> 
I see. I suppose that was done to save space in the code table.
And now, without external information about the language of the text,
no one can render the ideographs properly.
I also see that ISO 2022 solves this problem for plain text.
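
As a rough illustration of the point above (a sketch in Python, not something from the original discussion): an ISO 2022 stream announces each national character set with an escape sequence, while the Unicode form of the same character is a bare code point with no language attached. The sample character U+76F4 and the helper name `show_designation` are chosen here purely for illustration.

```python
# U+76F4 is a unified Han character that is commonly cited as being
# drawn with different shapes in Chinese and Japanese fonts.
SAMPLE = "\u76f4"


def show_designation(text: str) -> bytes:
    """Encode text as ISO-2022-JP and return the raw bytes."""
    return text.encode("iso2022_jp")


encoded = show_designation(SAMPLE)

# ESC $ B designates JIS X 0208, so the byte stream itself says which
# national standard the following two bytes belong to.
print(encoded.startswith(b"\x1b$B"))

# The stream switches back to ASCII (ESC ( B) at the end.
print(encoded.endswith(b"\x1b(B"))

# After decoding to Unicode only the code point remains; the
# designation, and with it the language hint, is gone.
print(f"U+{ord(encoded.decode('iso2022_jp')):04X}")
```

A renderer that only sees the decoded code point has to get the language from somewhere else, which is the problem described above.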

But although text/plain is very suitable for e-mail messages, for example,
it is very difficult to index or search such documents without additional
information (at least about the language used), because different languages
have different rules for sorting their letters/glyphs. Searching
multilingual documents is even more painful.
How can this be done with ISO 2022?
I still think a flat character set table has many advantages in this case.
Plus, as I said before, Unicode comes with a large database of each
character's properties.
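
To give an idea of what such a per-character database offers, here is a minimal sketch using Python's standard `unicodedata` module, which exposes a copy of the Unicode Character Database. The helper name `properties` and the sample characters are assumptions made for this example.

```python
import unicodedata


def properties(ch: str) -> dict:
    """Collect a few per-character properties from the Unicode database."""
    return {
        "category": unicodedata.category(ch),            # e.g. Lu, Ll, Nd, Lo
        "bidirectional": unicodedata.bidirectional(ch),  # e.g. L, AN
        "east_asian_width": unicodedata.east_asian_width(ch),
        "decimal": unicodedata.decimal(ch, None),        # digit value, if any
    }


# Latin capital A, Cyrillic small ya, Arabic-Indic digit three, a Han character.
for ch in ("A", "\u044f", "\u0663", "\u76f4"):
    print(f"U+{ord(ch):04X}", properties(ch))
```

Indexing and searching code can branch on these properties without knowing which legacy character set a character originally came from.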

I am not saying we should stop using ISO 2022. I am only saying that
we should not rule out (and in fact should start) using Unicode, i.e. use
both of them, as each has its advantages and disadvantages.

>         Handling a bare iso-2022 string is somewhat hard to implement because
>         it is variable length (yes, I agree).  If we can provide a good library
>         for iso-2022, then there's no reason for us to migrate to Unicode.
> 
I think handling ISO 2022 texts for database purposes may require
converting the characters into some internal fixed-width table
in which every existing character has a unique code.
Then we get what is effectively a superset of Unicode.
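
The mechanics of such a conversion can be sketched as follows. This is only an illustration under a significant assumption: Python's one universal internal form is Unicode itself, so the sketch maps ISO-2022-JP to 32-bit Unicode code points and does not build the superset table described above. The function name `to_fixed_width` is hypothetical.

```python
import struct


def to_fixed_width(raw: bytes, encoding: str = "iso2022_jp") -> bytes:
    """Convert a stateful, variable-length stream to 4 bytes per character."""
    text = raw.decode(encoding)        # escape sequences are consumed here
    return text.encode("utf-32-be")    # one 32-bit unit per character, no BOM


# Mixed ASCII and Japanese text, as it would arrive in an ISO-2022-JP message.
raw = "FreeBSD \u65e5\u672c\u8a9e".encode("iso2022_jp")
fixed = to_fixed_width(raw)

# Every character now occupies exactly one 32-bit slot, so the n-th
# character sits at byte offset 4 * n and no shift state has to be tracked.
codes = struct.unpack(f">{len(fixed) // 4}I", fixed)
print([f"U+{c:04X}" for c in codes])
```

Random access and comparison become simple array operations on the fixed-width form, which is the property a database or index needs.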

As for those Chinese/Japanese/Korean ideographs that now look different
but share a common historical root: I agree that they should have
different character codes, if only because the Latin, Cyrillic and Greek
letters "A" are coded differently, although they share the same
historical root as well.
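
The three letters "A" can be checked directly. A small sketch in Python (the helper name `describe` is made up for this example):

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and the official Unicode name of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Latin A, Cyrillic A and Greek Alpha usually look identical in print.
letters = ("\u0041", "\u0410", "\u0391")
for ch in letters:
    print(describe(ch))

# Three distinct code points, despite the shared origin and shape.
print(len(set(letters)))
```

Unicode keeps these three apart, while the unified Han characters discussed above share a single code point across languages.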

We cannot perfectly describe any glyph's meaning without historical,
language and some other contexts. If any glyph has ambiguity in its
usage, this ambiguity has to be reflected in a database for
automatic processing.
One way is to give every variant of every glyph in every language its own
unique code. Another is to save space but develop additional algorithms
that distinguish the variants from the context provided. The truth is
somewhere in the middle.

I am not an expert in Unicode, just a very interested person.
We should probably consult the i18n teams of the various authorities.

> > >         Yes, for Japanese, Chinese and Korean the iso-2022 based model
> > >         (euc-xx falls into the category) is really important.  However, I
> > Why not support both ISO 2022 and Unicode? Yes, it is more difficult
> > to implement. But otherwise we can lose compatibility with other systems.
> 
>         Of course my library supports both of them.  If you say
>         setrunelocale("UTF2"), the internal and external representation
>         will become Unicode.  If you say setrunelocale("ja_JP.iso-2022-jp")
>         it will become the Japanese iso-2022-jp encoding.
> 
>         I'll try to release my library with a sample application soon.
>         I think I can give you the tarball at New Orleans :-)
> 
Great.
What about conversion?

Having an internationalized OS still requires that the user be able
to communicate with other, non-internationalized parties that use 8-bit
or other character sets.
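
What such a conversion layer has to do can be sketched in a few lines of Python. This is an illustration only: the function names `to_legacy` and `from_legacy` are hypothetical, KOI8-R is picked as an example 8-bit charset, and replacing unmappable characters with "?" is just one possible policy.

```python
def to_legacy(text: str, charset: str = "koi8_r") -> bytes:
    """Encode for an 8-bit peer, substituting '?' for unmappable characters."""
    return text.encode(charset, errors="replace")


def from_legacy(raw: bytes, charset: str = "koi8_r") -> str:
    """Decode bytes received from an 8-bit peer into internal text."""
    return raw.decode(charset)


msg = "\u041f\u0440\u0438\u0432\u0435\u0442"   # Russian "Privet" in Cyrillic
wire = to_legacy(msg)

# One byte per Cyrillic letter on the wire.
print(len(wire))

# The Cyrillic text survives the round trip.
print(from_legacy(wire) == msg)

# Characters outside KOI8-R cannot be represented and are replaced.
print(to_legacy("\u65e5\u672c"))
```

The lossy case is the important one: the conversion layer has to decide what happens to characters the other party's charset cannot carry.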

--
	Konstantin V. Chuguev.		System administrator of Southern
	http://www.urc.ac.ru/~joy/	Ural Regional Center of FREEnet,
	mailto:joy@urc.ac.ru		Chelyabinsk, Russia.

