Date: Thu, 11 Jun 1998 20:33:33 +0900
From: Jun-ichiro itojun Itoh <itojun@iijlab.net>
To: Konstantin Chuguev <joy@urc.ac.ru>
Cc: Gary Kline <kline@tao.thought.org>, Terry Lambert <tlambert@primenet.com>, hackers@FreeBSD.ORG
Subject: Re: internationalization
Message-ID: <20418.897564813@coconut.itojun.org>
In-Reply-To: joy's message of Thu, 11 Jun 1998 15:00:16 +0600. <357F9CA0.F8F1DD61@urc.ac.ru>

>I see. Suppose it was made for saving space in the code table.
>And now, without external information about the language of the text,
>no one can properly render hieroglyphs.
>And I see ISO 2022 solves this problem for a plain text.
>But, although text/plain is very suitable for Email messages, for
>example,
>it is very difficult to index/search such documents without additional
>information (at least about language used), as different languages
>have different rules for sorting their letters/glyphs. Searching
>in multilingual documents is even more painful.
>How it can be realized with ISO 2022?
>I still think a flat character set table has many advantages in this
>case.
>Plus, as I said before, large database of each character's
>characteristics in Unicode.

Handling (searching/indexing) multilingual data and storing multilingual
data can be done by separate methods (and I prefer them to be orthogonal).
IMHO, for storing information we must retain as much information as
possible, so iso-2022 wins here (because it is fully multilingual, even
from the standpoint of Asian language users).

For searching, there are several ways:
1. Have some dictionary, or regular expressions, to unify the item to be
   searched. For example, the following regular expression should match
   all occurrences that mean "data":
        (data|datum)
   We can do this for multiple languages.
2. Have a canonical form, just for handling/searching. This can be
   Unicode maybe, or this can be wchar_t (rune_t for xpg4).
   Convert the source into canonical form, perform search/index over the
   canonical form, get the result, and dump the text in canonical form.

If you store the original information using a format that unifies part of
the information in the source (e.g. Unicode), you'll lose some very
important parts of the file, and the loss cannot be recovered.
For example, if you convert all the files you have into uppercase for
searching, you'll never recover the uppercase/lowercase information.
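A minimal sketch of approach (2) combined with the regular expression
from (1), written in Python rather than with the xpg4 C library. It
assumes case folding as a stand-in for the canonical form, and
canonical() and find_canonical() are hypothetical helper names. The
stored text is left untouched; the canonical form exists only while
matching:

```python
import re

def canonical(text):
    # Hypothetical canonical form: case folding only (an assumption made
    # for this sketch). It is lossy on purpose, so it is never stored.
    return text.casefold()

def find_canonical(stored, pattern):
    """Match against the canonical form, return slices of the original."""
    folded = canonical(stored)
    # Offsets line up only because case folding keeps the length of this
    # ASCII sample unchanged; that does not hold for every script.
    assert len(folded) == len(stored)
    return [stored[m.start():m.end()] for m in re.finditer(pattern, folded)]

stored = "Data arrived. Each DATUM was logged with the other data."
hits = find_canonical(stored, r"\b(data|datum)\b")
print(hits)  # ['Data', 'DATUM', 'data']

# The canonical form alone cannot tell these two apart any more.
print(canonical("DATUM") == canonical("datum"))  # True
```

Because only the search runs on the folded copy, the hits keep their
original case. Had the folded copy been stored instead, "DATUM" and
"datum" would be indistinguishable.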
Unicode's unification is quite similar to that uppercase conversion, for
Asian language speakers (especially people targeting multilingual use).

The xpg4 (runelocale) library provides a beautiful way of establishing (2)
above. You can have a source file with ANY encoding you prefer.
If you set the environment variable LANG (setenv LANG ja_JP.EUC), the rune
library will convert everything into wchar_t on read, via functions like
fgetrune(). Your program takes care of wchar_t only, and you can output
the result in the original encoding via fputrune().

The beauty here is that the mapping between the source file and wchar_t
can be switched by the environment variable LANG. It is not fixed, so we
can be open about the internal encoding of wchar_t. The currently
implemented xpg4 library uses 16bit UCS2 for LANG=UTF2, and a 16bit packed
EUC form for LANG=ja_JP.EUC. My library uses a 32bit packed form for
importing iso-2022 encoded strings into 32bit wchar_t.

>I don't want to say we should stop using ISO 2022. I just want to say
>we shouldn't stop (should start) using Unicode. I.e. to use both
>of them, as both have their advantages and disadvantages.

Yes, I agree that Unicode can be useful in some places. But I do not
like Unicode being the encoding for data sources (and Unicode tends to be
pushed in that direction). That way an important portion of the
information will be lost.

>> Of course my library supports both of them. If you say
>> setrunelocale("UTF2"), the internal and external representation
>> will become Unicode. If you say setrunelocale("ja_JP.iso-2022-jp")
>> it will become the Japanese iso-2022-jp encoding.
>> I'll try to release my library with sample application sooner.
>> I think I can give you the tarball at New Orleans :-)
>Great.
>What about conversion?

For conversion, there seems to be a standard function defined, such as
iconv(3) or iconv_open(3).
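To illustrate the switchable mapping and the conversion step together,
here is a minimal sketch in Python. It is not the rune library and not
iconv(3): the CODECS table and the read_text(), write_text() and
convert() helpers are hypothetical, and Python's internal string form is
Unicode, so the sketch shows only how the external encoding can be chosen
by a locale name, not an internal form free of unification:

```python
# Hypothetical table from locale names (as used in this thread) to the
# names of Python's built-in codecs; an assumption for illustration.
CODECS = {
    "ja_JP.EUC": "euc_jp",
    "ja_JP.iso-2022-jp": "iso2022_jp",
}

def read_text(raw, lang):
    # External bytes -> internal form, mapping chosen by the locale name.
    return raw.decode(CODECS[lang])

def write_text(text, lang):
    # Internal form -> external bytes in the requested encoding.
    return text.encode(CODECS[lang])

def convert(raw, src_lang, dst_lang):
    # iconv-style conversion: decode with one mapping, encode with another.
    return write_text(read_text(raw, src_lang), dst_lang)

word = "\u65e5\u672c\u8a9e"  # three kanji meaning "Japanese language"
euc = write_text(word, "ja_JP.EUC")
jis = convert(euc, "ja_JP.EUC", "ja_JP.iso-2022-jp")

print(len(euc))                   # 6: two bytes per character in EUC-JP
print(jis.startswith(b"\x1b$B"))  # True: escape sequence selecting JIS X 0208
print(read_text(jis, "ja_JP.iso-2022-jp") == word)  # True: round trip
```

The program logic touches only the internal form; which bytes appear on
disk is decided entirely by the locale name passed in.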
I'm thinking of implementing iconv, but it requires me to have a giant
table, such as:
        iso-2022<->unicode with japanese glyphs
        iso-2022<->unicode with korean glyphs
        iso-2022<->unicode with chinese glyphs
        and more...
somewhere in the filesystem.

>Having an internationalized OS still require the ability of the user
>to communicate with other, non-internationalized parties with 8-bit
>or other character sets.

Maybe I'm not getting what you mean here...
For tagging the encoding method we have the charset parameter of the
Content-Type: MIME header. If the charset parameter is incompatible, the
mailer can notify the user of the incompatibility. Also there's the
multipart/alternative MIME type, so that the same content can be
transmitted in multiple encodings. We must also have a way to restrict
some text to conform to some defined charset (say, charset=iso-2022-jp).

Or, do you mean how to literally convert Japanese/Chinese into ASCII?
Yes, there are several ways, such as ROMA-JI for Japanese (I can write
Japanese words in ASCII: "Fujiyama" "Geisha" "Sushi" "Harakiri"),
or pinyin for Chinese (correct?).

itojun