Date: Thu, 11 Jun 1998 23:07:43 +0000 (GMT)
From: Terry Lambert <tlambert@primenet.com>
To: itojun@iijlab.net (Jun-ichiro itojun Itoh)
Cc: joy@urc.ac.ru, kline@tao.thought.org, tlambert@primenet.com, hackers@FreeBSD.ORG
Subject: Re: internationalization
Message-ID: <199806112307.QAA00116@usr09.primenet.com>
In-Reply-To: <20418.897564813@coconut.itojun.org> from "Jun-ichiro itojun Itoh" at Jun 11, 98 08:33:33 pm
> 	Handling (searching/indexing) multilingual data and storing
> 	multilingual data can be done by separate methods (and I prefer
> 	them to be orthogonal).

I prefer collation sequence to be handled by a separate method, as well.

> 	IMHO, for storing information we must retain as much information
> 	as possible, so iso-2022 wins here (because it is fully
> 	multilingual, even from the standpoint of Asian language users).

ISO 2022 is inferior to SGML.

> 	If you store the original information using a format that unifies
> 	part of the information in the source (e.g. Unicode), you'll lose
> 	some of the very important parts of the file, and the loss will
> 	never be recovered.

Not if you store the file in marked-up format.  Unless you are arguing
that you can't store SGML tags in Unicode, but you *can* store them in
ISO 8859-1?

> 	For example, if you convert all the files you have into uppercase
> 	for searching, you'll never recover the uppercase/lowercase
> 	information.

This is why you compile your regular expressions: to save the expense
of the duplication and conversion, and to avoid damaging the original
data.  (See the regcomp() sketch below.)

> 	Unicode's unification is quite similar to this, for Asian
> 	language speakers (especially multilingual-targeted people).

This is why there are round-trip character sets, and why locale
information is still required.

> 	The xpg4 (runelocale) library provides a beautiful way of
> 	establishing (2) in the above.  You can have a source file with
> 	ANY encoding you prefer.  If you set the environment variable
> 	LANG (setenv LANG ja_JP.EUC), the rune library will convert
> 	everything into wchar_t on read, via functions like fgetrune().

This is *NOT* beautiful, unless you are in the business of selling very
fast microprocessors to people who already own fast microprocessors.
Trading markup for a storage encoding that doesn't match the processing
encoding is a bad trade.  It increases processing overhead drastically,
for no real gain.  (See the fgetrune() sketch below for where the
overhead comes from.)

> 	>I don't want to say we should stop using ISO 2022.  I just want
> 	>to say we shouldn't stop (should start) using Unicode.  I.e., to
> 	>use both of them, as both have their advantages and disadvantages.
>
> 	Yes, I agree that Unicode can be useful in some places.  But I do
> 	not like Unicode being the encoding for data sources (and Unicode
> 	tends to be stressed toward that).  That way an important portion
> 	of the information will be lost.

Not if you encode it in-band, like the standard says you are supposed
to do, using a markup language (preferably a widely accepted standard,
such as SGML or the SGML DTD for RTF).  For example, tagging a run of
unified Han text with a language attribute lets a renderer recover the
Japanese vs. Chinese distinction that the code points alone no longer
carry.

> 	For conversion, there seems to be a standard function defined,
> 	such as iconv(3) or iconv_open(3).  I'm thinking of implementing
> 	this, but it requires me to have a giant table, such as:
> 		iso-2022 <-> unicode with japanese glyphs
> 		iso-2022 <-> unicode with korean glyphs
> 		iso-2022 <-> unicode with chinese glyphs
> 	and more... somewhere in the filesystem.

It is not required in the kernel, except in support of legacy systems
exporting FS's via NFS.  Even then, it's still not required to be in
the kernel, if you are willing to accept a latency penalty for
accessing legacy systems that you are too stubborn (or otherwise
unable) to upgrade like you should.  (The iconv(3) interface itself is
sketched below.)
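For the search case, here is a minimal sketch of what I mean by
compiling the pattern instead of converting the data.  The pattern
string is my own illustration; the interface is the standard POSIX
regcomp(3)/regexec(3):

	#include <regex.h>
	#include <stdio.h>

	int
	main(void)
	{
		regex_t re;
		char line[1024];

		/* Compile once; REG_ICASE folds case inside the automaton. */
		if (regcomp(&re, "sushi", REG_EXTENDED | REG_ICASE) != 0)
			return (1);

		/* The data itself is never uppercased or otherwise damaged. */
		while (fgets(line, sizeof(line), stdin) != NULL)
			if (regexec(&re, line, 0, NULL, 0) == 0)
				fputs(line, stdout);

		regfree(&re);
		return (0);
	}

The case folding is paid for once, at pattern compile time, not once
per character of data.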
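And here is roughly what the rune approach costs.  This is an untested
sketch from my recollection of the 4.4BSD rune(3) interface, so take
the details with a grain of salt:

	#include <locale.h>
	#include <rune.h>
	#include <stdio.h>

	int
	main(void)
	{
		rune_t r;
		long nrunes = 0;

		/* Pick up the storage encoding from LANG (e.g. ja_JP.EUC). */
		(void)setlocale(LC_ALL, "");

		/*
		 * Every single character is decoded from the storage
		 * encoding into a wide value on every read.  That
		 * per-character conversion is the overhead I am
		 * complaining about.
		 */
		while ((r = fgetrune(stdin)) != EOF) {
			if (r == _INVALID_RUNE)
				continue;	/* skip illegal sequences */
			nrunes++;
		}
		printf("%ld runes\n", nrunes);
		return (0);
	}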
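For reference, the X/Open conversion interface itojun mentions looks
like the following.  The codeset names and input string are
illustrative only; the names are whatever the local implementation
registers:

	#include <iconv.h>
	#include <stdio.h>

	int
	main(void)
	{
		iconv_t cd;
		char in[] = "plain ASCII is a subset of ISO 2022";
		char out[256];
		char *inp = in, *outp = out;
		size_t inleft = sizeof(in) - 1, outleft = sizeof(out);

		/*
		 * One conversion descriptor per source/target pair; this
		 * is where those giant tables get loaded from the
		 * filesystem -- in user space, note, not in the kernel.
		 */
		if ((cd = iconv_open("UTF-8", "ISO-2022-JP")) == (iconv_t)-1) {
			perror("iconv_open");
			return (1);
		}
		if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
			perror("iconv");
		fwrite(out, 1, sizeof(out) - outleft, stdout);
		iconv_close(cd);
		return (0);
	}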
> 	>Having an internationalized OS still requires the ability of the
> 	>user to communicate with other, non-internationalized parties
> 	>with 8-bit or other character sets.
>
> 	I may not be getting what you mean here...

NFS systems running an ISO 8859-1 character set are the most commonly
deployed case.  The data stream needs to be attributed in the kernel.

It is very tempting to attribute files as "text" and convert only text
files.  For UTF2 (16-bit wchar_t process encoded Unicode, or ISO 10646
0/16) encoded files, it is trivial to expand one page to two pages when
memory mapping the file (see the sketch at the end of this message).
It is a hell of a lot less trivial to memory map an EUC-encoded JIS-208
(or UTF-7/8 encoded Unicode) file using only code points 0x0000 to
0x00ff.  This is because there is neither a fixed expansion/contraction
ratio, nor a mechanism for faulting on non-page boundaries in most
modern processors.

I'm not willing to give up the ability to memory map files that
contain text.

> 	Or, do you mean how to literally convert Japanese/Chinese into
> 	ASCII?  Yes, there are several ways, such as ROMA-JI for Japanese
> 	(I can write Japanese words in ASCII: "Fujiyama" "Geisha" "Sushi"
> 	"Harakiri"), or pinyin for Chinese (correct?).

Or you can convert the Japanese words into Katakana instead, so long as
you are willing to use the appropriate 8-bit character set, and forsake
Kanji to do it.  The sword cuts both directions.
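As promised above, a minimal sketch of the fixed-ratio expansion case.
The page size, the types, and the 8-bit-storage-to-16-bit-process
direction are my illustrative assumptions, not anything mandated:

	#include <stddef.h>
	#include <stdint.h>

	#define PGSIZE	4096

	/*
	 * Fixed 1:2 expansion: one page of 8-bit ISO 8859-1 text always
	 * becomes exactly two pages of 16-bit units, so the fault
	 * handler can compute which file page backs any given mapped
	 * page with a shift.  No such fixed ratio exists for EUC or
	 * UTF-7/8, which is exactly the problem.
	 */
	void
	expand_page(const unsigned char *filepg, uint16_t *mappg)
	{
		size_t i;

		for (i = 0; i < PGSIZE; i++)
			mappg[i] = filepg[i];	/* ISO 8859-1 is U+0000..U+00FF */
	}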
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.