Date: Wed, 12 Mar 1997 11:00:47 -0700 (MST)
From: Terry Lambert <terry@lambert.org>
To: jfieber@indiana.edu (John Fieber)
Cc: terry@lambert.org, pam@polynet.lviv.ua, hackers@freebsd.org
Subject: Re: Q: Locale - is it possible to change on the fly?
Message-ID: <199703121800.LAA27652@phaeton.artisoft.com>
In-Reply-To: <Pine.BSF.3.95q.970311220457.26807G-100000@fallout.campusview.indiana.edu> from "John Fieber" at Mar 11, 97 11:03:29 pm
> > Like Unicode, it is a tool for localization, not multinationalization;
> > tools for multinationalization don't really exist, per se, since their
> > application is limited to language researchers and translators.  The
>
> Huh?
>
> The Unicode 2.0 standard explicitly states multilingual computing
> as the primary goal of the development effort.  (First sentence in
> section 1.1: Design Goals.)
>
> The problem with locales is that they address the operating
> environment for software, but blindly assume it to be appropriate
> for whatever data is encountered.  Some dimensions of the locale
> may remain "local", but other parts need to be driven by the data,
> not the LANG environment variable.  For well-behaved MIME mail
> messages this can work pretty well, but it does not work in the
> general case.
>
> Unicode attempts to help out here by providing a locale-independent
> data coding scheme.  With an en_US.ISO_8859-1 locale, a document in
> Russian (KOI8-R) cannot be properly processed.  If I want to index
> it, how do I know what codes constitute word boundaries?  What if I
> want to combine Russian and French in the same index, or, heaven
> forbid, in the same document?  Now, if I had an en_US.UTF locale (I
> actually do, but it is a little buggy) and the Russian and French
> document was in Unicode, I could sensibly process it in a useful
> manner even though my preferred locale was different.

Unicode is a character encoding standard, not a font encoding
standard; because of this, Unicode can not simultaneously represent
Chinese and Japanese characters, it can only represent characters,
period.  The "Japanese-ness" or "Chinese-ness" of a character is a
font property, not a character property.  A character is not one of
its possible glyphs.

Likewise, Unicode can not encode the ligature relationships between
code points for ligatured languages, such as Arabic, Aramaic,
Sanskrit, Hebrew, Tamil, Devanagari, other Indic languages, or even,
to get down to brass tacks, cursive English.  It is a *character*
encoding standard.

The problem of representing multilingual documents is dealt with
using compounding, in an implementation-dependent fashion, to achieve
font encoding.  The compounding mechanism is beyond the scope of the
Unicode standard.

The closest you can get to multilingual support is to use a round
trip character set with font assignments for code points.  For
example, the ISO 8859-1 (Latin-1) character set can support several
languages at the same time in a given document; therefore, a Unicode
document can represent those same languages, because there are
round-trip code points for the characters in the 8859-1 standard.
Likewise, JIS 208 + JIS 212 can wholly support 21 separate languages.
But you can not encode Chinese and Japanese simultaneously, because
there is no common character set, with a defined round trip mapping
table, for doing that.
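To make the round-trip idea concrete: Latin-1 happens to occupy the
first 256 Unicode code points, so its round trip is the identity
mapping, and any code point outside that range simply has no Latin-1
character to come back to.  A minimal sketch (an illustration only,
not anything out of the standard; KOI8-R or the JIS sets would need
real mapping tables instead of the identity):

    #include <stdio.h>

    typedef unsigned short ucs2_t;  /* 16-bit Unicode code point */

    /* Latin-1 -> Unicode: the identity mapping. */
    ucs2_t
    latin1_to_ucs2(unsigned char c)
    {
            return ((ucs2_t)c);
    }

    /* Unicode -> Latin-1: only U+0000..U+00FF round trip. */
    int
    ucs2_to_latin1(ucs2_t u, unsigned char *out)
    {
            if (u > 0xFF)
                    return (-1);    /* no Latin-1 code point to return to */
            *out = (unsigned char)u;
            return (0);
    }

    int
    main(void)
    {
            unsigned char c;
            ucs2_t u;

            u = latin1_to_ucs2(0xE9);       /* LATIN SMALL LETTER E ACUTE */
            if (ucs2_to_latin1(u, &c) == 0)
                    printf("U+%04X round trips to 0x%02X\n", u, c);

            if (ucs2_to_latin1(0x4E00, &c) < 0)     /* a CJK ideograph */
                    printf("U+4E00 has no Latin-1 code point\n");

            return (0);
    }

A character set is "round trip" with respect to Unicode exactly when
the charset-to-Unicode mapping has an inverse like this; that is the
property the 8859-1 example above is trading on.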
> Multilingual applications limited to linguists?  I suspect there
> are plenty of people who know and use languages that don't share
> the same character encoding. :)  Unicode also provides a rich
> assortment of other things useful regardless of your language.

You misunderstand me...

Just because Unicode is useless, by itself, for multilingual
processing (it's a tool for localization of software to a specific
round-trip locale, with no additional modifications of the software)
does not mean that it is useless entirely.

> How many times have you seen web pages with the telltale signs of
> "smart quotes"?  Box drawing characters that are portable across
> platforms?  Wheee!  Math symbols?  Lots of people could use a
> richer set than + - / * and ^.

You can't use Unicode for this...  how can you attribute fonts on,
for instance, a Japanese www page about Chinese poetry?  Any
character sets which have mutually unified code points with different
glyphs can not be simultaneously represented without font
attribution.  The Unicode standard is not a glyph encoding standard.

> > best you can hope for is picking a single round-trip character set
> > that supports both your languages.  You will never find one of
> > these for, for example, Chinese and Japanese.
>
> I gather it is possible to round-trip CJK conversions through
> Unicode by utilizing the private use area.  I don't speak from
> direct experience on this, however.

Yes and no.  The private use area is too small for most scholarly
texts, because the round trip would require nearly 20,000 private use
characters (for instance, for a side-by-side representation of
Japanese and Chinese text in a Japanese textbook on "Chinese language
for advanced linguists").

The typical use for Unicode is as a storage representation of
locale-specific data, such that the actual encoding doesn't vary from
locale to locale.  In other words, it's a tool for localizing to a
single locale out of many possible locales, not for representing
multiple locales simultaneously active (the input issues, alone, for
something like that, would be prohibitive).

The closest you could come would be a tool for a translator
translating strings from one locale to another for the purpose of
moving software into the new locale -- and even then, you would
probably implement it as cooperating applications, instead of a
single application, each in their own locale.
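If you want to see that single-locale nature in code, here is a
minimal sketch (assuming a ru_RU.KOI8-R locale is actually installed
on the system, which it may not be; the name varies): the same byte
is a letter or not depending on the process locale, not on the data,
which is exactly why an en_US locale can't find the word boundaries
in a KOI8-R document.

    #include <ctype.h>
    #include <locale.h>
    #include <stdio.h>

    static void
    classify(unsigned char c)
    {
            printf("0x%02X is%s a letter in the \"%s\" locale\n",
                (unsigned)c, isalpha(c) ? "" : " not",
                setlocale(LC_CTYPE, NULL));
    }

    int
    main(void)
    {
            unsigned char c = 0xC1; /* CYRILLIC SMALL LETTER A in KOI8-R */

            setlocale(LC_CTYPE, "C");
            classify(c);

            if (setlocale(LC_CTYPE, "ru_RU.KOI8-R") == NULL) {
                    printf("no KOI8-R locale installed\n");
                    return (1);
            }
            classify(c);

            return (0);
    }

The LC_CTYPE setting is process-global state; you can't drive it from
the data in the middle of a mixed Russian-and-French file, which is
the indexing problem above in a nutshell.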
					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.