From owner-freebsd-hackers Thu Jun 11 14:58:00 1998
Return-Path:
Received: (from majordom@localhost)
	by hub.freebsd.org (8.8.8/8.8.8) id OAA09995
	for freebsd-hackers-outgoing; Thu, 11 Jun 1998 14:58:00 -0700 (PDT)
	(envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from smtp04.primenet.com (daemon@smtp04.primenet.com [206.165.6.134])
	by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id OAA09952
	for ; Thu, 11 Jun 1998 14:57:12 -0700 (PDT)
	(envelope-from tlambert@usr09.primenet.com)
Received: (from daemon@localhost)
	by smtp04.primenet.com (8.8.8/8.8.8) id OAA27166;
	Thu, 11 Jun 1998 14:56:51 -0700 (MST)
Received: from usr09.primenet.com(206.165.6.209) via SMTP
	by smtp04.primenet.com, id smtpd027150; Thu Jun 11 14:56:48 1998
Received: (from tlambert@localhost)
	by usr09.primenet.com (8.8.5/8.8.5) id OAA26420;
	Thu, 11 Jun 1998 14:56:40 -0700 (MST)
From: Terry Lambert
Message-Id: <199806112156.OAA26420@usr09.primenet.com>
Subject: Re: internationalization
To: itojun@itojun.org (Jun-ichiro itojun Itoh)
Date: Thu, 11 Jun 1998 21:56:40 +0000 (GMT)
Cc: hackers@FreeBSD.ORG
In-Reply-To: <6351.897526003@coconut.itojun.org> from "Jun-ichiro itojun Itoh" at Jun 11, 98 09:46:43 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> >> I would prefer going to a full-on Unicode implementation to support
> >> all known human languages.
> > This was my first leaning, but I'm increasingly
> > going toward the ISO families.
>
> 	Yes, iso-2022 families are quite important for supporting
> asian languages.  Unicode is, for us Japanese, quite incomplete and
> unexpandable.

There are valid objections to Unicode, but they are couched in
technical issues that do not apply to Japanese information processing,
and so they are not the issues the Japanese raise as objections.
These issues are:

1)	There is an inherent bias against fixed-cell rendering
	technologies in the Unicode standard.

	Specifically, there is an apparent bias toward requiring the
	display system to contain a proprietary rendering technology
	-- in particular, PostScript and related technologies that
	result in licensing fees being paid to consortium members.

	This bias exists for ligatured languages -- that is, it
	exists for alphabetic languages, not for ideogrammatic
	languages like Japanese.  The problem is that ligatures
	change the glyph rendering, and there are no interspersed
	"private use" code points that can be overloaded in order to
	create a fixed-font rendering that doesn't depend on
	processing the ligatures at the rendering device.  This makes
	it difficult to support ligatured languages on X devices.

	Examples of ligatured languages: Tamil, Devanagari, Arabic,
	script Hebrew, script English, script German, etc.

	This issue can be worked around, either by "caving in" and
	paying the license fees for PostScript, or by doing a lot of
	work (as "xtamil" demonstrates).

2)	The use of 16-bit rather than 8-bit characters introduces
	synchronization issues for ttys, ptys, pipes, serial ports,
	byte-stream files, and other byte-stream oriented devices.

	This is resolvable through the use of wchar_t internally, and
	the use of reliable-delivery protocol encapsulation of the
	byte-streams, externally (see the sketch after this list).

3)	The common recommended encoding (generally espoused by the
	US-ASCII-using Unicode Consortium members) is UTF-7/UTF-8, on
	the theory that existing ASCII documents will not need
	conversion and/or attribution.

	This breaks fixed-field-length input mechanisms, fixed-field
	record implementations, and character (rather than byte)
	input method mechanisms (such as those used by X).  It breaks
	the ability to do record counting by dividing file size by
	record size.  It breaks the utility of memory-mapping files.
	It damages compressibility.  It weakens cryptographic
	standards by providing another vector for statistical
	analysis, based on common prefix bit patterns.  It greatly
	complicates most word-counting mechanisms, most
	protocol-based content interchanges, and any other place
	where the encoding must be converted into an internal
	representation.  It increases all processing overhead, due to
	the need to convert between the encoded form and the "raw"
	representation that is more useful for computing tasks.

	This is resolvable by storing the raw representation rather
	than the encoded form, despite the objections of the ASCII
	bigots.
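To make the fix in (2) concrete, and to show the conversion step
whose cost (3) complains about, here is a minimal sketch.  It assumes
the ISO C/POSIX multibyte conversion routines and an external
encoding taken from the locale; the buffer sizes are arbitrary:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Keep the variable-length encoding at the byte-stream boundary;
 * do all internal work on fixed-size wchar_t cells.
 */
int
main(void)
{
	char	mb[1024];	/* external byte-stream form */
	wchar_t	wc[1024];	/* internal fixed-cell form */
	size_t	n;

	(void)setlocale(LC_CTYPE, "");	/* encoding comes from the locale */

	if (fgets(mb, sizeof(mb), stdin) == NULL)
		return (1);

	/* Convert exactly once, on the way in. */
	n = mbstowcs(wc, mb, sizeof(wc) / sizeof(wc[0]));
	if (n == (size_t)-1)
		return (1);	/* bad sequence; the stream lost sync */

	/* ... all internal processing operates on wc[0..n) ... */

	/* Convert exactly once, on the way out. */
	if (wcstombs(mb, wc, sizeof(mb)) == (size_t)-1)
		return (1);
	(void)fputs(mb, stdout);
	return (0);
}

The conversions happen at the boundary and nowhere else; everything
between them sees fixed-size cells, which is the whole point of using
wchar_t internally.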
The Japanese don't have a ligatured language, they don't use anything
but byte-encoded data, and they are already used to putting up with
the slings and arrows associated with indeterminate storage encoding
length.

The main arguments that have been put forth by the Japanese
representatives to the Unicode Consortium are rather specious:

1)	You can't simultaneously represent text that needs to be
	rendered with alternate glyphs but which has unified code
	points.

	This is a valid criticism, if what you are building is a
	translation workbench between languages which do not share a
	common character set, or if you are engaging in linguistic
	scholarship.  This same criticism, however, is just as valid
	when you level it against ISO 2022.

	The answer is to use a markup language of some kind to do the
	font selection -- for example, SGML, or any of a dozen SGML
	DTD's (such as O'Reilly's "DocBook").  So while the criticism
	is valid, no other standard has been suggested as a
	workaround for in-band representation of character set
	selection.

	It seems to me that the common opinion among the other
	consortium members is that this is a straw man in support of
	other, less rational objections.

2)	You can't separate document content based on language, given
	only a raw Unicode document.

	This is a valid criticism as well, if what you are building
	is a translation workbench between languages which do not
	share a common character set, or if you are engaging in
	linguistic scholarship.  Once again, the criticism is equally
	valid against all other standards, and no suggestion has been
	made to resolve it, save the use of a markup language.

	It seems to me that the common opinion among the other
	consortium members is that this is a straw man in support of
	the irrational desire to be able to "grep -v" out all text in
	a compound document that is not Japanese.  That is, the
	opinion is that there is no technical basis for this
	objection.

3)	The lexical sequence of the character set is classified in
	what has been termed "Chinese Dictionary Order".

	This criticism is based on the irrational fear that Japanese
	text processing is somehow disadvantaged compared to that of
	other nationalities, specifically the Chinese, when it comes
	to being able to use the ordinal value of a character to do
	sorting.  This objection is irrational for a number of
	reasons:

	a)	The ordering is "stroke-radical"; this means that the
		order is *not* sufficient for correct lexical
		ordering of Chinese.

	b)	Japan has two dictionary orders.  It is impossible to
		select a single order and thus silence every possible
		Japanese objection.

	c)	Code page 0/8 of the Unicode standard (0/0/0/8 of
		ISO 10646) is in ISO 8859-1 order; Japan is thus not
		the only country which must employ separate collation
		tables:

		i)	Countries whose native character set is
			ISO 8859-X, where X != 1, must use a separate
			table.

		ii)	Countries whose native character set is a de
			facto rather than an ISO standard (such as
			KOI8-R and KOI8-U in the former Soviet
			republics) must use a separate table.

		iii)	Countries where there are multiple lexical
			orderings (such as German telephone book
			vs. German dictionary ordering of umlauted
			characters) must use a separate table.

		iv)	Countries that have problems to solve that
			occur only in alphabetic languages and not in
			ideogrammatic languages (such as
			case-insensitive collation in the United
			States) must use a separate table.

	d)	The JIS-208 ordering is not altered by the JIS-212
		extensions.

The Japanese representatives have not suggested an alternate
character classification algorithm that can encompass the unified
glyphs, including the non-Japanese glyphs in the CJK unification, yet
still result in JIS-208 lexical ordering of purely Japanese text.  In
other words, they haven't solved the problem, yet they refuse to let
anyone else solve it in a way which conflicts with existing partial
solutions of Japanese origin.
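Since collation tables keep coming up in (c) above, it is worth
showing what "must use a separate table" means in code.  This is a
minimal sketch, assuming the standard strcoll() routine and a locale
that supplies its own LC_COLLATE table; the sample strings are
arbitrary:

#include <locale.h>
#include <stdio.h>
#include <string.h>

/*
 * Ordinal order vs. collation-table order.  strcmp() compares raw
 * encoding values; strcoll() compares according to the LC_COLLATE
 * table of the current locale.
 */
int
main(void)
{
	const char *a = "Zebra";
	const char *b = "apple";

	(void)setlocale(LC_COLLATE, "");	/* load the locale's table */

	/* Ordinal: 'Z' (0x5a) sorts before 'a' (0x61). */
	printf("strcmp:  %d\n", strcmp(a, b));

	/* Table-driven: dictionary order may ignore case entirely. */
	printf("strcoll: %d\n", strcoll(a, b));
	return (0);
}

In the "C" locale the two agree; in a United States locale with
case-insensitive collation, strcoll() sorts "apple" ahead of "Zebra"
while strcmp() does the opposite.  The table, not the ordinal value
of the code point, does the work, no matter whose dictionary order
the code points happen to follow.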
Unicode is a tool for Internationalization.  Internationalization is
the process of creating code that allows data-driven localization to
a single locale or, more broadly, to a single round-trip character
set.

Internationalization is *NOT* the process of creating code that can
simultaneously process documents containing text in several
non-subset round-trip character sets (for example, a Japanese
language teaching text written in the Indic script Devanagari, or in
Arabic).  That process is called "multinationalization".

The utility of multinationalized software is limited to linguistic
scholarship, human translation processing, and similar pursuits.  It
is an acceptable trade-off to require the authors of such tools to
bear the additional cost of processing a markup language, in order to
simplify the requirements for the *VAST* majority of applications
that do not require multinationalization.

If multinationalization is truly the real issue for the Japanese, or
for anyone else for that matter, they are free to petition the ISO
for allocation of ISO 10646 code pages other than page 0, which is
now allocated for use by Unicode.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message