From owner-freebsd-hackers Thu Jun 11 14:58:00 1998
Return-Path:
Received: (from majordom@localhost)
	by hub.freebsd.org (8.8.8/8.8.8) id OAA09995
	for freebsd-hackers-outgoing; Thu, 11 Jun 1998 14:58:00 -0700 (PDT)
	(envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from smtp04.primenet.com (daemon@smtp04.primenet.com [206.165.6.134])
	by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id OAA09952
	for ; Thu, 11 Jun 1998 14:57:12 -0700 (PDT)
	(envelope-from tlambert@usr09.primenet.com)
Received: (from daemon@localhost)
	by smtp04.primenet.com (8.8.8/8.8.8) id OAA27166;
	Thu, 11 Jun 1998 14:56:51 -0700 (MST)
Received: from usr09.primenet.com(206.165.6.209) via SMTP
	by smtp04.primenet.com, id smtpd027150; Thu Jun 11 14:56:48 1998
Received: (from tlambert@localhost)
	by usr09.primenet.com (8.8.5/8.8.5) id OAA26420;
	Thu, 11 Jun 1998 14:56:40 -0700 (MST)
From: Terry Lambert
Message-Id: <199806112156.OAA26420@usr09.primenet.com>
Subject: Re: internationalization
To: itojun@itojun.org (Jun-ichiro itojun Itoh)
Date: Thu, 11 Jun 1998 21:56:40 +0000 (GMT)
Cc: hackers@FreeBSD.ORG
In-Reply-To: <6351.897526003@coconut.itojun.org> from "Jun-ichiro itojun Itoh" at Jun 11, 98 09:46:43 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> >> I would prefer going to a full-on Unicode implementation to support
> >> all known human languages.
> > This was my first leaning, but I'm increasingly
> > going toward the ISO families.
>
> 	Yes, iso-2022 families are quite important for supporting
> asian languages.  Unicode is, for us Japanese, quite incomplete and
> unexpandable.

There are valid objections to Unicode, but they are couched in
technical issues that do not apply to Japanese information processing,
and so they are not the issues the Japanese raise as objections.
These issues are:

1)	There is an inherent bias against fixed-cell rendering
	technologies in the Unicode standard.

	Specifically, there is an apparent bias toward requiring the
	display system to contain a proprietary rendering technology
	-- in particular, PostScript and related technologies that
	result in licensing fees being paid to consortium members.

	This bias exists for ligatured languages -- that is, it
	exists for alphabetic languages, not for ideogrammatic
	languages like Japanese.  The problem is that ligatures
	change the glyph rendering, and there are no interspersed
	"private use" code points that can be overloaded in order to
	create a fixed-font rendering that doesn't depend on
	processing the ligatures at the rendering device.  This makes
	it difficult to support ligatured languages on X devices.

	Examples of ligatured languages: Tamil, Devanagari, Arabic,
	script Hebrew, script English, script German, etc.

	This issue can be worked around, either by "caving in" and
	paying the license fees for PostScript, or by doing a lot of
	work (as "xtamil" demonstrates).

2)	The use of 16-bit rather than 8-bit characters introduces
	synchronization issues for ttys, ptys, pipes, serial ports,
	byte-stream files, and other byte-stream oriented devices.

	This is resolvable through the use of wchar_t internally, and
	the use of reliable-delivery protocol encapsulation of the
	byte-streams, externally (see the sketch after this list).

3)	The common recommended encoding (generally espoused by the
	US-ASCII-using Unicode Consortium members) is UTF-7/UTF-8, on
	the theory that existing ASCII documents will not need
	conversion and/or attribution.

	This breaks fixed-field-length input mechanisms, fixed-field
	record implementations, and character (rather than byte)
	input method mechanisms (such as those used by X).  It breaks
	the ability to do record counting by dividing file size by
	record size.  It breaks the utility of memory-mapping files.
	It damages compressibility.  It weakens cryptographic
	standards by providing another vector for statistical
	analysis, based on common prefix bit patterns.  It greatly
	complicates most word-counting mechanisms, most
	protocol-based content interchanges, and any other place
	where the encoding must be converted into an internal
	representation.  It increases all processing overhead, due to
	the need to convert between the encoded form and the "raw"
	representation that is more useful for computing tasks.

	This is resolvable by storing the raw representation rather
	than the encoded form, despite the objections of the ASCII
	bigots.
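To make the fix in (2) concrete, and to show the conversion step
whose cost (3) complains about, here is a minimal sketch.  It assumes
the ISO C/POSIX multibyte conversion routines and an external
encoding taken from the locale; the buffer sizes are arbitrary:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Keep the variable-length encoding at the byte-stream boundary;
 * do all internal work on fixed-size wchar_t cells.
 */
int
main(void)
{
	char	mb[1024];	/* external byte-stream form */
	wchar_t	wc[1024];	/* internal fixed-cell form */
	size_t	n;

	(void)setlocale(LC_CTYPE, "");	/* encoding comes from the locale */

	if (fgets(mb, sizeof(mb), stdin) == NULL)
		return (1);

	/* Convert exactly once, on the way in. */
	n = mbstowcs(wc, mb, sizeof(wc) / sizeof(wc[0]));
	if (n == (size_t)-1)
		return (1);	/* bad sequence; the stream lost sync */

	/* ... all internal processing operates on wc[0..n) ... */

	/* Convert exactly once, on the way out. */
	if (wcstombs(mb, wc, sizeof(mb)) == (size_t)-1)
		return (1);
	(void)fputs(mb, stdout);
	return (0);
}

The conversions happen at the boundary and nowhere else; everything
between them sees fixed-size cells, which is the whole point of using
wchar_t internally.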
The Japanese don't have a ligatured language, they don't use anything
but byte-encoded data, and they are already used to putting up with
the slings and arrows associated with indeterminate storage encoding
length.

The main arguments that have been put forth by the Japanese
representatives to the Unicode Consortium are rather specious:

1)	You can't simultaneously represent text that needs to be
	rendered with alternate glyphs but which has unified code
	points.

	This is a valid criticism, if what you are building is a
	translation workbench between languages which do not share a
	common character set, or if you are engaging in linguistic
	scholarship.  This same criticism, however, is just as valid
	when you level it against ISO 2022.

	The answer is to use a markup language of some kind to do the
	font selection -- for example, SGML, or any of a dozen SGML
	DTD's (such as O'Reilly's "DocBook").  So while the criticism
	is valid, no other standard has been suggested as a
	workaround for in-band representation of character set
	selection.

	It seems to me that the common opinion among the other
	consortium members is that this is a straw man in support of
	other, less rational objections.

2)	You can't separate document content based on language, given
	only a raw Unicode document.

	This is a valid criticism as well, if what you are building
	is a translation workbench between languages which do not
	share a common character set, or if you are engaging in
	linguistic scholarship.  Once again, the criticism is equally
	valid against all other standards, and no suggestion has been
	made to resolve it, save the use of a markup language.

	It seems to me that the common opinion among the other
	consortium members is that this is a straw man in support of
	the irrational desire to be able to "grep -v" out all text in
	a compound document that is not Japanese.  That is, the
	opinion is that there is no technical basis for this
	objection.

3)	The lexical sequence of the character set is classified in
	what has been termed "Chinese Dictionary Order".

	This criticism is based on the irrational fear that Japanese
	text processing is somehow disadvantaged compared to that of
	other nationalities, specifically the Chinese, when it comes
	to being able to use the ordinal value of a character to do
	sorting.  This objection is irrational for a number of
	reasons:

	a)	The ordering is "stroke-radical"; this means that the
		order is *not* sufficient for correct lexical
		ordering of Chinese.

	b)	Japan has two dictionary orders.  It is impossible to
		select a single order and thus silence every possible
		Japanese objection.

	c)	Code page 0/8 of the Unicode standard (0/0/0/8 of
		ISO 10646) is in ISO 8859-1 order; Japan is thus not
		the only country which must employ separate collation
		tables:

		i)	Countries whose native character set is
			ISO 8859-X, where X != 1, must use a separate
			table.

		ii)	Countries whose native character set is a de
			facto rather than an ISO standard (such as
			KOI8-R and KOI8-U in the former Soviet
			republics) must use a separate table.

		iii)	Countries where there are multiple lexical
			orderings (such as German telephone book
			vs. German dictionary ordering of umlauted
			characters) must use a separate table.

		iv)	Countries that have problems to solve that
			occur only in alphabetic languages and not in
			ideogrammatic languages (such as
			case-insensitive collation in the United
			States) must use a separate table.

	d)	The JIS-208 ordering is not altered by the JIS-212
		extensions.

The Japanese representatives have not suggested an alternate
character classification algorithm that can encompass the unified
glyphs, including the non-Japanese glyphs in the CJK unification, yet
still result in JIS-208 lexical ordering of purely Japanese text.  In
other words, they haven't solved the problem, yet they refuse to let
anyone else solve it in a way which conflicts with existing partial
solutions of Japanese origin.
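Since collation tables keep coming up in (c) above, it is worth
showing what "must use a separate table" means in code.  This is a
minimal sketch, assuming the standard strcoll() routine and a locale
that supplies its own LC_COLLATE table; the sample strings are
arbitrary:

#include <locale.h>
#include <stdio.h>
#include <string.h>

/*
 * Ordinal order vs. collation-table order.  strcmp() compares raw
 * encoding values; strcoll() compares according to the LC_COLLATE
 * table of the current locale.
 */
int
main(void)
{
	const char *a = "Zebra";
	const char *b = "apple";

	(void)setlocale(LC_COLLATE, "");	/* load the locale's table */

	/* Ordinal: 'Z' (0x5a) sorts before 'a' (0x61). */
	printf("strcmp:  %d\n", strcmp(a, b));

	/* Table-driven: dictionary order may ignore case entirely. */
	printf("strcoll: %d\n", strcoll(a, b));
	return (0);
}

In the "C" locale the two agree; in a United States locale with
case-insensitive collation, strcoll() sorts "apple" ahead of "Zebra"
while strcmp() does the opposite.  The table, not the ordinal value
of the code point, does the work, no matter whose dictionary order
the code points happen to follow.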
Unicode is a tool for Internationalization.  Internationalization is
the process of creating code that allows data-driven localization to
a single locale or, more broadly, to a single round-trip character
set.

Internationalization is *NOT* the process of creating code that can
simultaneously process documents containing text in several
non-subset round-trip character sets (for example, a Japanese
language teaching text written in the Indic script Devanagari, or in
Arabic).  That process is called "multinationalization".

The utility of multinationalized software is limited to linguistic
scholarship, human translation processing, and similar pursuits.  It
is an acceptable trade-off to require the authors of such tools to
bear the additional cost of processing a markup language, in order to
simplify the requirements for the *VAST* majority of applications
that do not require multinationalization.

If multinationalization is truly the real issue for the Japanese, or
for anyone else for that matter, they are free to petition the ISO
for allocation of ISO 10646 code pages other than page 0, which is
now allocated for use by Unicode.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message