From owner-freebsd-hackers  Wed Jun 10 22:47:54 1998
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id WAA26244
          for freebsd-hackers-outgoing; Wed, 10 Jun 1998 22:47:54 -0700 (PDT)
          (envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from newserv.urc.ac.ru (newserv.urc.ac.ru [193.233.85.48])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id WAA26127
          for <hackers@freebsd.org>; Wed, 10 Jun 1998 22:47:16 -0700 (PDT)
          (envelope-from joy@urc.ac.ru)
Received: from urc.ac.ru (y.urc.ac.ru [193.233.85.37])
	by newserv.urc.ac.ru (8.8.8/8.8.8) with ESMTP id LAA27440;
	Thu, 11 Jun 1998 11:41:33 +0600 (ESS)
	(envelope-from joy@urc.ac.ru)
Message-ID: <357F6E0D.FE51B0B2@urc.ac.ru>
Date: Thu, 11 Jun 1998 11:41:33 +0600
From: Konstantin Chuguev <joy@urc.ac.ru>
Organization: South Ural Regional Center of FREEnet
X-Mailer: Mozilla 4.05 [en] (X11; I; FreeBSD 3.0-CURRENT i386)
MIME-Version: 1.0
To: Terry Lambert <tlambert@primenet.com>
CC: Gary Kline <kline@tao.thought.org>, hackers@FreeBSD.ORG
Subject: Re: internationalization
References: <199806102155.OAA13862@usr01.primenet.com>
Content-Type: text/plain; charset=koi8-r
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Terry Lambert wrote:
> 
> I would prefer going to a full-on Unicode implementation to support
> all known human languages.
> 
I agree, if only because Unicode is not just a character set or a
subset of ISO 10646, but also a database of character names,
collation rules, bidirectional-writing properties, case mappings
and transliteration rules. This is hugely important for text processing.

I am afraid ISO 2022 lacks these capabilities.

Another advantage is the very simple conversion mechanism between
UTF-8 and UCS-2/UCS-4, i.e. between the multibyte and wide character
encodings.
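
To show how simple that conversion is, here is a minimal sketch in C.
The function names ucs4_to_utf8() and utf8_to_ucs4() are hypothetical,
not an existing FreeBSD API, and the sketch assumes the 4-byte limit
(code points up to U+10FFFF) rather than the longer 5- and 6-byte forms
of the original UTF-8 definition.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one 32-bit code point as UTF-8 into out[] (at least 4 bytes).
 * Returns the number of bytes written, or 0 if cp cannot be encoded
 * (a surrogate, or a value above U+10FFFF).
 */
size_t
ucs4_to_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                        /* plain ASCII, one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx + 3 continuation bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/*
 * Decode one code point from in[0..len-1] into *cp.
 * Returns the number of bytes consumed, or 0 on malformed, truncated
 * or overlong input.
 */
size_t
utf8_to_ucs4(const unsigned char *in, size_t len, uint32_t *cp)
{
    /* Smallest code point that legitimately needs n bytes. */
    static const uint32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t n, i;
    uint32_t c;

    if (len == 0)
        return 0;
    if (in[0] < 0x80) {
        *cp = in[0];
        return 1;
    } else if ((in[0] & 0xE0) == 0xC0) {
        n = 2; c = in[0] & 0x1F;
    } else if ((in[0] & 0xF0) == 0xE0) {
        n = 3; c = in[0] & 0x0F;
    } else if ((in[0] & 0xF8) == 0xF0) {
        n = 4; c = in[0] & 0x07;
    } else {
        return 0;                           /* stray continuation or bad lead byte */
    }
    if (len < n)
        return 0;
    for (i = 1; i < n; i++) {
        if ((in[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (in[i] & 0x3F);
    }
    if (c < min_cp[n] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return n;
}
```

Both directions are a handful of shifts and masks with no table lookup,
which is the point: the conversion is cheap enough to run wherever the
two representations meet.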

We need both encodings: the first for ASCII compatibility (and for
compatibility with zero-terminated C (char *) strings) and the second
for fast searching/sorting.
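
A small illustration of the trade-off, as a sketch only (the name
utf8_char_count() is hypothetical). A UTF-8 string is still an ordinary
zero-terminated (char *) string, but counting or indexing its characters
means scanning the bytes, whereas in a wide character array the n-th
character is simply element n.

```c
#include <stddef.h>

/*
 * Count the characters in a zero-terminated UTF-8 string by skipping
 * continuation bytes (10xxxxxx). Assumes the input is valid UTF-8.
 */
size_t
utf8_char_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;                /* lead byte or ASCII: a new character */
    }
    return n;
}
```

Note that strlen() on the same string returns the byte count, not the
character count, so the two differ as soon as the text is not pure ASCII.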

> I would suggest an initial 16 bit wchar_t with an assumption of a
> zero valued code page designator.  If ISO ever gets around to adding
> other code pages, we can deal with that at that time using page
> selection.  Meanwhile, we'll be able to interoperate with Microsoft
> and Java, which use 16 bit wchar_t encodings.
> 
> I think the first (and hardest) step is the shells.  The shells need
> to be internationalized based on the fact that they (can) interpret
> exit codes to the user as error messages.
> 
> The last time I converted csh, this was absolute hell because the
> code was badly organized for internationalization.
> 
> The next hardest step is the editors, starting with "vi".  They have
> to be able to support Unicode.
> 
That consists of two levels: the character set level (wchar, mbyte,
conversion, locale's LANG etc.) and message catalogues (locale's LANG).
IMO, the second should be done only after the first is fully worked out.

> I have had FS-based Unicode support working for a very long time,
> though it has failed to be committed.  One big issue is that directory
> entry blocks must grow from 512b to 1k.  This has a number of
> implications to the soft updates work currently in progress.  This is
> because, in order to support a maximally sized path component, 512 + 24
> bytes is needed for Unicode, as opposed to 256 + 24 (which fits in 512b)
> for an 8 bit character set.
> 
Do you mean processing UCS-2 in the kernel (at the FS level)?
I'm asking because every application expects file names to be
zero-terminated strings of 8-bit characters. It does not matter whether
they are ASCII or any multibyte charset. So we would then need a
conversion between UCS-2 and UTF-8 (or probably the locale's charset)
in the kernel.
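
As a rough sketch of what such a routine might look like (the name
ucs2_name_to_utf8() and its interface are hypothetical, not an existing
kernel function), the following turns a 16-bit file name as stored on
disk into the zero-terminated 8-bit string an application expects. It
assumes names are limited to 16-bit characters, in which case each
character needs at most 3 bytes of UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Worst case for a 16-bit character: 3 bytes of UTF-8. */
#define UTF8_MAX_PER_UCS2   3

/*
 * Convert a name of n 16-bit characters into a zero-terminated UTF-8
 * string in out[0..outsz-1]. Returns the length in bytes (excluding
 * the terminator), or (size_t)-1 if the name contains U+0000 or a
 * surrogate, or if the buffer is too small.
 */
size_t
ucs2_name_to_utf8(const uint16_t *name, size_t n, char *out, size_t outsz)
{
    size_t i, o = 0;

    if (outsz == 0)
        return (size_t)-1;
    for (i = 0; i < n; i++) {
        uint16_t c = name[i];
        size_t need = (c < 0x80) ? 1 : (c < 0x800) ? 2 : UTF8_MAX_PER_UCS2;

        /* A zero would end the C string early; surrogates are invalid. */
        if (c == 0 || (c >= 0xD800 && c <= 0xDFFF))
            return (size_t)-1;
        /* Always keep one byte free for the terminator. */
        if (outsz - o <= need)
            return (size_t)-1;

        if (need == 1) {
            out[o++] = (char)c;
        } else if (need == 2) {
            out[o++] = (char)(0xC0 | (c >> 6));
            out[o++] = (char)(0x80 | (c & 0x3F));
        } else {
            out[o++] = (char)(0xE0 | (c >> 12));
            out[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[o] = '\0';
    return o;
}
```

A caller would size the output buffer as n * UTF8_MAX_PER_UCS2 + 1 bytes
to be safe for any 16-bit name.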

> If we were to do something stupid, like UTF-7 or UTF-8, it would have
> to grow to 5 * 256 + 24, minimally, to support 5:1 character expansion
> possible, as opposed to the 2:1 of flat Unicode encoding.
> 
> For character set attributed FS's (like NFS v2/v3 will have to be), you
> can do the translation in the kernel on the blocks on their way out
> (a 2:1 expansion in memory of a 1:1 disk image for a given ISO character
> set attribution for the filesystem).
> 
That is another reason for including conversion routines in the kernel.

--
	Konstantin V. Chuguev.		System administrator of Southern
	http://www.urc.ac.ru/~joy/	Ural Regional Center of FREEnet,
	mailto:joy@urc.ac.ru		Chelyabinsk, Russia.
