From owner-freebsd-arch  Wed Feb 28 22: 2: 9 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from peorth.iteration.net (peorth.iteration.net [208.190.180.178])
	by hub.freebsd.org (Postfix) with ESMTP
	id AE32C37B719; Wed, 28 Feb 2001 22:02:01 -0800 (PST)
	(envelope-from keichii@peorth.iteration.net)
Received: by peorth.iteration.net (Postfix, from userid 1001)
	id 96CE85955B; Thu,  1 Mar 2001 00:02:07 -0600 (CST)
Date: Thu, 1 Mar 2001 00:02:07 -0600
From: "Michael C . Wu" <keichii@iteration.net>
To: Terry Lambert <tlambert@primenet.com>
Cc: Jonathan Graehl <jonathan@graehl.org>,
	freebsd-Arch <freebsd-arch@FreeBSD.ORG>, i18n@freebsd.org
Subject: Re: Unicode, command line options, and configuration files, oh my!
Message-ID: <20010301000207.C4359@peorth.iteration.net>
Reply-To: "Michael C . Wu" <keichii@peorth.iteration.net>
Mail-Followup-To: "Michael C . Wu" <keichii@iteration.net>,
	Terry Lambert <tlambert@primenet.com>,
	Jonathan Graehl <jonathan@graehl.org>,
	freebsd-Arch <freebsd-arch@FreeBSD.ORG>, i18n@freebsd.org
References: <NCBBLOALCKKINBNNEDDLAELNDLAA.jonathan@graehl.org> <200103010541.WAA17385@usr05.primenet.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <200103010541.WAA17385@usr05.primenet.com>; from tlambert@primenet.com on Thu, Mar 01, 2001 at 05:41:22AM +0000
X-PGP-Fingerprint: 5025 F691 F943 8128 48A8  5025 77CE 29C5 8FA1 2E20
X-PGP-Key-ID: 0x8FA12E20
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Use -i18n please. :)

On Thu, Mar 01, 2001 at 05:41:22AM +0000, Terry Lambert scribbled:
| [ ... Unicode ... ]
| 
| UTF encoded data is not fixed length in size.
| 
| POSIX specifies that file names can be up to 256 characters.
| 
| 256 characters UTF-8 encoded can vary from 256 to 1280
| bytes.
|
| In general, this means that Unicode data stored in
| directory entries would require a directory entry
| block of 512b, whereas for UTF-8 we are
| talking 2048b (2k).
| 
| If the same approach is used as the current UFS code uses,
| then these operations will need to be directory entry block
| atomic.

In short, we can save the file name that the user sees
with the file data.  The filesystem and the kernel see
some other naming scheme determined by the FS/kernel.
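
For what it's worth, the sizing numbers quoted above can be reproduced
with a rough sketch like the one below. It is only an illustration, not
how UFS lays out directory entries: it ignores the inode and
record-length fields, assumes a 256-character limit as quoted, assumes a
5-byte worst case per UTF-8 character (implied by the quoted 1280
figure), and assumes blocks are rounded up to a power of two. All names
in it are made up for the example.

```python
# Worst-case storage for a 256-character file name under two encodings.
# Assumptions (not from any real filesystem code): the limits below and
# power-of-two rounding for the directory entry block.

MAX_NAME_CHARS = 256          # name length limit quoted in the message
UCS2_BYTES_PER_CHAR = 2       # fixed-width 16-bit encoding
UTF8_MAX_BYTES_PER_CHAR = 5   # assumption: implied by the quoted 1280 figure


def next_power_of_two(n):
    """Return the smallest power of two that is >= n (for n >= 1)."""
    size = 1
    while size < n:
        size *= 2
    return size


def worst_case_name_bytes(bytes_per_char, chars=MAX_NAME_CHARS):
    """Bytes needed if every character takes bytes_per_char bytes."""
    return bytes_per_char * chars


ucs2_bytes = worst_case_name_bytes(UCS2_BYTES_PER_CHAR)
utf8_bytes = worst_case_name_bytes(UTF8_MAX_BYTES_PER_CHAR)

print("fixed 16-bit:", ucs2_bytes, "bytes, block", next_power_of_two(ucs2_bytes))
print("UTF-8 worst: ", utf8_bytes, "bytes, block", next_power_of_two(utf8_bytes))
```

Under those assumptions the fixed-width name fits a 512-byte block, and
the UTF-8 worst case needs a 2048-byte one.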

| FS stuff aside, most programs should use internal encoding.
| 
| For FS storage, fixed data records are also a problem, when
| using UTF-8 encoding.  The same goes for the ability to
| store fixed size input forms field data in databases, which
| like constraints set on record sizes.
| 
| 
| > There doesn't seem to be any impetus to systematically adopt
| > Unicode (especially the fixed-two-bytes-per-char variant,
| > which for most cases would simply double the storage/bandwidth
| > requirement), although there are user-applications which
| > operate on multibyte text.
| 
| UTF-8 is one byte per character for US ASCII, two bytes for
| the high page (128 characters) of ISO 8859-1, and three or more
| bytes for anything else.

Bad design. Period.
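
To see the per-character cost being argued about, a small sketch like
this compares UTF-8 against a fixed 16-bit encoding. The three sample
characters and the helper names are my own picks for illustration, and
"utf-16-be" is used only as a stand-in for the fixed-two-bytes-per-char
variant mentioned above (it matches for these samples, which are all in
the basic 16-bit range).

```python
# Compare encoded sizes of a few sample characters.
# The samples are illustrative choices, not taken from the thread.

SAMPLES = [
    ("A", "US-ASCII letter"),
    ("\u00e9", "ISO 8859-1 high half (e with acute)"),
    ("\u4e2d", "CJK ideograph"),
]


def utf8_len(ch):
    """Number of bytes the character takes in UTF-8."""
    return len(ch.encode("utf-8"))


def fixed16_len(ch):
    """Number of bytes in big-endian UTF-16 without a byte-order mark."""
    return len(ch.encode("utf-16-be"))


for ch, description in SAMPLES:
    print(f"U+{ord(ch):04X} {description}: "
          f"UTF-8 {utf8_len(ch)} bytes, 16-bit {fixed16_len(ch)} bytes")
```

ASCII gets cheaper under UTF-8, the 8859-1 high half breaks even, and
the CJK character costs one byte more than it would in 16 bits.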

| The idea that storage requirements increase is U.S. centric;
| all other character sets are penalized at least as much as if
| it were directly encoded instead of multibyte encoded, and
| the vast majority more penalized.

Yup, bad design. :)

| On top of that, we have Microsoft and Java interoperability to
| consider, distasteful as that may be to some.

M$ has a pretty good implementation here.
Java I18N sucks really bad.

| There's an interesting list of Unicode resources available at:
| http://www.unicode.org/unicode/onlinedat/products.html

-- 
+------------------------------------------------------------------+
| keichii@peorth.iteration.net         | keichii@bsdconspiracy.net |
| http://peorth.iteration.net/~keichii | Yes, BSD is a conspiracy. |
+------------------------------------------------------------------+
