From owner-freebsd-arch  Wed Feb 28 21:41:42 2001
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <200103010541.WAA17385@usr05.primenet.com>
Subject: Re: Unicode, command line options, and configuration files, oh my!
To: jonathan@graehl.org (Jonathan Graehl)
Date: Thu, 1 Mar 2001 05:41:22 +0000 (GMT)
Cc: freebsd-arch@FreeBSD.ORG (freebsd-Arch)
In-Reply-To: <NCBBLOALCKKINBNNEDDLAELNDLAA.jonathan@graehl.org> from "Jonathan Graehl" at Feb 28, 2001 01:48:49 PM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

[ ... Unicode ... ]

UTF-8 encoded data is not fixed length.

POSIX specifies that file names can be up to 256 characters.

256 characters, UTF-8 encoded, can occupy anywhere from 256
to 1280 bytes.

In general, this means that fixed-width (two bytes per
character) Unicode data stored in directory entries would
require a directory entry block of 512 bytes, whereas for
UTF-8 we are talking about 2048 bytes (2k).
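
To make the sizing arithmetic concrete, here is a minimal Python
sketch (not part of the original discussion). NAME_CHARS and
encoded_size() are hypothetical names used only for illustration,
and "utf-16-le" stands in for the fixed two-bytes-per-character
form, which holds only for the BMP characters sampled here. Note
that Python's codec follows the current UTF-8 definition, which
stops at 4 bytes per character, so it cannot reproduce the 1280
byte figure above.

```python
# Hypothetical illustration: bytes needed for a 256-character name.
NAME_CHARS = 256  # character limit taken from the text above


def encoded_size(ch: str, encoding: str) -> int:
    """Bytes needed to store NAME_CHARS copies of `ch` in `encoding`."""
    return len((ch * NAME_CHARS).encode(encoding))


# Sample characters, all inside the Basic Multilingual Plane.
samples = {
    "US ASCII 'a'": "a",
    "ISO 8859-1 e-acute": "\u00e9",
    "CJK U+65E5": "\u65e5",
}

for label, ch in samples.items():
    utf8 = encoded_size(ch, "utf-8")
    fixed = encoded_size(ch, "utf-16-le")  # no BOM, 2 bytes per BMP char
    print(f"{label}: utf-8={utf8} bytes, two-byte={fixed} bytes")
```

The two-byte column stays constant for these samples, while the
UTF-8 column depends on which characters the name contains, which
is what makes sizing a directory entry block awkward.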

If the same approach is used as in the current UFS code,
then these operations will need to be atomic at the
directory entry block level.

FS stuff aside, most programs should use internal encoding.

For FS storage, fixed-size data records are also a problem
when using UTF-8 encoding.  The same goes for storing
fixed-size input form field data in databases, which like
to have constraints set on record sizes.
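
A small Python sketch of the fixed-size record problem follows.
It is only an illustration under assumed names: fits() and
truncate_utf8() are hypothetical helpers, not anything from UFS
or a particular database. The point is that a field sized in
characters no longer maps to a fixed number of bytes, and a
naive cut at the byte limit can land inside a multibyte sequence.

```python
def fits(text: str, record_bytes: int) -> bool:
    """True if `text`, UTF-8 encoded, fits in a record of record_bytes."""
    return len(text.encode("utf-8")) <= record_bytes


def truncate_utf8(text: str, record_bytes: int) -> bytes:
    """Cut to at most record_bytes without splitting a multibyte sequence."""
    raw = text.encode("utf-8")[:record_bytes]
    # The slice may end mid-character; errors="ignore" drops that
    # trailing partial sequence, and re-encoding yields clean bytes.
    return raw.decode("utf-8", errors="ignore").encode("utf-8")


ascii_field = "a" * 20
latin_field = "\u00e9" * 20  # 20 characters, 40 bytes in UTF-8

print(fits(ascii_field, 20))
print(fits(latin_field, 20))
print(len(truncate_utf8(latin_field, 21)))
```

With a 20-byte record, the 20-character ASCII field fits and the
20-character accented field does not; truncating the latter to 21
bytes actually keeps only 20, since the 21st byte would be half
of a character.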


> There doesn't seem to be any impetus to systematically adopt
> Unicode (especially the fixed-two-bytes-per-char variant,
> which for most cases would simply double the storage/bandwidth
> requirement), although there are user-applications which
> operate on multibyte text.

UTF-8 is one byte per character for US ASCII, two bytes for
code points U+0080 through U+07FF (which includes the high
page, 128 characters, of ISO 8859-1), and three or more bytes
for anything else.
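
The length rule can be sketched in a few lines of Python. This is
an illustration only: utf8_len() is a hypothetical helper, and it
assumes the current UTF-8 definition (4 bytes maximum). The loop
cross-checks the helper against Python's own encoder for a few
sample code points.

```python
def utf8_len(code_point: int) -> int:
    """Bytes UTF-8 needs for one code point (current 4-byte-max definition)."""
    if code_point < 0x80:
        return 1  # US ASCII
    if code_point < 0x800:
        return 2  # includes the high page of ISO 8859-1
    if code_point < 0x10000:
        return 3  # rest of the Basic Multilingual Plane
    return 4      # supplementary planes


# Sample code points: 'A', e-acute, Greek Omega, a CJK ideograph.
# Surrogates (U+D800..U+DFFF) are skipped; they cannot be encoded.
for cp in (0x41, 0xE9, 0x3A9, 0x65E5):
    assert utf8_len(cp) == len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: {utf8_len(cp)} byte(s)")
```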

The idea that storage requirements increase is U.S.-centric;
every other character set is penalized at least as much under
UTF-8 as it would be if directly encoded rather than multibyte
encoded, and the vast majority are penalized more.

On top of that, we have Microsoft and Java interoperability to
consider, distasteful as that may be to some.

There's an interesting list of Unicode resources available at:
http://www.unicode.org/unicode/onlinedat/products.html


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
