From: Terry Lambert
Message-Id: <200103010541.WAA17385@usr05.primenet.com>
Subject: Re: Unicode, command line options, and configuration files, oh my!
To: jonathan@graehl.org (Jonathan Graehl)
Date: Thu, 1 Mar 2001 05:41:22 +0000 (GMT)
Cc: freebsd-arch@FreeBSD.ORG (freebsd-Arch)
In-Reply-To: from "Jonathan Graehl" at Feb 28, 2001 01:48:49 PM

[ ... Unicode ... ]

UTF-8 encoded data is not fixed in size. POSIX specifies that file names can be up to 256 characters. 256 characters UTF-8 encoded can vary from 256 to 1280 bytes.

In general, this means that fixed-width Unicode data stored in directory entries would require a 512-byte directory entry block (256 characters at two bytes each), whereas for UTF-8 we are talking 2048 bytes (2k). If the same approach is used as in the current UFS code, then these operations will need to be atomic per directory entry block.

FS stuff aside, most programs should use internal encoding.

For FS storage, fixed data records are also a problem when using UTF-8 encoding. The same goes for storing fixed-size input form field data in databases, which like constraints set on record sizes.
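The size arithmetic above can be sketched roughly as follows. This is an illustration only, not anything taken from UFS: the constants are the figures quoted in the message, the five-bytes-per-character worst case is inferred from 1280 / 256, and rounding up to a power of two is an assumed reading of how 1280 becomes 2048. The names `next_power_of_two` and `utf8_len` are hypothetical helpers.

```python
# Illustrative sketch only; the limits below are the figures quoted in the
# message, not values read from any system header.
NAME_CHARS = 256          # file name length limit as stated in the message
FIXED_WIDTH_BYTES = 2     # fixed two-bytes-per-character Unicode
UTF8_WORST_BYTES = 5      # assumption: implied by the 1280 figure (1280 / 256)


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (n must be positive)."""
    return 1 << (n - 1).bit_length()


def utf8_len(text: str) -> int:
    """Number of bytes needed to store text as UTF-8."""
    return len(text.encode("utf-8"))


fixed_bytes = NAME_CHARS * FIXED_WIDTH_BYTES       # fixed-width record size
utf8_min_bytes = NAME_CHARS * 1                    # every character is US ASCII
utf8_max_bytes = NAME_CHARS * UTF8_WORST_BYTES     # every character is worst case

print("fixed-width name:", fixed_bytes, "bytes")
print("UTF-8 name:", utf8_min_bytes, "to", utf8_max_bytes, "bytes")
print("block size for UTF-8 worst case:", next_power_of_two(utf8_max_bytes))

# The same character count can occupy different byte counts in UTF-8,
# which is what makes fixed-size records awkward.
for sample in ("a" * 4, "\u00e9" * 4, "\u65e5" * 4):
    print(len(sample), "characters ->", utf8_len(sample), "bytes")
```

The point of the last loop is that a field sized in characters gives no fixed bound in bytes unless it is sized for the worst case.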
> There doesn't seem to be any impetus to systematically adopt
> Unicode (especially the fixed-two-bytes-per-char variant,
> which for most cases would simply double the storage/bandwidth
> requirement), although there are user-applications which
> operate on multibyte text.

UTF-8 is one byte per character for US ASCII, two bytes for the high page (128 characters) of ISO 8859-1, and three or more bytes for anything else.

The idea that storage requirements increase is U.S.-centric; all other character sets are penalized at least as much as if they were directly encoded instead of multibyte encoded, and the vast majority are penalized more.

On top of that, we have Microsoft and Java interoperability to consider, distasteful as that may be to some.

There's an interesting list of Unicode resources available at:

http://www.unicode.org/unicode/onlinedat/products.html

Terry Lambert
terry@lambert.org
---
Any opinions in this posting are my own and not those of my present or previous employers.