From owner-freebsd-arch Wed Feb 28 22:02:09 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from peorth.iteration.net (peorth.iteration.net [208.190.180.178])
	by hub.freebsd.org (Postfix) with ESMTP
	id AE32C37B719; Wed, 28 Feb 2001 22:02:01 -0800 (PST)
	(envelope-from keichii@peorth.iteration.net)
Received: by peorth.iteration.net (Postfix, from userid 1001)
	id 96CE85955B; Thu, 1 Mar 2001 00:02:07 -0600 (CST)
Date: Thu, 1 Mar 2001 00:02:07 -0600
From: "Michael C . Wu"
To: Terry Lambert
Cc: Jonathan Graehl, freebsd-Arch, i18n@freebsd.org
Subject: Re: Unicode, command line options, and configuration files, oh my!
Message-ID: <20010301000207.C4359@peorth.iteration.net>
Reply-To: "Michael C . Wu"
Mail-Followup-To: "Michael C . Wu", Terry Lambert, Jonathan Graehl,
	freebsd-Arch, i18n@freebsd.org
References: <200103010541.WAA17385@usr05.primenet.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <200103010541.WAA17385@usr05.primenet.com>; from tlambert@primenet.com on Thu, Mar 01, 2001 at 05:41:22AM +0000
X-PGP-Fingerprint: 5025 F691 F943 8128 48A8 5025 77CE 29C5 8FA1 2E20
X-PGP-Key-ID: 0x8FA12E20
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Use -i18n please. :)

On Thu, Mar 01, 2001 at 05:41:22AM +0000, Terry Lambert scribbled:
| [ ... Unicode ... ]
|
| UTF encoded data is not fixed length in size.
|
| POSIX specifies that file names can be up to 256 characters.
|
| 256 characters UTF-8 encoded can vary from 256 to 1280
| characters.
|
| In general, this means that for Unicode data stored for
| directory entries would require that a directory entry
| block would have to be 512b, whereas for UTF-8, we are
| talking 2048b (2k).
|
| If the same approach is used as the current UFS code uses,
| then these operations will need to be directory entry block
| atomic.

In short, we can save the file name that the user sees with the file data.
The filesystem and the kernel see some other naming scheme determined by
the FS/kernel.

| FS stuff aside, most programs should use internal encoding.
|
| For FS storage, fixed data records are also a problem, when
| using UTF-8 encoding.  The same goes for the ability to
| store fixed size input forms field data in databases, which
| like constraints set on record sizes.
|
|
| > There doesn't seem to be any impetus to systematically adopt
| > Unicode (especially the fixed-two-bytes-per-char variant,
| > which for most cases would simply double the storage/bandwidth
| > requirement), although there are user-applications which
| > operate on multibyte text.
|
| UTF-8 is one character per byte for US ASCII, two bytes for
| the high page (128 characters) of ISO 8859-1, and three or more
| bytes for anything else.

Bad design. Period.

| The idea that storage requirements increase is U.S. centric;
| all other character sets are penalized at least as much as if
| it were directly encoded instead of multibyte encoded, and
| the vast majority more penalized.

Yup, bad design. :)

| On top of that, we have Microsoft and Java interoperability to
| consider, distasteful as that may be to some.

M$ has a pretty good implementation here.  Java I18N sucks really bad.

| There's an interesting list of Unicode resources available at:
| http://www.unicode.org/unicode/onlinedat/products.html

-- 
+------------------------------------------------------------------+
| keichii@peorth.iteration.net         | keichii@bsdconspiracy.net |
| http://peorth.iteration.net/~keichii | Yes, BSD is a conspiracy. |
+------------------------------------------------------------------+
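
For a concrete sense of the variable-length point in the quoted text, here is a minimal sketch in Python. It is not part of the original exchange: `utf8_len`, the sample characters, and `NAME_CHARS` are illustrative assumptions. Python's encoder follows the current definition of UTF-8, which tops out at four bytes per character, so its worst case for 256 characters is 1024 bytes rather than the 1280 (five bytes per character) figure quoted above.

```python
# Hypothetical helper for illustration only; not from the original message.
def utf8_len(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# One sample character from each size class discussed in the thread.
samples = {
    "US-ASCII 'a'": "a",
    "ISO 8859-1 high half (U+00E9)": "\u00e9",
    "CJK ideograph (U+4E2D)": "\u4e2d",
    "outside the BMP (U+10000)": "\U00010000",
}

for label, ch in samples.items():
    print(f"{label}: {utf8_len(ch)} byte(s)")

# File name length taken from the quoted text (an assumption, not a
# statement about what any particular filesystem enforces).
NAME_CHARS = 256
shortest = utf8_len("a" * NAME_CHARS)           # all one-byte characters
longest = utf8_len("\U00010000" * NAME_CHARS)   # all four-byte characters
print(f"{NAME_CHARS} characters: {shortest} to {longest} bytes")
```

The spread between `shortest` and `longest` is the reason a fixed-size directory entry has to be sized for the worst case if names are stored as UTF-8.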