Date: Thu, 1 Mar 2001 00:02:07 -0600
From: "Michael C . Wu" <keichii@iteration.net>
To: Terry Lambert <tlambert@primenet.com>
Cc: Jonathan Graehl <jonathan@graehl.org>, freebsd-Arch <freebsd-arch@FreeBSD.ORG>, i18n@freebsd.org
Subject: Re: Unicode, command line options, and configuration files, oh my!
Message-ID: <20010301000207.C4359@peorth.iteration.net>
In-Reply-To: <200103010541.WAA17385@usr05.primenet.com>; from tlambert@primenet.com on Thu, Mar 01, 2001 at 05:41:22AM +0000
References: <NCBBLOALCKKINBNNEDDLAELNDLAA.jonathan@graehl.org> <200103010541.WAA17385@usr05.primenet.com>

Use -i18n please. :)

On Thu, Mar 01, 2001 at 05:41:22AM +0000, Terry Lambert scribbled:
| [ ... Unicode ... ]
|
| UTF encoded data is not fixed length in size.
|
| POSIX specifies that file names can be up to 256 characters.
|
| 256 characters UTF-8 encoded can vary from 256 to 1280
| characters.
|
| In general, this means that for Unicode data stored for
| directory entries would require that a directory entry
| block would have to be 512b, whereas for UTF-8, we are
| talking 2048b (2k).
|
| If the same approach is used as the current UFS code uses,
| then these operations will need to be directory entry block
| atomic.

In short, we can save the file name that the user sees with
the file data.  The filesystem and the kernel see some other
naming scheme determined by the FS/kernel.

| FS stuff aside, most programs should use internal encoding.
|
| For FS storage, fixed data records are also a problem, when
| using UTF-8 encoding.  The same goes for the ability to
| store fixed size input forms field data in databases, which
| like constraints set on record sizes.
|
|
| > There doesn't seem to be any impetus to systematically adopt
| > Unicode (especially the fixed-two-bytes-per-char variant,
| > which for most cases would simply double the storage/bandwidth
| > requirement), although there are user-applications which
| > operate on multibyte text.
|
| UTF-8 is one character per byte for US ASCII, two bytes for
| the high page (128 characters) of ISO 8859-1, and three or more
| bytes for anything else.

Bad design. period.

| The idea that storage requirements increase is U.S. centric;
| all other character sets are penalized at least as much as if
| it were directly encoded instead of multibyte encoded, and
| the vast majority more penalized.

Yup, bad design. :)

| On top of that, we have Microsoft and Java interoperability to
| consider, distasteful as that may be to some.

M$ has a pretty good implementation here.
Java I18N sucks really bad.

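The per-character sizes argued about above are easy to check. What
follows is only a rough sketch in Python, not anything posted in the
thread: the sample characters, NAME_CHARS, and the helper names
utf8_len, utf16_len and name_bytes are all made up for illustration.
Python's codec follows the current UTF-8 definition (at most 4 bytes
per code point), so it will not reproduce the 5-bytes-per-character
case that the 1280 figure apparently assumes.

```python
# Sketch: storage cost per character in UTF-8 versus a 2-byte encoding.
# Everything named here is an illustrative assumption, not thread content.

NAME_CHARS = 256  # file name length limit used in the quoted text

# One sample character per range (hypothetical choices).
samples = {
    "US-ASCII": "a",
    "ISO 8859-1 high page": "\u00e9",  # e with acute accent
    "Cyrillic": "\u0436",
    "CJK": "\u4e2d",
}


def utf8_len(text):
    """Bytes needed to store text as UTF-8."""
    return len(text.encode("utf-8"))


def utf16_len(text):
    """Bytes needed to store text as UTF-16 without a byte order mark."""
    return len(text.encode("utf-16-le"))


def name_bytes(ch, encoder=utf8_len):
    """Bytes for a NAME_CHARS-long name made only of the character ch."""
    return encoder(ch * NAME_CHARS)


if __name__ == "__main__":
    for label, ch in samples.items():
        print(
            f"{label:22} "
            f"UTF-8 {utf8_len(ch)} B/char, {name_bytes(ch):4} B/name   "
            f"UTF-16 {utf16_len(ch)} B/char, "
            f"{name_bytes(ch, utf16_len):4} B/name"
        )
```

For characters in the Basic Multilingual Plane the 2-byte encoding
always gives 512 bytes for a 256-character name, which is the 512b
figure quoted above; the UTF-8 size depends on which characters the
name actually contains.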
| There's an interesting list of Unicode resources available at:
| http://www.unicode.org/unicode/onlinedat/products.html

-- 
+------------------------------------------------------------------+
| keichii@peorth.iteration.net         | keichii@bsdconspiracy.net |
| http://peorth.iteration.net/~keichii | Yes, BSD is a conspiracy. |
+------------------------------------------------------------------+

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message