From owner-freebsd-i18n Wed Feb 28 21:59:27 2001
Delivered-To: freebsd-i18n@freebsd.org
Received: from peorth.iteration.net (peorth.iteration.net [208.190.180.178])
    by hub.freebsd.org (Postfix) with ESMTP id B7BC537B718;
    Wed, 28 Feb 2001 21:59:19 -0800 (PST)
    (envelope-from keichii@peorth.iteration.net)
Received: by peorth.iteration.net (Postfix, from userid 1001)
    id 445625955B; Wed, 28 Feb 2001 23:59:25 -0600 (CST)
Date: Wed, 28 Feb 2001 23:59:25 -0600
From: "Michael C . Wu"
To: Jonathan Graehl
Cc: freebsd-Arch, i18n@freebsd.org
Subject: Re: Unicode, command line options, and configuration files, oh my!
Message-ID: <20010228235925.B4359@peorth.iteration.net>
Reply-To: "Michael C . Wu"
Mail-Followup-To: "Michael C . Wu", Jonathan Graehl, freebsd-Arch, i18n@freebsd.org
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: ; from jonathan@graehl.org on Wed, Feb 28, 2001 at 01:48:49PM -0800
X-PGP-Fingerprint: 5025 F691 F943 8128 48A8 5025 77CE 29C5 8FA1 2E20
X-PGP-Key-ID: 0x8FA12E20
Sender: owner-freebsd-i18n@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

People, there is a freebsd-i18n@freebsd.org list for a reason.

On Wed, Feb 28, 2001 at 01:48:49PM -0800, Jonathan Graehl scribbled:
| How much change would be needed to have a Unicode-capable FreeBSD system?

A lot. And a lot more.

| Supposing the variable-length encoding is used, all existing text output,
| filenames, and string-based kernel interfaces should be compliant (although not

No, they are not that easy.

| capable of understanding multiple-byte-char input/output); would command line
| options be passed as byte-strings by a Unicode-capable shell?

No.

| There doesn't seem to be any impetus to systematically adopt Unicode (especially
| the fixed-two-bytes-per-char variant, which for most cases would simply double
| the storage/bandwidth requirement), although there are user-applications which

Not that easy :) Trust me.
| operate on multibyte text.  I am sure that by now admins and programmers in
| country XYZ are used to working with ASCII and pseudo-English (no matter how
| inconvenient it might be to generate from their keyboards).

It is the "assuming" part that got us in this I18N dilemma.

| [snip XML]

I really do not think using XML is the way to go, too much crud.
The K.I.S.S. principle should prevail here, especially in kernelland.

| Parsing of command line options (and positional parameters) is also largely
| ad-hoc.  Looking through /usr/src, I see that for the most part, it consists of
| a getopt loop with hand-coded cases, a hand-written usage string, and a
| hand-written man-page-usage.  Much like the XML DTD, it would make sense to
| generically specify (to the extent possible, and with user-defined code to the
| extent not) the syntax and semantics, and generate variable definitions,
| parsing/checking code, usage(), man page synopsis ...  While it would be

Do you realize that this means a rewrite for the 300MB of the src/
that we have now?

| possible to have an expressive grammar for command line options, typically
| the -opts are order-independent, and there are only a few positional parameters
| (or else you put the mess into a configuration file).  There are a variety of
| packages out there, which I am seeking opinions on, not having tried any of
| them:
| [snip *freshmeat* stuff]

I have looked at those, not suitable, and they are GPL.

| any others?
|
| ifconfig seemed to have one of the more enlightened-looking option parsers (an
| array of parameter information processed in a loop, rather than a bunch of

Because it needs to parse many many things.  But why do you need
so-called "smart" parsers when you only have one or two options to parse?

| hard-coded cases) out of several FreeBSD programs I examined ... are there any
| other good examples?

ipfilter.
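
For readers who have not seen the table-driven style being contrasted
with a getopt loop here, the following is a minimal Python sketch of the
idea: one array of option descriptions drives both the parsing and the
usage string. The option table, the `usage` and `parse` names, and the
flags themselves are all hypothetical; this is not how ifconfig or
ipfilter actually do it.

```python
# Hypothetical option table: (flag, takes_value, description).
OPTIONS = [
    ("-v", False, "verbose output"),
    ("-f", True, "configuration file path"),
    ("-p", True, "port to listen on"),
]


def usage(prog):
    """Build the usage line from the same table that drives parsing."""
    parts = [flag + (" value" if takes else "") for flag, takes, _ in OPTIONS]
    return "usage: " + prog + " " + " ".join("[" + p + "]" for p in parts)


def parse(argv):
    """Return (options, positional_args) by looping over the table."""
    table = {flag: takes for flag, takes, _ in OPTIONS}
    opts, args = {}, []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg not in table:
            if arg.startswith("-"):
                raise ValueError("unknown option " + arg)
            args = list(argv[i:])  # first non-option ends option parsing
            break
        if table[arg]:
            if i + 1 >= len(argv):
                raise ValueError("option " + arg + " needs a value")
            opts[arg] = argv[i + 1]
            i += 2
        else:
            opts[arg] = True
            i += 1
    return opts, args


if __name__ == "__main__":
    print(usage("exampled"))
    print(parse(["-v", "-f", "example.conf", "input.txt"]))
```

Adding an option means adding one row; the usage text cannot drift out
of step with the parser, which is the point being argued for above.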
| It's also amusing to see how many different ways various servers in the tree can
| open a configuration file (path read from command line), write a pid file (path
| read from command line), daemonize, read an IP address/hostname and port (read
| from command line) and listen there, mask nonfatal signals, relinquish

It happens in a large code base.  However, to rewrite all of that
takes many many man-hours.  I really do not think we are up to that.

| privileges - although I appreciate that different servers want to do things
| slightly differently.  Naturally, each of us is easily able to reuse our own
| code (preferably by libraries/macros/#include rather than copy/paste), but I
| think that there is a lot of common configuration/command-line code that could
| be coalesced behind a good-enough-extensible interface that we could reuse code

Glad to hear that people care about I18N.

--
+------------------------------------------------------------------+
| keichii@peorth.iteration.net         | keichii@bsdconspiracy.net |
| http://peorth.iteration.net/~keichii | Yes, BSD is a conspiracy. |
+------------------------------------------------------------------+

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-i18n" in the body of the message

From owner-freebsd-i18n Wed Feb 28 22: 2: 8 2001
Delivered-To: freebsd-i18n@freebsd.org
Received: from peorth.iteration.net (peorth.iteration.net [208.190.180.178])
    by hub.freebsd.org (Postfix) with ESMTP id AE32C37B719;
    Wed, 28 Feb 2001 22:02:01 -0800 (PST)
    (envelope-from keichii@peorth.iteration.net)
Received: by peorth.iteration.net (Postfix, from userid 1001)
    id 96CE85955B; Thu, 1 Mar 2001 00:02:07 -0600 (CST)
Date: Thu, 1 Mar 2001 00:02:07 -0600
From: "Michael C . Wu"
To: Terry Lambert
Cc: Jonathan Graehl, freebsd-Arch, i18n@freebsd.org
Subject: Re: Unicode, command line options, and configuration files, oh my!
Message-ID: <20010301000207.C4359@peorth.iteration.net>
Reply-To: "Michael C .
Wu"
Mail-Followup-To: "Michael C . Wu", Terry Lambert, Jonathan Graehl, freebsd-Arch, i18n@freebsd.org
References: <200103010541.WAA17385@usr05.primenet.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <200103010541.WAA17385@usr05.primenet.com>; from tlambert@primenet.com on Thu, Mar 01, 2001 at 05:41:22AM +0000
X-PGP-Fingerprint: 5025 F691 F943 8128 48A8 5025 77CE 29C5 8FA1 2E20
X-PGP-Key-ID: 0x8FA12E20
Sender: owner-freebsd-i18n@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Use -i18n please. ")

On Thu, Mar 01, 2001 at 05:41:22AM +0000, Terry Lambert scribbled:
| [ ... Unicode ... ]
|
| UTF encoded data is not fixed length in size.
|
| POSIX specifies that file names can be up to 256 characters.
|
| 256 characters UTF-8 encoded can vary from 256 to 1280
| bytes.
|
| In general, this means that for Unicode data stored for
| directory entries would require that a directory entry
| block would have to be 512b, whereas for UTF-8, we are
| talking 2048b (2k).
|
| If the same approach is used as the current UFS code uses,
| then these operations will need to be directory entry block
| atomic.

In short, we can save the file name that the user sees
with the file data.  The filesystem and the kernel see
some other naming scheme determined by the FS/kernel.

| FS stuff aside, most programs should use internal encoding.
|
| For FS storage, fixed data records are also a problem, when
| using UTF-8 encoding.  The same goes for the ability to
| store fixed size input forms field data in databases, which
| like constraints set on record sizes.
|
|
| > There doesn't seem to be any impetus to systematically adopt
| > Unicode (especially the fixed-two-bytes-per-char variant,
| > which for most cases would simply double the storage/bandwidth
| > requirement), although there are user-applications which
| > operate on multibyte text.
|
| UTF-8 is one byte per character for US ASCII, two bytes for
| the high page (128 characters) of ISO 8859-1, and three or more
| bytes for anything else.

Bad design. Period.

| The idea that storage requirements increase is U.S. centric;
| all other character sets are penalized at least as much as if
| it were directly encoded instead of multibyte encoded, and
| the vast majority more penalized.

Yup, bad design. :)

| On top of that, we have Microsoft and Java interoperability to
| consider, distasteful as that may be to some.

M$ has a pretty good implementation here.
Java I18N sucks really bad.

| There's an interesting list of Unicode resources available at:
| http://www.unicode.org/unicode/onlinedat/products.html

--
+------------------------------------------------------------------+
| keichii@peorth.iteration.net         | keichii@bsdconspiracy.net |
| http://peorth.iteration.net/~keichii | Yes, BSD is a conspiracy. |
+------------------------------------------------------------------+

From owner-freebsd-i18n Wed Feb 28 22:45:20 2001
Delivered-To: freebsd-i18n@freebsd.org
Received: from areilly.bpc-users.org (CPE-144-132-234-126.nsw.bigpond.net.au [144.132.234.126])
    by hub.freebsd.org (Postfix) with SMTP id 8D1CA37B719 for ;
    Wed, 28 Feb 2001 22:45:14 -0800 (PST)
    (envelope-from areilly@bigpond.net.au)
Received: (qmail 65096 invoked by uid 1000); 1 Mar 2001 06:45:13 -0000
From: "Andrew Reilly"
Date: Thu, 1 Mar 2001 17:45:13 +1100
To: "Michael C . Wu"
Cc: Terry Lambert, Jonathan Graehl, freebsd-Arch, i18n@FreeBSD.ORG
Subject: Re: Unicode, command line options, and configuration files, oh my!
Message-ID: <20010301174513.A65013@gurney.reilly.home>
References: <200103010541.WAA17385@usr05.primenet.com> <20010301000207.C4359@peorth.iteration.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <20010301000207.C4359@peorth.iteration.net>; from keichii@iteration.net on Thu, Mar 01, 2001 at 12:02:07AM -0600
Sender: owner-freebsd-i18n@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Thu, Mar 01, 2001 at 12:02:07AM -0600, Michael C . Wu wrote:
> Terry wrote:
> | In general, this means that for Unicode data stored for
> | directory entries would require that a directory entry
> | block would have to be 512b, whereas for UTF-8, we are
> | talking 2048b (2k).

It would still have to be larger than 512b using a 16-bit
encoding, wouldn't it?

> | If the same approach is used as the current UFS code uses,
> | then these operations will need to be directory entry block
> | atomic.
>
> In short, we can save the file name that the user sees
> with the file data.  The filesystem and the kernel sees
> some other naming scheme determined by the FS/kernel.

How do you propose to do that and still maintain Unix inode/link
semantics?  There isn't (necessarily) only one file name that
the user sees, but there _is_ only one lump of file data.

> | On top of that, we have Microsoft and Java interoperability to
> | consider, distasteful as that may be to some.
>
> M$ has a pretty good implementation here.
> Java I18N sucks really bad.

Could you give a quick description of why one of these is good
and the other bad, for the benefit of someone who knows
neither?
--
Andrew

From owner-freebsd-i18n Thu Mar 1 7:50:46 2001
Delivered-To: freebsd-i18n@freebsd.org
Received: from peorth.iteration.net (peorth.iteration.net [208.190.180.178])
    by hub.freebsd.org (Postfix) with ESMTP id AEFAA37B718;
    Thu, 1 Mar 2001 07:50:42 -0800 (PST)
    (envelope-from keichii@peorth.iteration.net)
Received: by peorth.iteration.net (Postfix, from userid 1001)
    id 59BA95955D; Thu, 1 Mar 2001 09:50:49 -0600 (CST)
Date: Thu, 1 Mar 2001 09:50:49 -0600
From: "Michael C . Wu"
To: Andrew Reilly
Cc: Terry Lambert, Jonathan Graehl, asmodai@FreeBSD.ORG, i18n@FreeBSD.ORG
Subject: Re: Unicode, command line options, and configuration files, oh my!
Message-ID: <20010301095049.A10822@peorth.iteration.net>
Reply-To: "Michael C . Wu"
References: <200103010541.WAA17385@usr05.primenet.com> <20010301000207.C4359@peorth.iteration.net> <20010301174513.A65013@gurney.reilly.home>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <20010301174513.A65013@gurney.reilly.home>; from areilly@bigpond.net.au on Thu, Mar 01, 2001 at 05:45:13PM +1100
X-PGP-Fingerprint: 5025 F691 F943 8128 48A8 5025 77CE 29C5 8FA1 2E20
X-PGP-Key-ID: 0x8FA12E20
Sender: owner-freebsd-i18n@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Thu, Mar 01, 2001 at 05:45:13PM +1100, Andrew Reilly scribbled:
| On Thu, Mar 01, 2001 at 12:02:07AM -0600, Michael C . Wu wrote:
| > Terry wrote:
| > | In general, this means that for Unicode data stored for
| > | directory entries would require that a directory entry
| > | block would have to be 512b, whereas for UTF-8, we are
| > | talking 2048b (2k).
|
| It would still have to be larger than 512b using a 16-bit
| encoding, wouldn't it?

Yes, and if we are making it larger than 512b, why do we need
to set a limit on ourselves?
| > | If the same approach is used as the current UFS code uses,
| > | then these operations will need to be directory entry block
| > | atomic.
| >
| > In short, we can save the file name that the user sees
| > with the file data.  The filesystem and the kernel sees
| > some other naming scheme determined by the FS/kernel.
|
| How do you propose to do that and still maintain Unix inode/link
| semantics?  There isn't (necessarily) only one file name that
| the user sees, but there _is_ only one lump of file data.

Do you see why nobody has been able to solve all this stuff easily?
I think having a journaling filesystem could solve this.

| > | On top of that, we have Microsoft and Java interoperability to
| > | consider, distasteful as that may be to some.
| >
| > M$ has a pretty good implementation here.
| > Java I18N sucks really bad.
|
| Could you give a quick description of why one of these is good
| and the other bad, for the benefit of someone who knows
| neither?

NTFS gives up the ability to switch charsets on the hard drives.
(It is a pretty good assumption, since most users stay within
two languages.)  And most of the userland tools, even the simple ones,
work with other languages without modifications, when compiled
by Visual Studio.

Java uses a weird scheme to negotiate the contents, where
the server and the client both have to agree on the charset.
Then you have to wrap strings in special functions.  Then you
have to specifically tell Java that the input is "international" input.
bla bla bla....Generally bad design and a big hassle.
(Have you ever seen a Chinese/Japanese/Korean java-enabled website
that _works_?  I have seen very very few.)

--
+------------------------------------------------------------------+
| keichii@peorth.iteration.net         | keichii@bsdconspiracy.net |
| http://peorth.iteration.net/~keichii | Yes, BSD is a conspiracy.
| +------------------------------------------------------------------+

From owner-freebsd-i18n Thu Mar 1 13: 0: 8 2001
Delivered-To: freebsd-i18n@freebsd.org
Received: from smtp10.phx.gblx.net (smtp10.phx.gblx.net [206.165.6.140])
    by hub.freebsd.org (Postfix) with ESMTP id 9554437B719;
    Thu, 1 Mar 2001 12:59:58 -0800 (PST)
    (envelope-from tlambert@usr05.primenet.com)
Received: (from daemon@localhost)
    by smtp10.phx.gblx.net (8.9.3/8.9.3) id NAA76472;
    Thu, 1 Mar 2001 13:59:35 -0700
Received: from usr05.primenet.com(206.165.6.205)
    via SMTP by smtp10.phx.gblx.net, id smtpdem4Fqa; Thu Mar 1 13:59:24 2001
Received: (from tlambert@localhost)
    by usr05.primenet.com (8.8.5/8.8.5) id NAA05439;
    Thu, 1 Mar 2001 13:59:43 -0700 (MST)
From: Terry Lambert
Message-Id: <200103012059.NAA05439@usr05.primenet.com>
Subject: Re: Unicode, command line options, and configuration files, oh my!
To: areilly@bigpond.net.au (Andrew Reilly)
Date: Thu, 1 Mar 2001 20:59:43 +0000 (GMT)
Cc: keichii@peorth.iteration.net (Michael C . Wu),
    tlambert@primenet.com (Terry Lambert),
    jonathan@graehl.org (Jonathan Graehl),
    freebsd-arch@FreeBSD.ORG (freebsd-Arch), i18n@FreeBSD.ORG
In-Reply-To: <20010301174513.A65013@gurney.reilly.home> from "Andrew Reilly" at Mar 01, 2001 05:45:13 PM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-i18n@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> > | In general, this means that for Unicode data stored for
> > | directory entries would require that a directory entry
> > | block would have to be 512b, whereas for UTF-8, we are
> > | talking 2048b (2k).
>
> It would still have to be larger than 512b using a 16-bit
> encoding, wouldn't it?

Yes; 1024b; sorry about that, it was an error.
The point was supposed to be that, if you go look at the
directory entry code, it would be a lot easier to implement
1k instead of 2k (we did this before when we ported the
FreeBSD VFS to Windows 95 and supported both the 256 character
Unicode and the 8.3 namespaces simultaneously).

> > | If the same approach is used as the current UFS code uses,
> > | then these operations will need to be directory entry block
> > | atomic.
> >
> > In short, we can save the file name that the user sees
> > with the file data.  The filesystem and the kernel sees
> > some other naming scheme determined by the FS/kernel.
>
> How do you propose to do that and still maintain Unix inode/link
> semantics?  There isn't (necessarily) only one file name that
> the user sees, but there _is_ only one lump of file data.

How do hard links work at all today, under the same conditions?

The directory entry is just a reference to the inode; this is
not like ISO or VFAT, where the directory entry _is_ the inode.

> > | On top of that, we have Microsoft and Java interoperability to
> > | consider, distasteful as that may be to some.
> >
> > M$ has a pretty good implementation here.
> > Java I18N sucks really bad.
>
> Could you give a quick description of why one of these is good
> and the other bad, for the benefit of someone who knows
> neither?

My take on this, which may not be the same as his, is that
the Microsoft implementation uses the processing representation
as the storage representation, whereas Java uses UTF-8 for the
storage representation.  Java also deals in strings composed of
"bytes" instead of strings composed of "characters", which makes
string processing problematic, if the string is an I18N string;
consider that it has no functions similar to XPG/4 mbtowc() or
other interning/externing functions that it would use to deal
with them.
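
The bytes-versus-characters distinction can be made concrete with a
small Python sketch. It counts code points in a UTF-8 byte string by
skipping continuation bytes, which is roughly the bookkeeping a C
program gets from stepping through a string with mbtowc().
`utf8_char_count` is a hypothetical helper written for this note, and
it assumes its input is already valid UTF-8.

```python
def utf8_char_count(raw: bytes) -> int:
    """Count code points in valid UTF-8 by skipping continuation bytes.

    Continuation bytes have the bit pattern 10xxxxxx; every other byte
    starts a new character.
    """
    return sum(1 for b in raw if (b & 0xC0) != 0x80)


if __name__ == "__main__":
    # "cafe" with an acute accent, a hyphen, then two CJK characters.
    name = "caf\u00e9-\u65e5\u672c"
    raw = name.encode("utf-8")
    print("bytes:", len(raw))
    print("characters:", utf8_char_count(raw))
```

Code that truncates or indexes such a string by byte offset can cut a
character in half, which is why the two counts have to be kept apart.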
It's kind of like the problem with Java letting you instance
objects without a default constructor being required to make
them valid; the JavaMail API is rife with examples of this
type of thing.  You can see it pretty easily, when you try to
write those same interfaces in C++, since C++ doesn't permit
that sort of thing to happen (instancing without initialization
is not possible in C++; there is *always* a default constructor).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

From owner-freebsd-i18n Thu Mar 1 13:15:16 2001
Delivered-To: freebsd-i18n@freebsd.org
Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134])
    by hub.freebsd.org (Postfix) with ESMTP id 2208737B71A;
    Thu, 1 Mar 2001 13:15:12 -0800 (PST)
    (envelope-from tlambert@usr05.primenet.com)
Received: (from daemon@localhost)
    by smtp04.primenet.com (8.9.3/8.9.3) id OAA05290;
    Thu, 1 Mar 2001 14:09:26 -0700 (MST)
Received: from usr05.primenet.com(206.165.6.205)
    via SMTP by smtp04.primenet.com, id smtpdAAAQ9aOik; Thu Mar 1 14:09:10 2001
Received: (from tlambert@localhost)
    by usr05.primenet.com (8.8.5/8.8.5) id OAA06019;
    Thu, 1 Mar 2001 14:14:46 -0700 (MST)
From: Terry Lambert
Message-Id: <200103012114.OAA06019@usr05.primenet.com>
Subject: Re: Unicode, command line options, and configuration files, oh my!
To: keichii@peorth.iteration.net
Date: Thu, 1 Mar 2001 21:14:46 +0000 (GMT)
Cc: areilly@bigpond.net.au (Andrew Reilly),
    tlambert@primenet.com (Terry Lambert),
    jonathan@graehl.org (Jonathan Graehl),
    asmodai@FreeBSD.ORG, i18n@FreeBSD.ORG
In-Reply-To: <20010301095049.A10822@peorth.iteration.net> from "Michael C .
Wu" at Mar 01, 2001 09:50:49 AM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-i18n@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> | > | In general, this means that for Unicode data stored for
> | > | directory entries would require that a directory entry
> | > | block would have to be 512b, whereas for UTF-8, we are
> | > | talking 2048b (2k).
> |
> | It would still have to be larger than 512b using a 16-bit
> | encoding, wouldn't it?
>
> Yes, and if we are making it larger than 512b, why do we need
> to set a limit on ourselves?

Directory entry block I/O is not handled through the normal
VFS code.  This is because the directory entry blocks need
to be modified atomically, and FS blocks can span page boundaries;
for a sufficiently large FS block size, frags can exceed the
page size.  For some architectures, the page size is not 4k.

You need to look at the UFS directory manipulation code in
the /sys/ufs/ufs directory so that you can understand the
problem; while you are at it, look at the fsck and newfs and
other FS utility code which has to deal with directory entry
blocks.  It is not pretty.

It would be nearly impossible to do directory I/O in FS blocks,
and keep it atomic.  There is already the risk of a 1024b
directory entry spanning a track boundary, because we do not
read mode page 2 from SCSI, and prohibit track spanning by
FS objects.

> | How do you propose to do that and still maintain Unix inode/link
> | semantics?  There isn't (necessarily) only one file name that
> | the user sees, but there _is_ only one lump of file data.
>
> Do you see why nobody has been able to solve all this stuff easily?

Wrong; Matt Day, Mark Muhelestein, and myself solved exactly
this problem in exactly the FreeBSD VFS architecture and
exactly the FreeBSD FFS and UFS code back in 1997.

> I think having a journaling filesystem could solve this.

So can UFS/FFS.
Journalling has nothing to do with the underlying
problem here, which is conversion from a fixed length storage
to a variable length storage, where the underlying media has
fixed length blocks into which you have to map things.

Consider a CDROM FS for music and video, running in a file
set up as a device.  The blocks of such an FS could not be
aligned within a page, since they are odd sized.  How do you
mmap() an object in such an FS?

> NTFS gives up the ability to switch charsets in the harddrives.
> (It is a pretty good assumption, since most users stay within
> two languages.)  And most of the userland tools, even the simple ones,
> work with other languages without modifications, when compiled
> by Visual Studio.

The OLE character types are 16 bit.  Some of these interfaces
are not available in all WIN32.DLL implementations.

> Java uses a weird scheme to negotiate the contents, where
> the server and the client both have to agree in the charset.
> Then you have to wrap strings in special functions.  Then you
> have to specifically tell java that the input is "international" input.
> bla bla bla....Generally bad design and a big hassle.
> (Have you ever seen a Chinese/Japanese/Korean java-enabled website
> that _works_?  I have seen very very few.)

That's because it considers any I/O to be externalization;
that's a stupid assumption.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
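
The 512b, 1024b and 2048b figures traded back and forth in this thread
can be reproduced with a few lines of arithmetic. The Python sketch
below is only an illustration under stated assumptions: an 8-byte fixed
header per directory entry, a block size rounded up to the next power
of two, the thread's 256-character name limit, and the thread's
worst case of 5 bytes per character for UTF-8 (256 to 1280 bytes).
`block_size_for` is a hypothetical helper, not code from the UFS
sources.

```python
HEADER_BYTES = 8      # assumed fixed per-entry overhead (inode, lengths, type)
MAX_NAME_CHARS = 256  # name length limit used in the thread


def block_size_for(bytes_per_char: int) -> int:
    """Smallest power-of-two block that holds one maximal directory entry."""
    need = HEADER_BYTES + MAX_NAME_CHARS * bytes_per_char
    size = 1
    while size < need:
        size *= 2
    return size


if __name__ == "__main__":
    for label, width in (
        ("8-bit names", 1),
        ("16-bit Unicode", 2),
        ("UTF-8, 5 bytes/char worst case", 5),
    ):
        print(label, block_size_for(width))
```

Under those assumptions a 16-bit name no longer fits in 512 bytes once
the header is counted, which is why the corrected figure is 1024b, and
the UTF-8 worst case lands at 2048b.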