Date: Thu, 11 Jun 1998 11:41:33 +0600
From: Konstantin Chuguev
Organization: South Ural Regional Center of FREEnet
To: Terry Lambert
CC: Gary Kline, hackers@FreeBSD.ORG
Subject: Re: internationalization
Message-ID: <357F6E0D.FE51B0B2@urc.ac.ru>
References: <199806102155.OAA13862@usr01.primenet.com>

Terry Lambert wrote:

> I would prefer going to a full-on Unicode implementation to support
> all known human languages.

I agree, not least because Unicode is not just a character set or a
subset of ISO 10646, but also a database of character mnemonic names,
collation rules, bidirectional writing rules, and uppercasing,
lowercasing and transliteration rules. This is hugely important in
text processing. I am afraid ISO 2022 lacks these capabilities.

Another point is the very simple conversion mechanism between UTF-8
and UCS-2/UCS-4, i.e. between the multibyte and the wide character
encodings. We need both encodings: the first for ASCII compatibility
(and for compatibility with C zero-terminated (char *) strings), and
the second for fast searching and sorting.

> I would suggest an initial 16 bit wchar_t with an assumption of a
> zero valued code page designator.
> If ISO ever gets around to adding
> other code pages, we can deal with that at that time using page
> selection.  Meanwhile, we'll be able to interoperate with Microsoft
> and JAVA, which use 16 bit wchar_t encodings.
>
> I think the first (and hardest) step is the shells.  The shells need
> to be internationalized based on the fact that they (can) interpret
> exit codes to the user as error messages.
>
> The last time I converted csh, this was absolute hell because the
> code was badly organized for internationalization.
>
> The next hardest step is the editors, starting with "vi".  They have
> to be able to support Unicode.

That consists of two levels: the character set level (wchar, mbyte,
conversion, the locale's LANG etc.) and message catalogues (the
locale's LANG). IMO, the second should be done only after the first
has been worked out precisely.

> I have had FS-based Unicode support working for a very long time,
> though it has failed to be committed.  One big issue is that directory
> entry blocks must grow from 512b to 1k.  This has a number of
> implications to the soft updates work currently in progress.  This is
> because, in order to support a maximally sized path component, 512 + 24
> bytes is needed for Unicode, as opposed to 256 + 24 (which fits in 512b)
> for an 8 bit character set.

Do you mean processing UCS-2 in the kernel (at the FS level)? I ask
because any application expects file names to be zero-terminated
strings of 8-bit characters. It does not matter whether the charset is
ASCII or some multibyte one. So we would then need a conversion
between UCS-2 and UTF-8 (or probably the locale's charset) in the
kernel.

> If we were to do something stupid, like UTF-7 or UTF-8, it would have
> to grow to 5 * 256 + 24, minimally, to support 5:1 character expansion
> possible, as opposed to the 2:1 of flat Unicode encoding.
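
To illustrate how small such a conversion between UTF-8 and a 16-bit
wide character can be, here is a minimal sketch in C. It is only an
illustration under stated assumptions: the names ucs2_to_utf8() and
utf8_to_ucs2() are hypothetical and are not existing FreeBSD
interfaces, only 16-bit values are handled, surrogate values get no
special treatment, and nothing here is kernel-ready. With 16-bit
values, each character takes one to three bytes in UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode one 16-bit code value as UTF-8.
 * Writes 1 to 3 bytes into out and returns the number written.
 */
static size_t
ucs2_to_utf8(uint16_t wc, unsigned char out[3])
{
	assert(out != NULL);
	if (wc < 0x80) {			/* 7 bits: plain ASCII byte */
		out[0] = (unsigned char)wc;
		return 1;
	}
	if (wc < 0x800) {			/* 11 bits: 110xxxxx 10xxxxxx */
		out[0] = (unsigned char)(0xC0 | (wc >> 6));
		out[1] = (unsigned char)(0x80 | (wc & 0x3F));
		return 2;
	}
	/* 16 bits: 1110xxxx 10xxxxxx 10xxxxxx */
	out[0] = (unsigned char)(0xE0 | (wc >> 12));
	out[1] = (unsigned char)(0x80 | ((wc >> 6) & 0x3F));
	out[2] = (unsigned char)(0x80 | (wc & 0x3F));
	return 3;
}

/*
 * Hypothetical helper: decode one UTF-8 sequence of up to 3 bytes.
 * Returns the number of bytes consumed, or 0 if the input is
 * truncated, malformed, overlong, or does not fit in 16 bits.
 * *wc may be overwritten even when 0 is returned.
 */
static size_t
utf8_to_ucs2(const unsigned char *in, size_t len, uint16_t *wc)
{
	assert(in != NULL && wc != NULL);
	if (len >= 1 && in[0] < 0x80) {
		*wc = in[0];
		return 1;
	}
	if (len >= 2 && (in[0] & 0xE0) == 0xC0 && (in[1] & 0xC0) == 0x80) {
		*wc = (uint16_t)(((in[0] & 0x1F) << 6) | (in[1] & 0x3F));
		return (*wc >= 0x80) ? 2 : 0;	/* reject overlong forms */
	}
	if (len >= 3 && (in[0] & 0xF0) == 0xE0 &&
	    (in[1] & 0xC0) == 0x80 && (in[2] & 0xC0) == 0x80) {
		*wc = (uint16_t)(((in[0] & 0x0F) << 12) |
		    ((in[1] & 0x3F) << 6) | (in[2] & 0x3F));
		return (*wc >= 0x800) ? 3 : 0;	/* reject overlong forms */
	}
	return 0;
}
```

For example, the Cyrillic capital letter ZHE (0x0416) encodes to the
two bytes 0xD0 0x96, and decoding those two bytes gives 0x0416 back.
ASCII bytes pass through unchanged, which is what keeps zero-terminated
(char *) file names working.
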
>
> For character set attributed FS's (like NFS v2/v3 will have to be), you
> can do the translation in the kernel on the blocks on their way out
> (a 2:1 expansion in memory of a 1:1 disk image for a given ISO character
> set attribution for the filesystem).

That is another reason for including conversion routines in the kernel.

--
Konstantin V. Chuguev.          System administrator of Southern
http://www.urc.ac.ru/~joy/      Ural Regional Center of FREEnet,
mailto:joy@urc.ac.ru            Chelyabinsk, Russia.