Date: Wed, 22 May 1996 14:06:39 -0700 (MST)
From: Terry Lambert <terry@lambert.org>
To: kline@tera.com (Gary Kline)
Cc: terry@lambert.org, kline@tera.com, freebsd-questions@freebsd.org
Subject: Re: Utilities and POSIX compliance....
Message-ID: <199605222106.OAA05021@phaeton.artisoft.com>
In-Reply-To: <199605221823.LAA23578@athena.tera.com> from "Gary Kline" at May 22, 96 11:23:53 am

> > > `wc' is missing the -m (multibyte) flag, and I expect that
> > > others of the language/locale-specific utilities are missing
> > > these hooks.
> >
> > "Multibyte" is an evil, evil implementation of internationalization
> > (the process of making software localizable to a particular locale
> > using only data and environment, not code changes).  It doesn't
> > deal at all with multinationalization (the process of making software
> > capable of simultaneously operating in several locales, generally
> > useful only for translators and language scholars).
>
> I'll buy your second premise, but not necessarily
> your first.  Making everything multinational
> would probably take man-decades.  Do you have a
> better idea?  Better meaning realistically doable.

Yes.

Attribute data streams as 8 bit or 16 bit sources, and do translation
from 16 bit Unicode to a local round-trip character set in the kernel.

This would let you not use runic encoding, yet still maintain backward
compatibility with NFS servers and disk volumes with no "IS_UNICODE"
attribute.

If you want a layman's explanation, think of it as disk to VM buffer
level 1->2 block level decompression/compression.

For i18n, no file contents need change (runic encoding requires all
files not written in vanilla US ASCII to get larger).

The utilities all go to wchar_t (which I really would recommend be
16 bit Unicode; code pages other than 16:0 in ISO 10646 have yet to
be defined; they are an appeasement mechanism for ethnic purists who
want characters attributed by language of origin for unified glyph
sets anyway -- not a technical issue at all).

The technical reasoning for a 16 bit wchar_t: that's what Microsoft
uses for NT and Windows 95; there is really no reason to shoot
ourselves in the porting-NT/95-code-to-FreeBSD foot, right?

Going to wchar_t for all utilities would be rather easy; the biggest
pain would be termios on tty's, vty's, and pty's.
The vty code wants to be handled using user space session management
on pty's anyway, which leaves tty's and pty's.  The pty code wants to
be rewritten to use device cloning using the devfs.  So really, the
only legacy code we really care about is tty's.

Since there are no Unicode terminals, the stat stream will always have
the 8 bit attribution.  There is still a need to set a local character
set to Unicode translator at the device level for ISO character set
specific terminals for round-tripping.  This is handled via two 256
byte tables and an extent index, or at most, a 64k sparse table for
something like ISO-2022 for KanjiHand or other NEC/DOSV input devices
acting as terminals.

The LOCALE variable is still used -- but only for message catalogs.

Very low real overhead, in other words.

The move from char to wchar_t is pretty simple, at least for those of
us with cscope.  It can be nearly automated for most code.

As an initial assumption, the termios code could assume an 8 bit input
method.  In the final implementation, the input method should be
largely irrelevant -- it's a device tagging issue.

This also has the effect of:

1)	Fixed field input buffers are still fixed length and
	mathematically related to field length.
2)	wctomb and similar crap "goes away".
3)	Local settings "go away".  The issue is one of font on
	your display device vs. round tripping of lexical values
	for character encoding using small tables.
4)	It's possible to have multiple languages in use by
	different users on the same system.
5)	Fixed field storage for fixed input fields no longer
	requires variable length records for actual data storage.
6)	The length of a data file is still meaningful relative to
	character count, or dividing record length into file length
	to get a record count.

All positive wins.

> > > I'd like to know why more of the Berkeley utilities aren't
> > > POSIX-compliant.
> > > That is, why, without some minor--or even
> > > major--hacks, these utilities haven't been brought up to
> > > standard.  The BSD kernel is A++, but not the utils... .
> >
> > I believe they are all i18n.  The general consensus is to not
> > POSIX'ify if there will be a significant loss of functionality,
> > or if doing so would mean moving from a BSD source to a GPL'ed one.
>
> How would adding more of the POSIX standards
> cause a loss of functionality??  From what is
> in the 4.4final release of BSD (1993/4), most
> of the utility set are, worst case, missing only
> a few flags.

XPG/4 uses runic encoding, which is inherently flawed.  Complying with
this portion of POSIX would be a grave error.  It would set back true
internationalization compatible with Unicode 1.x standard compound
document architecture based multinationalization.

POSIX supports only operation in a single locale for any application.
This means, for instance, it would be near impossible to build an
application for use by message catalog translators without going
outside the POSIX standard, and reimplementing multisession XPG/4
(which POSIX does not specify) at great cost in time and effort.

It's well known that all i18n 8-bit character sets have support for
US ASCII (at least the ISO 8859 sets do), as do some of the 16 bit
character sets (most notably JIS 208 + JIS 212, which in aggregate
actually supports 21 of the most common languages).

The reason for not simply using JIS 208 + JIS 212 with ISO 2022 shift
encoding is that Unicode in general, and the 16:0 codepage of ISO 10646
in particular, support much, much more than 21 human languages.  For
instance, 1/5th of the world's population is in India, yet JIS 208 +
JIS 212 do not provide any support whatsoever for Indic scripts, like
Tamil, Devanagari, etc.

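As a rough illustration of the storage argument against runic encoding, here is a minimal C sketch comparing the size of the same text in a variable-length multibyte encoding and in fixed 16 bit units.  It assumes UTF-8 as a representative multibyte encoding (the message does not name one), and the function names `utf8_len`, `multibyte_size`, `fixed_size` and `record_count` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bytes needed to store one 16 bit code point in UTF-8.
 * UTF-8 is an assumption here; the surrogate range is ignored.
 */
size_t utf8_len(uint16_t cp)
{
    if (cp < 0x80)
        return 1;       /* US ASCII range */
    if (cp < 0x800)
        return 2;       /* e.g. accented Latin letters */
    return 3;           /* rest of the 16 bit range, e.g. Tamil */
}

/* Total bytes for n code points stored in the multibyte encoding. */
size_t multibyte_size(const uint16_t *text, size_t n)
{
    size_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += utf8_len(text[i]);
    return total;
}

/* Total bytes for n code points stored as fixed 16 bit units. */
size_t fixed_size(size_t n)
{
    return n * sizeof(uint16_t);
}

/*
 * Record count of a file made of fixed-width records.  This division
 * is only meaningful when every record has the same byte length.
 */
size_t record_count(size_t file_bytes, size_t chars_per_record)
{
    size_t record_bytes = fixed_size(chars_per_record);

    assert(record_bytes != 0 && file_bytes % record_bytes == 0);
    return file_bytes / record_bytes;
}
```

With these definitions, three Tamil letters (U+0BA4, U+0BAE, U+0BBF) take 9 bytes in the multibyte form and 6 bytes as 16 bit units, while three ASCII letters take 3 bytes in the multibyte form.  The byte length of a multibyte field therefore depends on its contents, whereas with fixed units `record_count` can recover the number of records from the file length alone.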
For right now, we should ignore the allocation of the Unicode "private
use space" areas relative to character sets requiring ligatured fonts
(Hebrew, Arabic, Tamil, Devanagari, Sanskrit, etc.), even though this
seriously biases against fixed cell rendering technologies, like that
used by sconsole/pccons and X windows (it's understandable: Taligent is
seriously biased toward PostScript).  There are non-trivial workarounds
(like "xtamil") if that becomes an issue.

> Before remembering the FSF's work, I hacked some
> of the BSD utilities into compliance.  Then
> found that GNU has the majority of the utilities
> re-written.  The code ought to parallelize nicely,
> and even if not, having the POSIX compliance
> shouldn't cause any functional degradation.
> (Speculation:: I haven't tested my GNU ports yet.)

The problem is in the limitations imposed by the POSIX concept of
multibyte, and has nothing whatsoever to do with the quality of the
code used to code to the XPG/3 and/or XPG/4 interfaces to implement it.

Sorry if that wasn't clear.

> > I believe most of the Lite2 code has not been integrated -- there
> > are supposedly some serious strides towards POSIX in some of the
> > unintegrated code.
>
> Thanks for the tip.  Do you know if it is the
> Lite2 code on the Walnut Creek CD?  It might be
> a big win to have the latest version of the Lite
> release around.  BTW, am I right to assume that
> Lite itself is dead?  Can't imagine anyone hacking
> on that stuff, but then... .

The code is being slowly integrated into the main line source tree.
I have no idea about the user space stuff, really, since I am mostly
a kernel geek.  8-).

I know that Lite2 is available on CD from Walnut Creek, and I know
that it's on line on freefall (or was, starting about a year ago when
I first submitted my FS patches, some of which were intended to pave
the way for support of Unicode directory name spaces for FS's).

Regards,
Terry Lambert
terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.