Date:      Wed, 22 May 1996 14:06:39 -0700 (MST)
From:      Terry Lambert <terry@lambert.org>
To:        kline@tera.com (Gary Kline)
Cc:        terry@lambert.org, kline@tera.com, freebsd-questions@freebsd.org
Subject:   Re: Utilities and POSIX compliance....
Message-ID:  <199605222106.OAA05021@phaeton.artisoft.com>
In-Reply-To: <199605221823.LAA23578@athena.tera.com> from "Gary Kline" at May 22, 96 11:23:53 am

> > >     `wc' is missing the -m (multibyte) flag, and I expect that 
> > >     other of the language/locale-specific utilities are missing
> > >     these hooks.
> > 
> > "Multibyte" is an evil, evil implementation of internationalization
> > (the process of making software localizable to a particular locale
> > using only data and environment, not code changes).  It doesn't
> > deal at all with multinationalization (the process of making software
> > capable of simultaneously operating in several locales, generally
> > useful only for translators and language scholars).
> 
> 		
> 		I'll buy your second premise, but not necessarily
> 		your first.  Making everything multinational 
> 		would probably take man-decades.  Do you have a 
> 		better idea?  Better meaning realistically doable.

Yes.  Attribute data streams as 8 bit or 16 bit sources, and do
translation from 16 bit Unicode to a local round-trip character
set in the kernel.

This would let you not use runic encoding, yet still maintain
backward compatibility with NFS servers and disk volumes with
no "IS_UNICODE" attribute.

If you want a layman's explanation, consider it as disk
to VM buffer level 1->2 block level decompression/compression.

For i18n, no file contents need change (runic encoding requires
all files not written in vanilla US ASCII to get larger).
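
A minimal sketch of the size argument, assuming UTF-8 as the
variable-length encoding (the text above does not name one) and an
ISO 8859-1 file as the input.  The function names are hypothetical.
It only counts how many bytes the same text would need after
re-encoding:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one 16 bit code point in UTF-8.
 * Surrogate values are not treated specially in this sketch. */
size_t utf8_len(uint16_t cp)
{
    if (cp < 0x80)
        return 1;       /* US ASCII stays one byte */
    if (cp < 0x800)
        return 2;       /* e.g. the upper half of ISO 8859-1 */
    return 3;           /* everything else in the 16 bit range */
}

/* Size an ISO 8859-1 buffer would have after conversion to UTF-8.
 * ISO 8859-1 byte values equal their Unicode code points. */
size_t utf8_size_of_latin1(const unsigned char *buf, size_t n)
{
    size_t i, total = 0;

    assert(buf != NULL);
    for (i = 0; i < n; i++)
        total += utf8_len(buf[i]);
    return total;
}
```

A pure ASCII buffer keeps its size; any byte above 0x7F adds one
byte, so existing 8 bit files would have to be rewritten and grow.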

The utilities all go to wchar_t (which I really would recommend
be 16 bit Unicode; code pages other than 16:0 in ISO 10646 have
yet to be defined; they are an appeasement mechanism for ethnic
purists who want characters attributed by language of origin for
unified glyph sets anyway -- not a technical issue at all).  The
technical reasoning for a 16 bit wchar_t: that's what Microsoft
uses for NT and Windows 95; there is really no reason to shoot
ourselves in the porting-NT/95-code-to-FreeBSD foot, right?

Going to wchar_t for all utilities would be rather easy; the biggest
pain would be termios on tty's, vty's, and pty's.  The vty code
wants to be handled using user space session management on pty's
anyway, which leaves tty's and pty's.  The pty code wants to be
rewritten to use device cloning using the devfs.  So really, the
only legacy code we really care about is tty's.

Since there are no Unicode terminals, that stream will always
have the 8 bit attribution.  There is still a need to set a local
character set to Unicode translator at the device level for ISO
character set specific terminals for round-tripping.  This is
handled via two 256 byte tables and an extent index, or at most,
a 64k sparse table for something like ISO-2022 for KanjiHand
or other NEC/DOSV input devices acting as terminals.
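
A rough sketch of table-driven round-tripping, under several
assumptions: the terminal character set is ISO 8859-7 (Greek), only
the ASCII range and two runs of Greek letters are filled in, and the
forward table holds full 16 bit values for simplicity instead of a
256 byte table.  All names here are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define EXTENT_HIGH 0x03        /* high byte of the one Unicode extent covered */
#define UNMAPPED    0xFFFF      /* marker for local bytes left out of the sketch */

static uint16_t to_uni[256];    /* local byte -> 16 bit Unicode value */
static uint8_t  from_uni[256];  /* low byte within the extent -> local byte */

void init_tables(void)
{
    int c;

    for (c = 0; c < 256; c++) {
        to_uni[c] = UNMAPPED;
        from_uni[c] = 0;
    }
    for (c = 0; c < 0x80; c++)          /* US ASCII maps to itself */
        to_uni[c] = (uint16_t)c;
    for (c = 0xC1; c <= 0xD1; c++)      /* capital Alpha .. Rho */
        to_uni[c] = (uint16_t)(c + 0x2D0);
    for (c = 0xE1; c <= 0xF9; c++)      /* small alpha .. omega */
        to_uni[c] = (uint16_t)(c + 0x2D0);

    /* Build the reverse table; a collision would break round-tripping. */
    for (c = 0x80; c < 256; c++) {
        if (to_uni[c] != UNMAPPED && (to_uni[c] >> 8) == EXTENT_HIGH) {
            uint8_t low = (uint8_t)(to_uni[c] & 0xFF);
            assert(from_uni[low] == 0);
            from_uni[low] = (uint8_t)c;
        }
    }
}

uint16_t local_to_unicode(uint8_t c)
{
    return to_uni[c];
}

/* Returns the local byte, or -1 if the value has no local equivalent. */
int unicode_to_local(uint16_t u)
{
    if (u < 0x80)
        return (int)u;
    if ((u >> 8) == EXTENT_HIGH && from_uni[u & 0xFF] != 0)
        return from_uni[u & 0xFF];
    return -1;
}
```

After init_tables(), every byte the terminal can send goes to
Unicode and comes back as the same byte; anything outside the extent
is reported as unmappable instead of being silently changed.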

The LOCALE variable is still used -- but only for message catalogs.

Very low real overhead, in other words.



The move from char to wchar_t is pretty simple, at least for those
of us with cscope.  It is capable of near automation for most
code.
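
As an illustration of how mechanical the change usually is, here is
a sketch of a hypothetical in-place string reversal after conversion.
The narrow version would differ only in using char, strlen() and ""
literals where this one uses wchar_t, wcslen() and L"" literals:

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Reverse a wide string in place.  Because every element is one
 * whole character, swapping elements never splits a character. */
void wreverse(wchar_t *s)
{
    size_t i, n;

    assert(s != NULL);
    n = wcslen(s);
    for (i = 0; i < n / 2; i++) {
        wchar_t t = s[i];

        s[i] = s[n - 1 - i];
        s[n - 1 - i] = t;
    }
}
```

The same loop applied to a multibyte char string would scramble any
character that occupies more than one byte.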



As an initial assumption, the termios code could assume an 8 bit
input method.  In the final implementation, the input method should
be largely irrelevant -- it's a device tagging issue.



This also has the effect of:

1)	Fixed field input buffers are still fixed length and
	mathematically related to field length.

2)	wctomb and similar crap "goes away".

3)	Local settings "go away".  The issue is one of font on
	your display device vs. round tripping of lexical values
	for character encoding using small tables.

4)	It's possible to have multiple languages in use by different
	users on the same system.

5)	Fixed field storage for fixed input fields no longer requires
	variable length records for actual data storage.

6)	The length of a data file is still meaningful relative to
	character count, or dividing record length into file
	length to get a record count.
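
Points 1, 5 and 6 can be sketched with a hypothetical record layout,
assuming every character is stored as one 16 bit unit.  The field
sizes and names are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIELD_CHARS 20          /* hypothetical field width in characters */

/* Fixed-length record: each field holds exactly FIELD_CHARS characters. */
struct record {
    uint16_t name[FIELD_CHARS];
    uint16_t city[FIELD_CHARS];
};

/* Number of records in a file, from its length alone. */
size_t record_count(size_t file_len)
{
    assert(file_len % sizeof(struct record) == 0);
    return file_len / sizeof(struct record);
}

/* Number of characters in a file, from its length alone. */
size_t char_count(size_t file_len)
{
    return file_len / sizeof(uint16_t);
}
```

With a multibyte encoding neither division works, since the number
of bytes per character depends on the text itself.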


All positive wins.


> > >     I'd like to know why more of the Berkeley utilities aren't
> > >     POSIX-compliant.  That is, why, without some minor--or even
> > >     major--hacks, these utilities haven't been brought up to 
> > >     standard.  The BSD kernel is A++, but not the utils... .
> > 
> > I believe they are all i18n.  The general consensus is to not
> > POSIX'ify if there will be a significant loss of functionality,
> > or if doing so would mean moving from a BSD source to a GPL'ed one.
> 
> 
> 		How would adding more of the POSIX standards
> 		cause a loss of functionality??  From what is 
> 		in the 4.4final release of BSD (1993/4), most
> 		of the utility set are, worst case, missing only
> 		a few flags.

XPG/4 uses runic encoding, which is inherently flawed.  Complying
with this portion of POSIX would be a grave error.  It would set
back true internationalization compatible with multinationalization
based on the Unicode 1.x standard compound document architecture.
POSIX supports only operation in a single locale for any application.

This means, for instance, it would be near impossible to build
an application for use by message catalog translators without
going outside the POSIX standard, and reimplementing multisession
XPG/4 (which POSIX does not specify) at great cost in time and
effort.


It's well known that all i18n 8-bit character sets have support
for US ASCII (at least the ISO 8859 sets do), as do some of the
16 bit character sets (most notably JIS 208 + JIS 212, which in
aggregate actually supports 21 of the most common languages).

The reason for not simply using JIS 208 + JIS 212 with ISO 2022
shift encoding is that Unicode in general, and the 16:0 codepage
of ISO 10646 in particular, support much, much more than 21
human languages.  For instance, 1/5th of the world's population
is in India, yet JIS 208 + JIS 212 do not provide any support
whatsoever for Indic scripts, like Tamil, Devanagari, etc.

For right now, we should ignore the allocation of the Unicode
"private use space" areas relative to character sets requiring
ligatured fonts (Hebrew, Arabic, Tamil, Devanagari, Sanskrit, etc.),
even though this seriously biases against fixed cell rendering
technologies, like that used by sconsole/pccons and X windows
(it's understandable: Taligent is seriously biased toward
PostScript).  There are non-trivial workarounds (like "xtamil")
if that becomes an issue.



> 		Before remembering the FSF's work, I hacked some
> 		of the BSD utilities into compliance.  Then
> 		found that GNU has the majority of the utilities
> 		re-written.  The code ought to parallelize nicely,
> 		and even if not, having the POSIX compliance
> 		shouldn't cause any functional degradation.
> 		(Speculation:: I haven't tested my GNU ports yet.)

The problem is in the limitations imposed by the POSIX concept
of multibyte, and has nothing whatsoever to do with the quality
of the code used to code to the XPG/3 and/or XPG/4 interfaces
to implement it.  Sorry if that wasn't clear.



> > I believe most of the Lite2 code has not been integrated -- there
> > are supposedly some serious strides towards POSIX in some of the
> > unintegrated code.
> 
> 
> 		Thanks for the tip.  Do you know if it is the
> 		Lite2 code on the Walnut Creek CD?  It might be
> 		a big win to have the latest version of the Lite
> 		release around.  BTW, am I right to assume that
> 		Lite itself is dead?  Can't imagine anyone hacking
> 		on that stuff, but then... .

The code is being slowly integrated into the main line source
tree.  I have no idea about the user space stuff, really, since I
am mostly a kernel geek.  8-).

I know that the Lite2 is available on CD from Walnut Creek, and
I know that it's on line on freefall (or was, starting about a
year ago when I first submitted my FS patches, some of which were
intended to pave the way for support of Unicode directory name
spaces for FS's).



					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


