Date: Mon, 18 Sep 1995 13:04:47 -0700 (MST)
From: Terry Lambert <terry@lambert.org>
To: kaleb@x.org (Kaleb S. KEITHLEY)
Cc: terry@lambert.org, hackers@freefall.freebsd.org
Subject: Re: Policy on printf format specifiers?
Message-ID: <199509182004.NAA08387@phaeton.artisoft.com>
In-Reply-To: <199509181147.HAA15090@exalt.x.org> from "Kaleb S. KEITHLEY" at Sep 18, 95 07:47:42 am
> > I'd also like the wchar_t value to be 16 rather than 32 bits.
>
> That would be a serious mistake.  All modern OSes are using 32-bit
> wchar_t.  Don't take a step backward.

Windows 95 and Windows NT, arguably the two largest potential sources
of Unicode-aware software, use a 16-bit wchar_t.  This is specific to
the releases of MSVC 2.x that support Unicode.  It is also specific to
Borland and the other compilers that support generation of Win32
applications.  Since these OSes already support Unicode, they are
arguably more "modern" than those which do not.

> > Other than page 0 (Unicode), no other code pages in ISO-10646 have
> > yet been allocated.
>
> Er, I don't have my copy of 10646 here at home.  As I recall page 0
> is just Latin1.  If page 0 is in fact Unicode, which already has
> encodings for every written language on Earth, then what would 10646
> need any other pages for?

Page 0 is Unicode.  Pages are 16 bits, with a 16-bit page selector, in
10646.  This was a nod to the Japanese to get them to support the
inclusion of Unicode in standardized software at all.

The Japanese hate Unicode for several reasons, some of which only make
sense if you're French.  8-).

The main problem they see is that the CJK unification put the
characters in Chinese dictionary order.  The next main problem is that
it is not intrinsically tagged by language, so it isn't easy to tell
the difference between a Chinese and a Japanese document.  Another
problem is that it isn't multilingual: if I take a document in
Japanese that explains Chinese poetry, with embedded examples, then I
get Chinese or Japanese characters for all text.  Unicode expects
these types of encoding data to be escaped into the text -- what it
calls an RTF or "Rich Text Format" file.  The Japanese already have
this using ISO 2022 and JIS 208+212, which combined allow the encoding
of 21 languages.

Unicode does *not* include all languages on Earth; in particular, most
dead languages aren't supported.
I personally have a number of problems with the encoding ordering for
ligatured languages, like Tamil, Devanagari, Hebrew, Arabic, etc.,
since there is an implied inability to use fixed-cell rendering
technologies, like X.

The other code pages currently are not *for* anything, since they are
unassigned, but they are expected to be for languages not covered by
Unicode in general and dead languages in particular, as well as giving
the Japanese (or anyone else) the ability to attribute by language
within the Unicode set by using the high bits as attribution.  At the
cost of double the storage requirements.

> The 2.1.0-<mumble>SNAP has a Japanese EUC and Cyrillic code pages,
> which, as I recall, are not on page 0.

You are confusing the compatibility regions for JIS-212.  The Cyrillic
*is* and always *has* been part of the Unicode standard.  What hasn't
been part of the standard is enforcement of KOI-8 character order.
KOI-8 is a popular character set standard in the former Soviet Union,
though it is not yet supported by any national or international
standards body.

The main "win" for separating these encodings is the ability to encode
ordering information, which was expected to be encoded separately from
the Unicode standard in any case.  For example, many Northern European
countries have multiple collation orders for alphabetization; Germany
has "Dictionary" and "Telephone Book" orders which differ from each
other.  Again, at the cost of double the storage requirements.

It's not expected that practical use will actually be made of the
10646 non-zero code pages anywhere in the near term, and even after
that, it is expected that the pages will be used to resolve political
rather than technical issues.
> > This would affect constant ISO 8859-1 strings using the 'L'
> > qualifier; for example:
> >
> > main()
> > {
> > 	printf( "%S\n", L"Hello World");
> > }
>
> To print a widechar string you should convert it to a multi-byte
> string with wcstombs and then print it.

This sucks.  It assumes runic encoding for input to your
display/rendering technology.  This is *exactly* what Taligent was
pushing when they set the adjacency of the characters in ligatured
languages such that there was insufficient "private use area" to embed
the prerendered fixed-cell forms for ligatured characters.  This has
the effect of deprecating X as a display technology because of its use
of downloaded prerendered fonts that are blitted to the screen.
Prerendered fonts can only have predefined ligature points if there
are holes for glyph variants.

> Because you're asking for 16-bit wchar_t I presume you have a large
> number of strings and are concerned about the amount of space
> they'll use when stored in your program file.  If that's the case
> your strings should be stored in locale specific message catalogs.

No.  The 16-bit wchar_t is a concern for compatibility with other
systems, a desire to avoid runic encodings which ruin the usability of
fixed fields in data entry and back-end storage systems, NFS-exported
file system interoperability, and fixed directory entry block sizes of
1K or less for a 255-glyph file name component.

I'm more than a bit worried about storage of information in a process
encoding form so as to avoid the process/storage encoding translation
overhead and the destruction of meaningful information by runic
encoding file expansion, but this issue is secondary.

> Because wchar_t is different, i.e. 16-bit on some systems, 32-bit on
> others, you never store wchar_t strings in a file.  You always
> convert them to multi-byte strings with wcstombs before writing to a
> file.
Rendering the file length meaningless and requiring the use of
record-oriented file systems with variant-length records to handle
data from fixed-length input fields on user interaction screens.

Runic storage encoding: Just Say No.

> Since the locale the file was created in is not recorded in the file
> the burden is on the user to remember and use the correct locale
> when rereading the file and convert it back to a wchar_t string with
> mbstowcs.

Yeah.  That's the file attribution problem.

But if you only care about internationalization (enabling a program or
OS for data-driven localization to a single language) instead of about
multinationalization (enabling a program or OS for multilingual
support for characters which intersect in the unified international
character set, like the Chinese and Japanese glyphs for the unified
ideogram "grass", for inherently multilingual use), then the problem
is lessened.  You still don't lose the ability to provide language
attribution; it's just that *that's* when you go to what Unicode calls
"Rich Text Format" (and what the rest of us call "compound
documents").

If you want to think about it for a bit, language encoding attribution
for files, where you don't store everything as raw 16-bit wchar_t's
(for Unicode) or raw 32-bit wchar_t's (for ISO 10646, with every other
16 bits 0x0000, since no code pages other than 0 are assigned), is
tantamount to specifying a compression schema.  An 8-bit storage
encoding of Latin-1 (ISO 8859-1) is tantamount to a "compressed"
Unicode document that was compressed using compression technique "ISO
8859-1", taking advantage of symmetry in the data to do the
compression.  This resolves the attribution issue by divorcing it from
the need to use attribution on the data streams associated with the
file (Ohta's argument against file attribution).  Data pushed out of
the file system is expanded in the file system buffers on the way out.
					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my
present or previous employers.
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?199509182004.NAA08387>