Date: Mon, 18 Sep 1995 13:04:47 -0700 (MST)
From: Terry Lambert <terry@lambert.org>
To: kaleb@x.org (Kaleb S. KEITHLEY)
Cc: terry@lambert.org, hackers@freefall.freebsd.org
Subject: Re: Policy on printf format specifiers?
Message-ID: <199509182004.NAA08387@phaeton.artisoft.com>
In-Reply-To: <199509181147.HAA15090@exalt.x.org> from "Kaleb S. KEITHLEY" at Sep 18, 95 07:47:42 am
> > I'd also like the wchar_t value to be 16 rather than 32 bits.
>
> That would be a serious mistake.  All modern OSes are using 32-bit
> wchar_t.  Don't take a step backward.

Windows 95 and Windows NT, arguably the two largest potential sources
of Unicode-aware software, use a 16-bit wchar_t.  This is specific to
the releases of MSVC 2.x that support Unicode.  It is also specific to
Borland and the other compilers that support generation of Win32
applications.  Since these OSes already support Unicode, they are
arguably more "modern" than those which do not.

> > Other than page 0 (Unicode), no other code pages in ISO-10646 have
> > yet been allocated.
>
> Er, I don't have my copy of 10646 here at home.  As I recall page 0
> is just Latin1.  If page 0 is in fact Unicode, which already has
> encodings for every written language on Earth, then what would 10646
> need any other pages for?

Page 0 is Unicode.  Pages are 16 bits, with a 16-bit page selector, in
10646.  This was a nod to the Japanese to get them to support the
inclusion of Unicode in standardized software at all.

The Japanese hate Unicode for several reasons, some of which only make
sense if you're French.  8-).

The main problem they see is that the CJK unification put the
characters in Chinese dictionary order.  The next main problem is that
it is not intrinsically tagged by language, so it isn't easy to tell
the difference between a Chinese and a Japanese document.  Another
problem is that it isn't multilingual: if I take a document in
Japanese that explains Chinese poetry, with embedded examples, then I
get Chinese or Japanese characters for all text.  Unicode expects
these types of encoding data to be escaped into the text -- what it
calls an RTF or "Rich Text Format" file.  The Japanese already have
this using ISO 2022 and JIS 208+212, which combined allow the encoding
of 21 languages.

Unicode does *not* include all languages on Earth; in particular, most
dead languages aren't supported.
I personally have a number of problems with the encoding ordering for
ligatured languages, like Tamil, Devanagari, Hebrew, Arabic, etc.,
since there is an implied inability to use fixed-cell rendering
technologies, like X.

The other code pages currently are not *for* anything, since they are
unassigned, but they are expected to be for languages not covered by
Unicode in general and dead languages in particular, as well as giving
the Japanese (or anyone else) the ability to attribute by language
within the Unicode set by using the high bits as attribution.  At the
cost of double the storage requirements.

> The 2.1.0-<mumble>SNAP has a Japanese EUC and Cyrillic code pages,
> which, as I recall, are not on page 0.

You are confusing the compatibility regions for JIS-212.  The Cyrillic
*is* and always *has* been part of the Unicode standard.  What hasn't
been part of the standard is enforcement of KOI-8 character order.
KOI-8 is a popular character set standard in the former Soviet Union,
though it is not yet supported by any national or international
standards body.

The main "win" for separating these encodings is the ability to encode
ordering information, which was expected to be encoded separately from
the Unicode standard in any case.  For example, many Northern European
countries have multiple collation orders for alphabetization; Germany
has "Dictionary" and "Telephone Book" orders which differ from each
other.  Again, at the cost of double the storage requirements.

It's not expected that practical use will actually be made of the
10646 non-zero code pages anywhere in the near term, and even after
that, it is expected that the pages will be used to resolve political
rather than technical issues.
> > This would affect constant ISO 8859-1 strings using the 'L'
> > qualifier; for example:
> >
> > main()
> > {
> > 	printf( "%S\n", L"Hello World");
> > }
>
> To print a widechar string you should convert it to a multi-byte
> string with wcstombs and then print it.

This sucks.  It assumes runic encoding for input to your
display/rendering technology.  This is *exactly* what Taligent was
pushing when they set the adjacency of the characters in ligatured
languages such that there was insufficient "private use area" to embed
the prerendered fixed-cell forms for ligatured characters.  This has
the effect of deprecating X as a display technology because of its use
of downloaded prerendered fonts that are blitted to the screen.
Prerendered fonts can only have predefined ligature points if there
are holes for glyph variants.

> Because you're asking for 16-bit wchar_t I presume you have a large
> number of strings and are concerned about the amount of space
> they'll use when stored in your program file.  If that's the case
> your strings should be stored in locale specific message catalogs.

No.  The 16-bit wchar_t is a concern for compatibility with other
systems, a desire to avoid runic encodings which ruin the usability of
fixed fields in data entry and back-end storage systems, NFS-exported
file system interoperability, and fixed directory entry block sizes of
1K or less for a 255-glyph file name component.

I'm more than a bit worried about storage of information in a process
encoding form so as to avoid the process/storage encoding translation
overhead and the destruction of meaningful information by runic
encoding file expansion, but this issue is secondary.

> Because wchar_t is different, i.e. 16-bit on some systems, 32-bit on
> others, you never store wchar_t strings in a file.  You always
> convert them to multi-byte strings with wcstombs before writing to a
> file.
Rendering the file length meaningless and requiring the use of
record-oriented file systems with variant-length records to handle
data from fixed-length input fields on user interaction screens.

Runic storage encoding: Just Say No.

> Since the locale the file was created in is not recorded in the file
> the burden is on the user to remember and use the correct locale
> when rereading the file and convert it back to a wchar_t string with
> mbstowcs.

Yeah.  That's the file attribution problem.

But if you only care about internationalization (enabling a program or
OS for data-driven localization to a single language) instead of about
multinationalization (enabling a program or OS for multilingual
support for characters which intersect in the unified international
character set, like the Chinese and Japanese glyphs for the unified
ideogram "grass", for inherently multilingual use), then the problem
is lessened.  You still don't lose the ability to provide language
attribution; it's just that *that's* when you go to what Unicode calls
"Rich Text Format" (and what the rest of us call "compound
documents").

If you want to think about it for a bit, language encoding attribution
for files, where you don't store everything as raw 16-bit wchar_t's
(for Unicode) or raw 32-bit wchar_t's (for ISO 10646, with every other
16 bits 0x0000, since no code pages other than 0 are assigned), is
tantamount to specifying a compression schema.  An 8-bit storage
encoding of Latin-1 (ISO 8859-1) is tantamount to a "compressed"
Unicode document that was compressed using compression technique "ISO
8859-1", taking advantage of symmetry in the data to do the
compression.  This resolves the attribution issue by divorcing it from
the need to use attribution on the data streams associated with the
file (Ohta's argument against file attribution).  Data pushed out of
the file system is expanded in the file system buffers on the way out.
					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my
present or previous employers.
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?199509182004.NAA08387>