FreeBSD Mail Archives

Date:      Fri, 01 Sep 2000 17:16:38 +0100
From:      Konstantin Chuguev <Konstantin.Chuguev@dante.org.uk>
To:        "Andrey A. Chernov" <ache@nagual.pp.ru>
Cc:        Boris Popov <bp@butya.kz>, freebsd-arch@FreeBSD.ORG, freebsd-i18n@FreeBSD.ORG
Subject:   Re: Proposal to include iconv library in the base system.
Message-ID:  <39AFD666.880FE6C@dante.org.uk>
References:  <Pine.BSF.4.10.10008241719320.80086-100000@lion.butya.kz> <20000901185945.A29804@nagual.pp.ru>

"Andrey A. Chernov" wrote:

> On Thu, Aug 24, 2000 at 05:39:39PM +0700, Boris Popov wrote:
> > FreeBSD already contains a few character conversion schemes for
> > msdosfs, nwfs, cd9660fs and syscon mapping tables.  However, the usage
>
> We need XLAT converters for them, not Unicode ones; as I understand it, Unicode
> data loaded into the kernel will be too big.
>

It depends on what you mean by Unicode data.
At the Unicode site there is a plain-text table of Unicode data, with the
number of records approximately equal to 0xFFFF minus the CJK ideographs.
Each record corresponds to a Unicode character and can have 10 or more
fields, among them the canonical name of the character, information about
capital and small letters, directionality and so on. This information is
intended for processing and [de]normalizing Unicode text.
None of this is needed for charset conversion.

The most commonly used 8-bit charsets for filesystems are ISO-8859,
Windows-125x, IBM-86x and KOI8-R.
It is easy to create XLAT tables for conversion between pairs of these
charsets; obviously we won't need the "full mesh" of pairs here, only tables
for charsets used for the same language. Simplifying, we will need N * 2 *
256 bytes for all tables, where N is the number of charset pairs.
There are two problems here:

   * it is not easy to estimate N. For 4 charsets used for Russian we will
     need 6 * 2 tables (for 5 charsets, 10 * 2 tables), and other languages
     add more;
   * new filesystems use Unicode encodings: UCS-2 (Windows), some may use
     UTF-8. These encodings are not supported by XLAT.
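
The arithmetic above can be sketched in C. This is only an illustration: the
names xlat_apply, xlat_pair_count and xlat_total_bytes are hypothetical and do
not come from any existing FreeBSD code, and the sketch assumes one 256-byte
table per direction for every pair of charsets within one language.

```c
#include <assert.h>
#include <stddef.h>

/* One XLAT table maps every byte of one 8-bit charset to another. */
#define XLAT_TABLE_SIZE 256

/* Translate a buffer in place through a single 256-entry table. */
void
xlat_apply(const unsigned char *table, unsigned char *buf, size_t len)
{
    size_t i;

    assert(table != NULL && (buf != NULL || len == 0));
    for (i = 0; i < len; i++)
        buf[i] = table[buf[i]];
}

/* Number of unordered pairs among n charsets: n * (n - 1) / 2. */
size_t
xlat_pair_count(size_t n)
{
    return (n * (n - 1) / 2);
}

/* Two tables (one per direction) of 256 bytes for each pair. */
size_t
xlat_total_bytes(size_t n)
{
    return (xlat_pair_count(n) * 2 * XLAT_TABLE_SIZE);
}
```

With 4 charsets this gives 6 pairs, 12 tables and 3072 bytes; with 5 charsets,
10 pairs, 20 tables and 5120 bytes, and that is for a single language.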

iconv CCS modules consist of 2 tables each. One table translates from the
charset to Unicode (UCS-4), the other from UCS-4 to the charset. There are 4
table types currently supported: 7-bit, 8-bit, 14-bit and 16-bit. The table
layout is hidden behind the module interface. The two functions that actually
do the conversion are as follows (the names of functions and arguments here
don't exactly match those in the iconv implementation):

ucs4_t iconv_ccs_convert_to_ucs(void *module, ucs4_t charset_char);
ucs4_t iconv_ccs_convert_from_ucs(void *module, ucs4_t ucs_char);

The internal 14-bit and 16-bit tables are two-level, not flat.
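
A two-level lookup of this kind might look like the sketch below. The message
does not describe the real table layout, so everything here is an assumption:
the struct ccs_table16 type, the ccs_table16_lookup function, the 256 x 256
split, the CCS_INVALID marker and the definition of ucs4_t as a 32-bit
unsigned integer are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ucs4_t;                /* assumed: 32-bit unsigned code */
#define CCS_INVALID ((ucs4_t)0xFFFF)    /* hypothetical "no mapping" marker */

/*
 * Hypothetical two-level 16-bit table: the high byte of the code selects
 * one of 256 blocks, the low byte selects an entry inside that block.
 * Blocks with no mapped characters are simply left as NULL, which is
 * where the space saving over a flat 65536-entry table comes from.
 */
struct ccs_table16 {
    const uint16_t *block[256];
};

ucs4_t
ccs_table16_lookup(const struct ccs_table16 *t, ucs4_t ch)
{
    const uint16_t *blk;

    assert(t != NULL);
    if (ch > 0xFFFF)
        return (CCS_INVALID);       /* outside the 16-bit range */
    blk = t->block[ch >> 8];
    if (blk == NULL)
        return (CCS_INVALID);       /* whole block is unmapped */
    return (blk[ch & 0xFF]);        /* unmapped entries hold 0xFFFF */
}
```

A charset that touches only a few 256-character ranges then costs a few
512-byte blocks plus the top-level index, rather than a full flat table.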

There is one more type of CCS module I'm thinking about: 32-bit tables for
translation between the full range of CJK characters in, say, BIG5 or
CNS11643, and the next version of the Unicode standard with these characters
added.

  1. charsets for right-to-left scripts; they use special control characters
     for changing the direction of writing; the algorithm is different from
     the Unicode one; more complicated logic is necessary for these charsets.

The second type of module is CES, character encoding schemes. Their
interface is similar to the following:

ucs4_t ccs_convert_to_ucs(void *module, unsigned char **srcstr,
    unsigned *srcbytelen);
int ccs_convert_from_ucs(void *module, unsigned char **dststr,
    unsigned *dstbytelen, ucs4_t srcchar);

The difference from CCS is that there is no fixed-length correspondence
between the UCS and original charset characters. Each UCS character can be
translated to or from 0 to N bytes of text encoded in the original character
encoding scheme. The currently supported schemes are:

   * _tbl_simple - used for most European charsets (for ASCII and all 8-bit
     charsets); it simply uses the corresponding CCS module;
   * EUC family for CJK;
   * ISO-2022-xx for CJK;
   * UCS-4, UCS-2, UTF-16, UTF-8, UTF-7.
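
To show what a variable-length CES looks like in practice, here is a minimal
sketch of a UTF-8 "to UCS" step in the spirit of the interface above. It is
not the library's code: the function name utf8_to_ucs and the UCS_INVALID
marker are hypothetical, the module argument is dropped because UTF-8 needs
no table, and the sketch follows the modern 4-byte definition of UTF-8 rather
than whatever the library implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ucs4_t;                    /* assumed 32-bit code */
#define UCS_INVALID ((ucs4_t)0xFFFFFFFF)    /* hypothetical error marker */

/*
 * Decode one character from *src, consuming 1 to 4 bytes. On success the
 * pointer and the remaining length are advanced. On an illegal or
 * truncated sequence UCS_INVALID is returned and nothing is consumed.
 */
ucs4_t
utf8_to_ucs(const unsigned char **src, unsigned *srclen)
{
    const unsigned char *p;
    ucs4_t ch, min;
    unsigned need, i;

    assert(src != NULL && *src != NULL && srclen != NULL);
    if (*srclen == 0)
        return (UCS_INVALID);
    p = *src;
    if (p[0] < 0x80) {
        ch = p[0]; need = 0; min = 0;
    } else if ((p[0] & 0xE0) == 0xC0) {
        ch = p[0] & 0x1F; need = 1; min = 0x80;
    } else if ((p[0] & 0xF0) == 0xE0) {
        ch = p[0] & 0x0F; need = 2; min = 0x800;
    } else if ((p[0] & 0xF8) == 0xF0) {
        ch = p[0] & 0x07; need = 3; min = 0x10000;
    } else
        return (UCS_INVALID);               /* stray continuation byte etc. */
    if (*srclen < need + 1)
        return (UCS_INVALID);               /* truncated input */
    for (i = 1; i <= need; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return (UCS_INVALID);
        ch = (ch << 6) | (p[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and out-of-range values. */
    if (ch < min || ch > 0x10FFFF || (ch >= 0xD800 && ch <= 0xDFFF))
        return (UCS_INVALID);
    *src += need + 1;
    *srclen -= need + 1;
    return (ch);
}
```

The caller loops until the length reaches zero or an error is returned, which
is exactly the "0 to N bytes per character" behaviour that a fixed-width CCS
table cannot express.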

A new type of CES module could be one for the charsets used for Arabic and
Hebrew, where a more complicated algorithm is needed to convert directionality
control characters from/to Unicode.

Now, all the modules are loadable and shareable. If the system is using a
fair number of charsets at the same time, the amount of table data loaded
into the kernel can actually be smaller than when loading all corresponding pairs
of XLAT tables.
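
The saving comes from pivoting every conversion through UCS-4: each charset
needs only its own module, whatever it is being converted to. The sketch
below illustrates the idea for 8-bit charsets. The struct ccs8_module type
and the ccs8_* functions are hypothetical, and the reverse direction is done
by a linear search only to keep the example short (a real module would use
its second table).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ucs4_t;                /* assumed 32-bit code */
#define UCS_INVALID ((ucs4_t)0xFFFF)    /* hypothetical "no mapping" marker */

/* Hypothetical 8-bit CCS module: one flat table towards UCS. */
struct ccs8_module {
    ucs4_t to_ucs[256];
};

ucs4_t
ccs8_to_ucs(const struct ccs8_module *m, unsigned char c)
{
    assert(m != NULL);
    return (m->to_ucs[c]);
}

/* Reverse lookup by linear search; returns the byte or UCS_INVALID. */
ucs4_t
ccs8_from_ucs(const struct ccs8_module *m, ucs4_t u)
{
    unsigned i;

    assert(m != NULL);
    if (u == UCS_INVALID)
        return (UCS_INVALID);
    for (i = 0; i < 256; i++)
        if (m->to_ucs[i] == u)
            return ((ucs4_t)i);
    return (UCS_INVALID);
}

/* Convert one byte from charset "from" to charset "to" via UCS-4. */
ucs4_t
ccs8_pivot(const struct ccs8_module *from, const struct ccs8_module *to,
    unsigned char c)
{
    return (ccs8_from_ucs(to, ccs8_to_ucs(from, c)));
}
```

With this arrangement N charsets need N modules, whereas direct XLAT
conversion between every pair needs N * (N - 1) one-way tables.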

>
> > The questionable part is which set of character sets should be
> > included in the base system and which should be supplied as packages.
>
> We need to include all charsets we have locale support in the base system.
>

Exactly, this is what was intended. All [UNIX] charsets supported in the
FreeBSD distribution (i.e. those present in the locale directory) PLUS
charsets used in other types of filesystems (Windows, Netware?, MacOS?) for
the languages supported by FreeBSD (see locale again). Otherwise there is not
much need to include iconv in the kernel at the moment.
Perhaps minus the CJK charsets, due to their size. I don't know if there is a
need for CJK charset conversion for filesystems.
All other modules can easily be installed from ports/packages.

>
> > Secondly, where should the functions be placed? Initially, the iconv
>
> /usr/libdata/iconv
>

I think this case is much the same as for PAM modules.

>
> What I do not understand at this moment: how does iconv handle non-convertible
> characters? I don't see any way to set a fill character in the described
> interface.
>

According to the standard, iconv stops when it finds an illegal sequence of
bytes in the source byte sequence (input charset). If there is no
corresponding character in the destination charset, the behaviour of iconv is
implementation-dependent. My implementation currently translates it into a
predefined (at compile time) substitution character. Don't remember though
whether it is '_' or '?' :-)
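
The substitution behaviour described above can be sketched as follows. This
is an illustration only: the function ucs_to_ascii_subst is hypothetical, the
target charset is plain 7-bit ASCII to keep the example short, and SUBST_CHAR
is assumed to be '?' although the real value could just as well be '_'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ucs4_t;    /* assumed 32-bit code */

/* Compile-time substitution character; '?' is an assumption here. */
#define SUBST_CHAR '?'

/*
 * Convert n UCS-4 characters to 7-bit ASCII. Characters that have no
 * ASCII equivalent are replaced by SUBST_CHAR instead of stopping the
 * conversion. Returns the number of substitutions made.
 */
size_t
ucs_to_ascii_subst(const ucs4_t *src, size_t n, unsigned char *dst)
{
    size_t i, nsubst = 0;

    assert((src != NULL && dst != NULL) || n == 0);
    for (i = 0; i < n; i++) {
        if (src[i] < 0x80) {
            dst[i] = (unsigned char)src[i];
        } else {
            dst[i] = SUBST_CHAR;
            nsubst++;
        }
    }
    return (nsubst);
}
```

Returning the substitution count loosely mirrors the return value of the
standard iconv() call, which reports how many non-identical conversions were
performed, so a caller can still tell that the output is lossy.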

I will try my best to produce the final version 1.0 of the library and
conversion modules before Monday.

--
          * *        Konstantin Chuguev - Application Engineer
       *      *              Francis House, 112 Hills Road
     *                       Cambridge CB2 1PQ, United Kingdom
 D  A  N  T  E       WWW:    http://www.dante.net

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message
