Date:      Fri, 25 Aug 2000 13:35:38 +0100
From:      Konstantin Chuguev <Konstantin.Chuguev@dante.org.uk>
To:        "Michael C. Wu" <keichii@peorth.iteration.net>
Cc:        Boris Popov <bp@butya.kz>, freebsd-arch@freebsd.org, freebsd-i18n@freebsd.org
Subject:   Re: Proposal to include iconv library in the base system.
Message-ID:  <39A6681A.E9337835@dante.org.uk>
References:  <Pine.BSF.4.10.10008241719320.80086-100000@lion.butya.kz> <20000824172601.A1353@peorth.iteration.net>

"Michael C. Wu" wrote:

> On Thu, Aug 24, 2000 at 05:39:39PM +0700, Boris Popov scribbled:
> |
> | 4) I do not think the charset tables will be bigger than 15mb total.
> ---end quoted text---
> How about having two sets of conversion tables?
>
> One set would be the plain text files and the other would be
> in Berkeley DB format to allow for faster system look-up's.
>

The current charset table format is produced as follows:

There are two sources for the table files: plain text tables from the Unicode
WWW/FTP site, and RFC 1345. All of this is kept in the developer's internal
directory.
The C source file for each charset is generated from the table files by a Perl
script. This is done before the source code distribution package is created.
It is OK to ship the Perl script in the distribution, but I see no reason to
run it every time the sources are compiled.

The generated C files contain conversion tables for both directions, from Unicode
and to Unicode. Each conversion table is either an array of 128 or 256 unsigned
shorts, or an array of pointers to arrays of 128/256 unsigned shorts. Each file
also contains internal functions and a few structures for "virtual methods".
The C files are smaller than the corresponding Unicode files, but bigger than
the corresponding RFC 1345 entries.
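
A minimal sketch of the two table shapes described above, not the library's
actual code. The names (lookup_flat, lookup_paged, NO_MAPPING, the demo_*
arrays) are hypothetical, and the demo fills in only two KOI8-R entries
(0xE1 and 0xE2, Cyrillic capital A and BE) at run time:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_MAPPING 0xFFFFu  /* hypothetical marker for "no such character" */

/* Flat form: one array of 256 unsigned shorts, indexed by a single byte. */
static uint16_t lookup_flat(const uint16_t table[256], uint8_t byte)
{
    return table[byte];
}

/* Paged form: an array of 256 pointers, each either NULL or pointing at a
 * 256-entry page. The high byte selects the page, the low byte the entry. */
static uint16_t lookup_paged(const uint16_t *const pages[256], uint16_t code)
{
    assert(pages != NULL);
    const uint16_t *page = pages[code >> 8];
    return page != NULL ? page[code & 0xFF] : (uint16_t)NO_MAPPING;
}

/* Demo data only: two KOI8-R letters, everything else left unmapped. */
static uint16_t demo_to_unicode[256];          /* charset byte -> UCS-2 */
static uint16_t demo_page04[256];              /* UCS-2 row 0x04 -> byte */
static const uint16_t *demo_from_unicode[256]; /* one pointer per UCS-2 row */

static void demo_init(void)
{
    for (size_t i = 0; i < 256; i++) {
        demo_to_unicode[i] = NO_MAPPING;
        demo_page04[i] = NO_MAPPING;
        demo_from_unicode[i] = NULL;
    }
    demo_to_unicode[0xE1] = 0x0410;   /* KOI8-R 0xE1 -> U+0410 */
    demo_to_unicode[0xE2] = 0x0411;   /* KOI8-R 0xE2 -> U+0411 */
    demo_page04[0x10] = 0xE1;         /* U+0410 -> KOI8-R 0xE1 */
    demo_page04[0x11] = 0xE2;         /* U+0411 -> KOI8-R 0xE2 */
    demo_from_unicode[0x04] = demo_page04;
}
```

Rows with no page pointer cost only one NULL entry, which is one reason a
paged table stays small for charsets that cover a few Unicode rows.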

The .so dynamically loadable modules are built from the C files at compile time.
They are much smaller than the C files.

I believe that the resulting .so files are smaller than the corresponding DB
tables would be, and smaller even than CDB ("constant database") files. The
lookup is also faster but, more importantly, it does not have to be based on
tables at all. As I said before, conversion to/from Unicode is done by calling
a (virtual) method in a .so module.
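
To illustrate the "virtual method" idea, here is a hedged sketch in which a
charset module exposes a structure of function pointers and the library calls
through it. The structure layout and all names (charset_methods,
latin1_module, convert_to_unicode) are assumptions for illustration, not the
real interface. In a real build the structure would live in a .so and be
located with dlopen()/dlsym(); here everything is kept in one file. The
ISO-8859-1 module is purely algorithmic, showing that a module needs no table:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_MAPPING 0xFFFFu  /* hypothetical marker for "no such character" */

/* Hypothetical method table that every charset module would export. */
struct charset_methods {
    const char *name;
    uint16_t (*to_unicode)(uint8_t byte);
    uint16_t (*from_unicode)(uint16_t ucs);
};

/* ISO-8859-1 is the first 256 Unicode code points, so no table is needed. */
static uint16_t latin1_to_unicode(uint8_t byte)
{
    return byte;
}

static uint16_t latin1_from_unicode(uint16_t ucs)
{
    return ucs <= 0xFF ? ucs : (uint16_t)NO_MAPPING;
}

static const struct charset_methods latin1_module = {
    "ISO-8859-1", latin1_to_unicode, latin1_from_unicode
};

/* Library side: converts n bytes through the method table without knowing
 * whether the module uses a table or computes the result. Returns n. */
static size_t convert_to_unicode(const struct charset_methods *cs,
                                 const uint8_t *in, size_t n, uint16_t *out)
{
    assert(cs != NULL && cs->to_unicode != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = cs->to_unicode(in[i]);
    return n;
}
```

Because the caller sees only the function pointers, a module can change its
internal table format without the library being recompiled.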
This will allow us, for example, to easily create a new conversion table module
for mapping between the standard CJK charsets and a new Unicode version
containing all ideographs from the Kangxi Dictionary and some other CJK
characters (http://www.unicode.org/unicode/alloc/Pipeline.html). As the new
characters occupy Plane 2 of ISO-10646, their code points do not fit in an
unsigned short, so we will need another format for the table. But this will not
affect the library or the other modules.
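
A small sketch of why Plane 2 calls for a different table format: its code
points start at 0x20000, which is wider than 16 bits. The helper names
(fits_in_ushort, split_ucs4, struct ucs4_parts) are hypothetical, and the
plane/row/cell split is only one possible way a wider table could be indexed:

```c
#include <assert.h>
#include <stdint.h>

/* A code point split into the three bytes a multi-level table could use. */
struct ucs4_parts {
    uint8_t plane;  /* bits 16..23 */
    uint8_t row;    /* bits 8..15  */
    uint8_t cell;   /* bits 0..7   */
};

/* True if the code point can be stored in a 16-bit table entry. */
static int fits_in_ushort(uint32_t cp)
{
    return cp <= 0xFFFFu;
}

/* Split a code point into plane, row and cell. This sketch only accepts
 * values up to 0x10FFFF. */
static struct ucs4_parts split_ucs4(uint32_t cp)
{
    struct ucs4_parts p;
    assert(cp <= 0x10FFFFu);
    p.plane = (uint8_t)(cp >> 16);
    p.row = (uint8_t)((cp >> 8) & 0xFF);
    p.cell = (uint8_t)(cp & 0xFF);
    return p;
}
```

A Plane 0 ideograph such as 0x4E00 passes fits_in_ushort(), while 0x20000
does not, so only the modules that map into Plane 2 need the wider entries.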

--
          * *        Konstantin Chuguev - Application Engineer
       *      *              Francis House, 112 Hills Road
     *                       Cambridge CB2 1PQ, United Kingdom
 D  A  N  T  E       WWW:    http://www.dante.net




