Date: Fri, 25 Aug 2000 13:35:38 +0100
From: Konstantin Chuguev <Konstantin.Chuguev@dante.org.uk>
To: "Michael C. Wu" <keichii@peorth.iteration.net>
Cc: Boris Popov <bp@butya.kz>, freebsd-arch@freebsd.org, freebsd-i18n@freebsd.org
Subject: Re: Proposal to include iconv library in the base system.
Message-ID: <39A6681A.E9337835@dante.org.uk>
References: <Pine.BSF.4.10.10008241719320.80086-100000@lion.butya.kz> <20000824172601.A1353@peorth.iteration.net>
"Michael C. Wu" wrote: > On Thu, Aug 24, 2000 at 05:39:39PM +0700, Boris Popov scribbled: > | > | 4) I do not think the charset tables will be bigger than 15mb total. > ---end quoted text--- > How about having two sets of conversion tables? > > One set would be the plain text files and the other would be > in Berkeley DB format to allow for faster system look-up's. > The current charset table format is produced as follows: There are two sources for table files: plain text tables from the Unicode WWW/FTP site, and RFC1345. All this stuff is located in the developer's internal directory. C source files for every charset are created from the table files by a Perl script. This is done before creating the source code distribution package. It is OK to provide the Perl script in the distribution, but I see no reason to use it every time when compiling the sources. The produced C files contain conversion tables for both conversions from Unicode and to Unicode; each conversion table can be either array of 128 or 256 unsigned shorts, or the array of pointers to arrays of 128/256 ushorts. Each file also contains internal functions and a few structures for "virtual methods". C files are smaller than corresponding Unicode files, but bigger than corresponding RFC1345 entries. .so dynamic loadable modules are made at compile time from the C files. They are much smaller than the C files. I believe that the resulting .so files are smaller than corresponding DB tables would be, and even smaller than CDB "constant database" files. The lookup is also faster, but what is more important, it is not necessarily based on any tables. As I said before, the conversion to/from Unicode is done by calling a (virtual) method in a .so module. 

This will allow us, for example, to easily create a new conversion
table module for mapping between the standard CJK charsets and a new
Unicode version containing all ideographs from the Kangxi Dictionary
and some other CJK characters
(http://www.unicode.org/unicode/alloc/Pipeline.html). As the new
characters occupy Plane 2 of ISO-10646, we will need another format
for the table. But this will not affect the library or the other
modules.

-- 
Konstantin Chuguev - Application Engineer
Francis House, 112 Hills Road
Cambridge CB2 1PQ, United Kingdom
D A N T E    WWW: http://www.dante.net