From owner-freebsd-i18n  Thu Aug 24  3:48:25 2000
Delivered-To: freebsd-i18n@freebsd.org
Received: from relay.butya.kz (butya-gw.butya.kz [212.154.129.94])
	by hub.freebsd.org (Postfix) with ESMTP
	id 1A77537B423; Thu, 24 Aug 2000 03:48:15 -0700 (PDT)
Received: by relay.butya.kz (Postfix, from userid 1000)
	id 4CC732880F; Thu, 24 Aug 2000 17:39:39 +0700 (ALMST)
Received: from localhost (localhost [127.0.0.1])
	by relay.butya.kz (Postfix) with ESMTP
	id 4069C2880D; Thu, 24 Aug 2000 17:39:39 +0700 (ALMST)
Date: Thu, 24 Aug 2000 17:39:39 +0700 (ALMST)
From: Boris Popov <bp@butya.kz>
To: freebsd-arch@freebsd.org, freebsd-i18n@freebsd.org,
	Konstantin Chuguev <Konstantin.Chuguev@dante.org.uk>
Subject: Proposal to include iconv library in the base system.
Message-ID: <Pine.BSF.4.10.10008241719320.80086-100000@lion.butya.kz>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-i18n@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

[This message cc'ed to -i18n which has had zero activity in the last month.]

Proposal to include iconv library and iconv(1) program in the base system.

This library of functions and its companion iconv program provide
conversion between various single-byte and multibyte charsets.  The iconv*
functions are essential in mixed networks and on local machines that use
multiple charsets.

FreeBSD already contains a few character conversion schemes: the
msdosfs, nwfs, cd9660fs and syscons mapping tables.  However, the usage
of these tables is not standardized, and they support only a small
number of character sets.  Many external packages like KDE and GNOME also
rely on the iconv functions.

Konstantin Chuguev wrote the original code in BSDL and I modified it
slightly.

The Open Group has a description of the iconv functions online:
http://www.opengroup.org/onlinepubs/7908799/xsh/iconv.html

A brief overview of character sets is available at:
http://www.austin.ibm.com/doc_link/en_US/a_doc_lib/aixprggd/genprogc/codeset_over.htm


Short Introduction on Library Design and Implementation:

The library consists of a core part, the Character Encoding Scheme (CES)
modules and the Character Conversion Scheme (CCS) modules.  The core part
contains the exposed user functions and the internal framework for modules.
To support the maximum number of character set combinations, the library
uses Unicode as the intermediate charset.  The CES and CCS modules contain
the conversion logic and conversion tables that map characters between
Unicode and the target charset.  The entire character conversion process
looks like this:

	charset1 -> Unicode -> charset2

In addition, it is possible to perform conversion only to/from Unicode.
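
A minimal sketch in C of the two-step scheme above, for single-byte
charsets only.  The names (charset_module, convert_via_unicode) and the
three-letter KOI8-R and CP1251 subsets are made up for illustration and
are not taken from the library sources:

```c
#include <stddef.h>

#define UCS_INVALID 0xFFFFu	/* marks a character with no mapping */

/* Hypothetical per-charset module: one function for each direction. */
struct charset_module {
	unsigned (*to_unicode)(unsigned char ch);	/* UCS_INVALID if unmapped */
	int (*from_unicode)(unsigned ucs);		/* -1 if unmapped */
};

/* Illustrative KOI8-R subset: ASCII plus three Cyrillic letters. */
unsigned koi8r_to_unicode(unsigned char ch)
{
	if (ch < 0x80)
		return ch;
	switch (ch) {
	case 0xC1: return 0x0430;	/* CYRILLIC SMALL LETTER A */
	case 0xC2: return 0x0431;	/* CYRILLIC SMALL LETTER BE */
	case 0xD7: return 0x0432;	/* CYRILLIC SMALL LETTER VE */
	default:   return UCS_INVALID;
	}
}

int koi8r_from_unicode(unsigned ucs)
{
	if (ucs < 0x80)
		return (int)ucs;
	switch (ucs) {
	case 0x0430: return 0xC1;
	case 0x0431: return 0xC2;
	case 0x0432: return 0xD7;
	default:     return -1;
	}
}

/* Illustrative CP1251 subset: the same letters sit at 0xE0..0xE2. */
unsigned cp1251_to_unicode(unsigned char ch)
{
	if (ch < 0x80)
		return ch;
	if (ch >= 0xE0 && ch <= 0xE2)
		return 0x0430u + (unsigned)(ch - 0xE0);
	return UCS_INVALID;
}

int cp1251_from_unicode(unsigned ucs)
{
	if (ucs < 0x80)
		return (int)ucs;
	if (ucs >= 0x0430 && ucs <= 0x0432)
		return 0xE0 + (int)(ucs - 0x0430);
	return -1;
}

const struct charset_module koi8r_module =
	{ koi8r_to_unicode, koi8r_from_unicode };
const struct charset_module cp1251_module =
	{ cp1251_to_unicode, cp1251_from_unicode };

/*
 * charset1 -> Unicode -> charset2, one byte at a time.
 * Returns the number of bytes written, or -1 if the output buffer is
 * too small or a character has no mapping in either step.
 */
int convert_via_unicode(const struct charset_module *from,
    const struct charset_module *to,
    const unsigned char *in, size_t inlen,
    unsigned char *out, size_t outlen)
{
	size_t i;

	if (outlen < inlen)
		return -1;
	for (i = 0; i < inlen; i++) {
		unsigned ucs = from->to_unicode(in[i]);
		int ch;

		if (ucs == UCS_INVALID)
			return -1;
		ch = to->from_unicode(ucs);
		if (ch < 0)
			return -1;
		out[i] = (unsigned char)ch;
	}
	return (int)inlen;
}
```

With N charsets this needs only N modules rather than a table for every
pair, which is the point of using Unicode as the intermediate charset.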

Modules are implemented as shared libraries and loaded via the dlopen()
function.  Modules reside in the /usr/lib/iconv/ directory and can be
added to the system dynamically.

To make the iconv subsystem more flexible, it has a "converter" layer
which allows additional converters to be plugged in.  Given two
arbitrarily chosen charsets, charset1 and charset2, a converter lets a
program open a conversion, perform it, and then close it, releasing its
resources.  For now the library has only the so-called
Unicode Converter (UC).
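
From the caller's side, this open/convert/close cycle is what the
standard interface described on the Open Group page exposes as
iconv_open(), iconv() and iconv_close().  A hedged sketch of one cycle
follows; convert_buffer is a hypothetical helper, and which charset
names are accepted depends on the installed implementation:

```c
#include <iconv.h>
#include <stddef.h>

/*
 * Convert inlen bytes of 'in' from charset 'from' to charset 'to'.
 * Returns the number of bytes written to 'out', or (size_t)-1 on error
 * (unknown charset, invalid input, or output buffer too small).
 */
size_t convert_buffer(const char *to, const char *from,
    const char *in, size_t inlen, char *out, size_t outlen)
{
	iconv_t cd;
	char *inp = (char *)in;		/* iconv() takes a non-const pointer */
	char *outp = out;
	size_t inleft = inlen;
	size_t outleft = outlen;
	size_t rc;

	cd = iconv_open(to, from);	/* "open" */
	if (cd == (iconv_t)-1)
		return (size_t)-1;

	rc = iconv(cd, &inp, &inleft, &outp, &outleft);	/* convert */
	if (rc != (size_t)-1)
		/* flush any pending shift sequence for stateful encodings */
		rc = iconv(cd, NULL, NULL, &outp, &outleft);

	iconv_close(cd);		/* close and release resources */

	return (rc == (size_t)-1) ? (size_t)-1 : outlen - outleft;
}
```

A call such as convert_buffer("UTF-8", "KOI8-R", in, inlen, out, outlen)
would then cover the whole cycle for one buffer.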

For example, it is possible to write an XLAT converter which will
support direct, table-based conversion between known character sets.  Of
course, a new converter can use its own modules.

Since support for multiple character sets is also required in the
kernel, there is a kernel part which provides nearly the same set of
functions in kernel space.  Conversion tables are uploaded to kernel
memory via the sysctl interface from the corresponding userland modules
(no code, only data).

The first open question is which character sets should be included in
the base system and which should be supplied as packages.  Obviously,
the conversion tables occupy about 99% of the space:

	Part Name                          Size of source code
	---------                          -------------------
	Library                                            83K

	Base character sets                               218K
	(ISO-8859*, cp8??, windows-125?)

	CJK                                              5548K
	(big5, cns*, gb*, jis*, cp9??)

	RFC1345 character sets                           1064K

	Unicode character sets                            711K
	------------------------------------------------------


Secondly, where should the functions be placed? Initially, the iconv
library was a separate library (libiconv*).  However, it seems that
Solaris has these functions in libc and Linux has them in glibc.  I do
not know how HP-UX does this.

And the third question is where I should place the source code for 
character conversion schemes in the source tree.

Of course, to respect maintainers of embedded systems and those who have
to deal with only one charset, a 'NO_ICONV' option will be provided.

I would appreciate any feedback on this topic.

P.S. The sources of libiconv in its current state are available at
http://www.butya.kz/~bp/inode/

-------------------------------------------- 

Michael C. Wu (keichii@iteration.net) reviewed my proposal before it was
posted and made some comments:

1) Does this allow for small patchsets to the character tables?
   e.g.  UNICODE does not completely map to BIG-5.
         Some implementations map the differences directly to
         blank space, while others map to equivalent characters.
         Depending on the user's choice, one should be able
         to specify a small change to the charset table without
         disruption.

I think it is possible, and the author probably already knows the way :)

2) We should include all EUC and ISO charsets to conform to standards,
   even if some of them are totally unused.

3) I suggest having the character tables in /usr/libdata.
   /usr/share should also have a directory that contains the mappings.
   For the kernel, perhaps we should have a src/sys/i18n
   and put iconv into src/sys/i18n/iconv.
   As to the libc code, to avoid trouble compiling and patching
   ports, we should follow what Linux does in libc.

4) I do not think the charset tables will be bigger than 15 MB total.
	

--
Boris Popov
http://www.butya.kz/~bp/


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-i18n" in the body of the message


From owner-freebsd-i18n  Thu Aug 24 15:26: 4 2000
Delivered-To: freebsd-i18n@freebsd.org
Received: from peorth.iteration.net (peorth.iteration.net [208.190.180.178])
	by hub.freebsd.org (Postfix) with ESMTP
	id E551237B43C; Thu, 24 Aug 2000 15:26:01 -0700 (PDT)
Received: by peorth.iteration.net (Postfix, from userid 1000)
	id AEF6F64C2E; Thu, 24 Aug 2000 17:26:01 -0500 (CDT)
Date: Thu, 24 Aug 2000 17:26:01 -0500
From: "Michael C. Wu" <keichii@peorth.iteration.net>
To: Boris Popov <bp@butya.kz>
Cc: freebsd-arch@freebsd.org, freebsd-i18n@freebsd.org,
	Konstantin Chuguev <Konstantin.Chuguev@dante.org.uk>
Subject: Re: Proposal to include iconv library in the base system.
Message-ID: <20000824172601.A1353@peorth.iteration.net>
Mail-Followup-To: "Michael C. Wu" <keichii@peorth.iteration.net>,
	Boris Popov <bp@butya.kz>, freebsd-arch@freebsd.org,
	freebsd-i18n@freebsd.org,
	Konstantin Chuguev <Konstantin.Chuguev@dante.org.uk>
References: <Pine.BSF.4.10.10008241719320.80086-100000@lion.butya.kz>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <Pine.BSF.4.10.10008241719320.80086-100000@lion.butya.kz>; from bp@butya.kz on Thu, Aug 24, 2000 at 05:39:39PM +0700
X-FreeBSD-Header: This is a subliminal message from the vast FreeBSD conspiracy project.
X-Operating-System: FreeBSD peorth.iteration.net 4.1-STABLE FreeBSD 4.1-STABLE
Sender: owner-freebsd-i18n@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Thu, Aug 24, 2000 at 05:39:39PM +0700, Boris Popov scribbled:
| [This message cc'ed to -i18n which has had zero activity in the last month.]

Must....add traffic
|
| 4) I do not think the charset tables will be bigger than 15 MB total.
---end quoted text---
How about having two sets of conversion tables?

One set would be the plain text files and the other would be
in Berkeley DB format to allow for faster system lookups.


--
+------------------------------------------------------------------+
| keichii@peorth.iteration.net         | keichii@bsdconspiracy.net |
| http://peorth.iteration.net/~keichii | Yes, BSD is a conspiracy. |
+------------------------------------------------------------------+




From owner-freebsd-i18n  Fri Aug 25  5:36:11 2000
Delivered-To: freebsd-i18n@freebsd.org
Received: from alpha.dante.org.uk (alpha.dante.org.uk [193.63.211.19])
	by hub.freebsd.org (Postfix) with ESMTP
	id 8A35C37B424; Fri, 25 Aug 2000 05:36:06 -0700 (PDT)
Received: from theta.dante.org.uk ([193.63.211.7])
	by alpha.dante.org.uk with esmtp (Exim 3.12 #4)
	id 13SIi4-0006VZ-00; Fri, 25 Aug 2000 13:35:48 +0100
Received: from localhost ([127.0.0.1] helo=dante.org.uk)
	by theta.dante.org.uk with esmtp (Exim 3.12 #4)
	id 13SIhu-0004wk-00; Fri, 25 Aug 2000 13:35:38 +0100
Message-ID: <39A6681A.E9337835@dante.org.uk>
Date: Fri, 25 Aug 2000 13:35:38 +0100
From: Konstantin Chuguev <Konstantin.Chuguev@dante.org.uk>
Organization: Delivery of Advanced Networking Service to Europe Ltd.
X-Mailer: Mozilla 4.73 [en] (X11; I; SunOS 5.6 sun4u)
X-Accept-Language: en, ru
MIME-Version: 1.0
To: "Michael C. Wu" <keichii@peorth.iteration.net>
Cc: Boris Popov <bp@butya.kz>, freebsd-arch@freebsd.org,
	freebsd-i18n@freebsd.org
Subject: Re: Proposal to include iconv library in the base system.
References: <Pine.BSF.4.10.10008241719320.80086-100000@lion.butya.kz> <20000824172601.A1353@peorth.iteration.net>
Content-Type: text/plain; charset=koi8-r
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-i18n@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

"Michael C. Wu" wrote:

> On Thu, Aug 24, 2000 at 05:39:39PM +0700, Boris Popov scribbled:
> |
> | 4) I do not think the charset tables will be bigger than 15 MB total.
> ---end quoted text---
> How about having two sets of conversion tables?
>
> One set would be the plain text files and the other would be
> in Berkeley DB format to allow for faster system lookups.
>

The current charset table format is produced as follows:

There are two sources for the table files: plain text tables from the
Unicode WWW/FTP site, and RFC1345. All of these are kept in the
developer's internal directory.
C source files for every charset are created from the table files by a Perl
script. This is done before creating the source code distribution package.
It is OK to provide the Perl script in the distribution, but I see no reason
to run it every time the sources are compiled.

The generated C files contain conversion tables for both directions, from
Unicode and to Unicode; each conversion table is either an array of 128 or
256 unsigned shorts, or an array of pointers to arrays of 128/256 unsigned
shorts. Each file also contains internal functions and a few structures for
"virtual methods".
The C files are smaller than the corresponding Unicode files, but bigger
than the corresponding RFC1345 entries.
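
To make the layout concrete, here is a rough sketch of the two-level
variant (an array of pointers to arrays of 256 unsigned shorts) used
for a from-Unicode lookup.  All names, the 0xFFFF "unmapped" marker and
the demo entries are assumptions for illustration; the real generated
files may well differ:

```c
#include <stddef.h>

#define UNMAPPED 0xFFFF		/* assumed marker for "no mapping" */

/*
 * Two-level table: the high byte of the key selects a row, the low
 * byte selects the entry in that row.  A NULL row means that nothing
 * in that 256-code range is mapped, so sparse charsets stay small.
 */
struct two_level_table {
	const unsigned short *rows[256];
};

unsigned short two_level_lookup(const struct two_level_table *t,
    unsigned short key)
{
	const unsigned short *row = t->rows[key >> 8];

	return (row != NULL) ? row[key & 0xFF] : UNMAPPED;
}

/* Demo data: U+0430..U+0432 mapped to 0xE0..0xE2, nothing else. */
static unsigned short cyrillic_row[256];
struct two_level_table demo_from_unicode;	/* all rows start as NULL */

void demo_table_init(void)
{
	size_t i;

	for (i = 0; i < 256; i++)
		cyrillic_row[i] = UNMAPPED;
	cyrillic_row[0x30] = 0xE0;
	cyrillic_row[0x31] = 0xE1;
	cyrillic_row[0x32] = 0xE2;
	demo_from_unicode.rows[0x04] = cyrillic_row;
}
```

A lookup costs two array indexings, and only the rows that actually
contain mapped characters take up space.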

Dynamically loadable .so modules are built from the C files at compile time.
They are much smaller than the C files.

I believe that the resulting .so files are smaller than the corresponding DB
tables would be, and even smaller than CDB "constant database" files. The
lookup is also faster, but more importantly, it is not necessarily based on
any tables. As I said before, the conversion to/from Unicode is done by
calling a (virtual) method in a .so module.
This will allow us, for example, to easily create a new conversion table
module for mapping between the standard CJK charsets and a new Unicode
version containing all ideographs from the Kangxi Dictionary and some other
CJK characters (http://www.unicode.org/unicode/alloc/Pipeline.html). As the
new characters occupy Plane 2 of ISO-10646, we will need another format for
the table. But this will not affect the library and other modules.

--
          * *        Konstantin Chuguev - Application Engineer
       *      *              Francis House, 112 Hills Road
     *                       Cambridge CB2 1PQ, United Kingdom
 D  A  N  T  E       WWW:    http://www.dante.net

