From owner-freebsd-i18n Thu Aug 24 3:48:25 2000
Delivered-To: freebsd-i18n@freebsd.org
Received: from relay.butya.kz (butya-gw.butya.kz [212.154.129.94])
    by hub.freebsd.org (Postfix) with ESMTP
    id 1A77537B423; Thu, 24 Aug 2000 03:48:15 -0700 (PDT)
Received: by relay.butya.kz (Postfix, from userid 1000)
    id 4CC732880F; Thu, 24 Aug 2000 17:39:39 +0700 (ALMST)
Received: from localhost (localhost [127.0.0.1])
    by relay.butya.kz (Postfix) with ESMTP
    id 4069C2880D; Thu, 24 Aug 2000 17:39:39 +0700 (ALMST)
Date: Thu, 24 Aug 2000 17:39:39 +0700 (ALMST)
From: Boris Popov
To: freebsd-arch@freebsd.org, freebsd-i18n@freebsd.org, Konstantin Chuguev
Subject: Proposal to include iconv library in the base system.
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-i18n@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

[This message cc'ed to -i18n which has had zero activity in the last month.]

Proposal to include iconv library and iconv(1) program in the base system.

This library of functions and its companion iconv program convert text between various single-byte and multibyte charsets. These iconv* functions are essential in mixed networks and on local machines with multiple charsets.

FreeBSD already contains a few character conversion schemes for msdosfs, nwfs, cd9660fs and the syscons mapping tables. However, the usage of these tables is not standardized, and they support only a small number of character sets. Many external packages like KDE and GNOME also rely on the iconv functions.

Konstantin Chuguev wrote the original code under the BSDL and I modified it slightly.

OpenGroup has a description of iconv functions online:
http://www.opengroup.org/onlinepubs/7908799/xsh/iconv.html

A brief overview of character sets is available at:
http://www.austin.ibm.com/doc_link/en_US/a_doc_lib/aixprggd/genprogc/codeset_over.htm

Short Introduction on Library Design and Implementation:

The library consists of a core part, the Character Encoding Scheme (CES) modules and the Character Conversion Scheme (CCS) modules. The core part contains the exposed user functions and the internal framework for modules. To provide the maximum number of supported character set combinations, the library uses Unicode as the intermediate charset. CES and CCS modules contain the conversion logic and conversion tables that map characters between Unicode and the target charset. The entire character conversion process looks like this:

    charset1 -> unicode -> charset2

In addition, it is possible to perform conversion only to/from Unicode.

Modules are implemented as shared libraries and loaded via the dlopen() function. Modules reside in the /usr/lib/iconv/ directory and can be dynamically added to the system.

To make the iconv subsystem more flexible, it has a "converter" layer which allows additional converters to be added. Given two arbitrarily chosen charsets charset1 and charset2, a converter allows programs to "open" a conversion, perform it, and then close it, releasing its resources. For now the library has only the so-called Unicode Converter (UC). It would be possible, for example, to write an XLAT converter that supports direct, table-based conversion between known character sets. Of course, a new converter can use its own modules.

Since support for multiple character sets is also required in the kernel, there is a kernel part which provides nearly the same set of functions in kernel space. Conversion tables are uploaded to kernel memory via the sysctl interface from the corresponding userland modules (no code, only data).
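
The pivot scheme above (charset1 -> unicode -> charset2) can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not code from the library: the names map_entry, to_ucs, from_ucs and koi8r_to_cp866, the UCS_INVALID marker, and the three-letter mini-tables are all hypothetical. A real module would carry complete tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker for "this byte has no Unicode mapping". */
#define UCS_INVALID ((uint16_t)0xFFFF)

/* One (charset byte, Unicode code point) pair. */
struct map_entry {
    uint8_t  byte;
    uint16_t ucs;
};

/* Illustrative subsets only: three Cyrillic small letters each. */
static const struct map_entry koi8r_map[] = {
    { 0xC1, 0x0430 },   /* CYRILLIC SMALL LETTER A  */
    { 0xC2, 0x0431 },   /* CYRILLIC SMALL LETTER BE */
    { 0xC4, 0x0434 },   /* CYRILLIC SMALL LETTER DE */
};

static const struct map_entry cp866_map[] = {
    { 0xA0, 0x0430 },   /* CYRILLIC SMALL LETTER A  */
    { 0xA1, 0x0431 },   /* CYRILLIC SMALL LETTER BE */
    { 0xA4, 0x0434 },   /* CYRILLIC SMALL LETTER DE */
};

#define MAP_LEN(m) (sizeof(m) / sizeof((m)[0]))

/* Step 1: charset byte -> Unicode. */
static uint16_t to_ucs(const struct map_entry *map, size_t n, uint8_t byte)
{
    size_t i;

    assert(map != NULL);
    if (byte < 0x80)
        return byte;            /* both charsets share the ASCII range */
    for (i = 0; i < n; i++)
        if (map[i].byte == byte)
            return map[i].ucs;
    return UCS_INVALID;
}

/* Step 2: Unicode -> charset byte, or -1 if the charset lacks it. */
static int from_ucs(const struct map_entry *map, size_t n, uint16_t ucs)
{
    size_t i;

    assert(map != NULL);
    if (ucs < 0x80)
        return (int)ucs;
    for (i = 0; i < n; i++)
        if (map[i].ucs == ucs)
            return map[i].byte;
    return -1;
}

/* charset1 -> unicode -> charset2 for one byte; -1 if unmappable. */
int koi8r_to_cp866(uint8_t byte)
{
    uint16_t ucs = to_ucs(koi8r_map, MAP_LEN(koi8r_map), byte);

    if (ucs == UCS_INVALID)
        return -1;
    return from_ucs(cp866_map, MAP_LEN(cp866_map), ucs);
}
```

With this arrangement each charset needs only one pair of tables (to and from Unicode), so N charsets need N modules rather than a table for every pair of charsets.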

The first open question is which character sets should be included in the base system and which should be supplied as packages. Obviously, conversion tables occupy 99% of the space:

Part Name                 Size of source code
---------                 -------------------
Library                     83K
Base character sets        218K  (ISO-8859*, cp8??, windows-125?)
CJK                       5548K  (big5, cns*, gb*, jis*, cp9??)
RFC1345 character sets    1064K
Unicode character sets     711K
-------------------------------------------

Secondly, where should the functions be placed? Initially, the iconv library was a separate file (libiconv*). However, it seems that Solaris has the library in libc and Linux in glibc. I do not know how HPUX does this.

And the third question is where I should place the source code for character conversion schemes in the source tree.

Of course, to respect maintainers of embedded systems and those who have to deal with only one charset, the option 'NO_ICONV' will be provided.

I would appreciate any feedback on this topic.

P.S. Sources of libiconv in its current state are available at http://www.butya.kz/~bp/inode/

--------------------------------------------

Michael C. Wu (keichii@iteration.net) reviewed my proposal before it was posted and made some comments:

1) Does this allow for small patchsets to the character tables? i.e. UNICODE does not completely map to BIG-5. Some implementations map the differences directly to blank space, while others map to equivalent characters. Depending on the user's choice, one should be able to specify a small change to the charset table without disruption.

I think it is possible, and the author probably already knows the way :)

2) We should include all EUC and ISO charsets, even if they are sometimes totally unused, to conform to standards.

3) I suggest having the character tables in /usr/libdata. /usr/share should have a directory that contains the mappings also. For the kernel, perhaps we should have a src/sys/i18n and put iconv into src/sys/i18n/iconv.
As to the libc code, to avoid ports compiling and patching trouble, we should follow what Linux does in libc.

4) I do not think the charset tables will be bigger than 15 MB total.

--
Boris Popov
http://www.butya.kz/~bp/


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-i18n" in the body of the message

From owner-freebsd-i18n Thu Aug 24 15:26: 4 2000
Delivered-To: freebsd-i18n@freebsd.org
Received: from peorth.iteration.net (peorth.iteration.net [208.190.180.178])
    by hub.freebsd.org (Postfix) with ESMTP
    id E551237B43C; Thu, 24 Aug 2000 15:26:01 -0700 (PDT)
Received: by peorth.iteration.net (Postfix, from userid 1000)
    id AEF6F64C2E; Thu, 24 Aug 2000 17:26:01 -0500 (CDT)
Date: Thu, 24 Aug 2000 17:26:01 -0500
From: "Michael C. Wu"
To: Boris Popov
Cc: freebsd-arch@freebsd.org, freebsd-i18n@freebsd.org, Konstantin Chuguev
Subject: Re: Proposal to include iconv library in the base system.
Message-ID: <20000824172601.A1353@peorth.iteration.net>
Mail-Followup-To: "Michael C. Wu", Boris Popov,
    freebsd-arch@freebsd.org, freebsd-i18n@freebsd.org, Konstantin Chuguev
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: ; from bp@butya.kz on Thu, Aug 24, 2000 at 05:39:39PM +0700
X-FreeBSD-Header: This is a subliminal message from the vast FreeBSD conspiracy project.
X-Operating-System: FreeBSD peorth.iteration.net 4.1-STABLE FreeBSD 4.1-STABLE
Sender: owner-freebsd-i18n@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Thu, Aug 24, 2000 at 05:39:39PM +0700, Boris Popov scribbled:
| [This message cc'ed to -i18n which has had zero activity in the last month.]

Must....add traffic

|
| 4) I do not think the charset tables will be bigger than 15 MB total.
---end quoted text---

How about having two sets of conversion tables?

One set would be the plain text files and the other would be in Berkeley DB format to allow for faster system lookups.

--
+------------------------------------------------------------------+
| keichii@peorth.iteration.net         | keichii@bsdconspiracy.net |
| http://peorth.iteration.net/~keichii | Yes, BSD is a conspiracy. |
+------------------------------------------------------------------+

From owner-freebsd-i18n Fri Aug 25 5:36:11 2000
Delivered-To: freebsd-i18n@freebsd.org
Received: from alpha.dante.org.uk (alpha.dante.org.uk [193.63.211.19])
    by hub.freebsd.org (Postfix) with ESMTP
    id 8A35C37B424; Fri, 25 Aug 2000 05:36:06 -0700 (PDT)
Received: from theta.dante.org.uk ([193.63.211.7])
    by alpha.dante.org.uk with esmtp (Exim 3.12 #4)
    id 13SIi4-0006VZ-00; Fri, 25 Aug 2000 13:35:48 +0100
Received: from localhost ([127.0.0.1] helo=dante.org.uk)
    by theta.dante.org.uk with esmtp (Exim 3.12 #4)
    id 13SIhu-0004wk-00; Fri, 25 Aug 2000 13:35:38 +0100
Message-ID: <39A6681A.E9337835@dante.org.uk>
Date: Fri, 25 Aug 2000 13:35:38 +0100
From: Konstantin Chuguev
Organization: Delivery of Advanced Networking Service to Europe Ltd.
X-Mailer: Mozilla 4.73 [en] (X11; I; SunOS 5.6 sun4u)
X-Accept-Language: en, ru
MIME-Version: 1.0
To: "Michael C. Wu"
Cc: Boris Popov, freebsd-arch@freebsd.org, freebsd-i18n@freebsd.org
Subject: Re: Proposal to include iconv library in the base system.
References: <20000824172601.A1353@peorth.iteration.net>
Content-Type: text/plain; charset=koi8-r
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-i18n@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

"Michael C. Wu" wrote:

> On Thu, Aug 24, 2000 at 05:39:39PM +0700, Boris Popov scribbled:
> |
> | 4) I do not think the charset tables will be bigger than 15 MB total.
> ---end quoted text---
> How about having two sets of conversion tables?
>
> One set would be the plain text files and the other would be
> in Berkeley DB format to allow for faster system lookups.
>

The current charset table format is produced as follows:

There are two sources for table files: plain text tables from the Unicode WWW/FTP site, and RFC1345. All this stuff is located in the developer's internal directory.

C source files for every charset are created from the table files by a Perl script. This is done before creating the source code distribution package. It is OK to provide the Perl script in the distribution, but I see no reason to run it every time the sources are compiled.

The produced C files contain conversion tables for both directions, from Unicode and to Unicode; each conversion table can be either an array of 128 or 256 unsigned shorts, or an array of pointers to arrays of 128/256 ushorts. Each file also contains internal functions and a few structures for "virtual methods". The C files are smaller than the corresponding Unicode files, but bigger than the corresponding RFC1345 entries.

Dynamically loadable .so modules are built at compile time from the C files. They are much smaller than the C files.

I believe that the resulting .so files are smaller than the corresponding DB tables would be, and even smaller than CDB "constant database" files. The lookup is also faster, but more importantly, it is not necessarily based on any tables. As I said before, the conversion to/from Unicode is done by calling a (virtual) method in a .so module. This will allow us, for example, to easily create a new conversion table module for mapping between CJK standard charsets and a new Unicode version containing all ideographs from the Kangxi Dictionary and some other CJK characters (http://www.unicode.org/unicode/alloc/Pipeline.html). As the new characters occupy Plane 2 of ISO-10646, we will need another format for the table. But this will not affect the library and other modules.
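
The two table layouts and the "virtual method" structure described above might look roughly like this in C. This is a minimal sketch under stated assumptions, not the generated code itself: the names ccs_module, ccs_to_ucs, ccs_from_ucs and demo_module are hypothetical, a zero table entry is assumed to mean "no mapping", and the tables cover only three KOI8-R letters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical return value for "no mapping". */
#define NO_MAPPING ((uint16_t)0xFFFF)

/* Hypothetical "virtual method" structure exported by each module. */
struct ccs_module {
    const char *name;
    uint16_t (*to_ucs)(uint8_t byte);     /* charset -> Unicode */
    uint16_t (*from_ucs)(uint16_t ucs);   /* Unicode -> charset */
};

/* Flat layout: 128 ushorts for bytes 0x80..0xFF (0 = unmapped here). */
static const uint16_t demo_to_table[128] = {
    [0xC1 - 0x80] = 0x0430,   /* CYRILLIC SMALL LETTER A  */
    [0xC2 - 0x80] = 0x0431,   /* CYRILLIC SMALL LETTER BE */
    [0xC4 - 0x80] = 0x0434,   /* CYRILLIC SMALL LETTER DE */
};

/* Two-level layout: 256 pointers, one per high byte of the code point,
 * each pointing to 256 ushorts indexed by the low byte. Unused blocks
 * stay NULL, so sparse mappings cost little space. */
static const uint16_t demo_block_04[256] = {
    [0x30] = 0xC1,
    [0x31] = 0xC2,
    [0x34] = 0xC4,
};

static const uint16_t *const demo_from_table[256] = {
    [0x04] = demo_block_04,   /* U+0400..U+04FF */
};

static uint16_t demo_to_ucs(uint8_t byte)
{
    uint16_t ucs;

    if (byte < 0x80)
        return byte;                        /* ASCII range maps to itself */
    ucs = demo_to_table[byte - 0x80];
    return ucs != 0 ? ucs : NO_MAPPING;
}

static uint16_t demo_from_ucs(uint16_t ucs)
{
    const uint16_t *block;

    if (ucs < 0x80)
        return ucs;
    block = demo_from_table[ucs >> 8];      /* first level: high byte */
    if (block == NULL || block[ucs & 0xFF] == 0)
        return NO_MAPPING;
    return block[ucs & 0xFF];               /* second level: low byte */
}

const struct ccs_module demo_module = {
    "demo-koi8-r-subset", demo_to_ucs, demo_from_ucs
};

/* The core would call modules only through the method pointers, so a
 * module is free to use any table format, or none at all. */
uint16_t ccs_to_ucs(const struct ccs_module *m, uint8_t byte)
{
    assert(m != NULL && m->to_ucs != NULL);
    return m->to_ucs(byte);
}

uint16_t ccs_from_ucs(const struct ccs_module *m, uint16_t ucs)
{
    assert(m != NULL && m->from_ucs != NULL);
    return m->from_ucs(ucs);
}
```

Because callers see only the two function pointers, a module for characters outside the 16-bit range could switch to a different table format without changes to the core, although in this sketch the 16-bit signatures would have to widen as well.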

--
Konstantin Chuguev - Application Engineer
Francis House, 112 Hills Road
Cambridge CB2 1PQ, United Kingdom
D A N T E    WWW: http://www.dante.net