Date: Thu, 24 Aug 2000 17:39:39 +0700 (ALMST)
From: Boris Popov <bp@butya.kz>
To: freebsd-arch@freebsd.org, freebsd-i18n@freebsd.org, Konstantin Chuguev <Konstantin.Chuguev@dante.org.uk>
Subject: Proposal to include iconv library in the base system.
Message-ID: <Pine.BSF.4.10.10008241719320.80086-100000@lion.butya.kz>
[This message is cc'ed to -i18n, which has had zero activity in the last month.]

Proposal to include the iconv library and the iconv(1) program in the base system.

This library of functions and its companion iconv program convert between various single-byte and multibyte charsets. The iconv* functions are essential in mixed networks and on local machines with multiple charsets. FreeBSD already contains a few character conversion schemes for msdosfs, nwfs, cd9660fs and the syscons mapping tables. However, the usage of these tables is not standardized and they support only a small number of character sets. Many external packages like KDE and GNOME also rely on the iconv functions.

Konstantin Chuguev wrote the original code under the BSD license and I modified it slightly.

The Open Group has a description of the iconv functions online:
http://www.opengroup.org/onlinepubs/7908799/xsh/iconv.html

A brief overview of character sets is available at:
http://www.austin.ibm.com/doc_link/en_US/a_doc_lib/aixprggd/genprogc/codeset_over.htm

Short Introduction on Library Design and Implementation:

The library consists of a core part, the Character Encoding Scheme (CES) modules and the Character Conversion Scheme (CCS) modules. The core part contains the exposed user functions and the internal framework for modules. To provide the maximum number of supported character set combinations, the library uses Unicode as the intermediate charset. CES and CCS modules contain the conversion logic and conversion tables that map characters between Unicode and the target charset. The entire character conversion process looks like this:

    charset1 -> unicode -> charset2

In addition, it is possible to perform conversion only to or from Unicode.

Modules are implemented as shared libraries and loaded via the dlopen() function. Modules reside in the /usr/lib/iconv/ directory and can be dynamically added to the system.

To make the iconv subsystem more flexible, it has a "converter" layer which allows additional converters to be added.
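
The charset1 -> unicode -> charset2 path described above can be sketched from the caller's side using only the standard iconv_open(), iconv() and iconv_close() calls documented in the Open Group reference. This is a minimal sketch under stated assumptions, not code from the proposed library: convert_buffer() and convert_via_unicode() are hypothetical helper names, and the choice of UTF-8 as the intermediate encoding and the 1024-byte pivot buffer are assumptions made for illustration. The library itself performs the intermediate step internally.

```c
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the proposed library): convert inlen
 * bytes of `in` from charset `from` to charset `to`, writing at most
 * outcap bytes to `out`.  Returns the number of bytes written, or
 * (size_t)-1 on any error (unknown charset, invalid input, output full).
 */
size_t convert_buffer(const char *to, const char *from,
                      const char *in, size_t inlen,
                      char *out, size_t outcap)
{
    /* Note the argument order: target charset first, source second. */
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;   /* iconv() takes char **, so drop const */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);

    /* A NULL input asks iconv() to emit any pending reset sequence,
     * which matters for stateful encodings. */
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);

    iconv_close(cd);          /* always release the descriptor */

    return rc == (size_t)-1 ? (size_t)-1 : outcap - outleft;
}

/*
 * charset1 -> unicode -> charset2, done explicitly in two steps with
 * UTF-8 as the intermediate form (an assumption for this sketch).
 * Input whose UTF-8 form exceeds the pivot buffer is rejected.
 */
size_t convert_via_unicode(const char *to, const char *from,
                           const char *in, size_t inlen,
                           char *out, size_t outcap)
{
    char pivot[1024];
    size_t n = convert_buffer("UTF-8", from, in, inlen,
                              pivot, sizeof pivot);
    if (n == (size_t)-1)
        return (size_t)-1;
    return convert_buffer(to, "UTF-8", pivot, n, out, outcap);
}
```

For example, passing the four ISO-8859-1 bytes "caf\xe9" to convert_via_unicode() with "UTF-8" as the target yields five bytes, because the accented letter becomes a two-byte sequence. Which charset names are accepted depends on the iconv implementation installed on the system.
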
Given two arbitrarily chosen charsets, charset1 and charset2, the converter lets a program open a conversion, perform it, and then close it, releasing its resources. For now the library has only the so-called Unicode Converter (UC). As an example of another converter, it would be possible to write an XLAT converter that supports direct, table-based conversion between known character sets. Of course, a new converter can use its own modules.

Since support for multiple character sets is also required in the kernel, there is a kernel part which provides nearly the same set of functions in kernel space. Conversion tables are uploaded to kernel memory via the sysctl interface from the corresponding userland modules (no code, only data).

The first open question is which character sets should be included in the base system and which should be supplied as packages. Obviously, conversion tables occupy 99% of the space:

Part name                 Size of source code
------------------------  -------------------
Library                   83K
Base character sets       218K   (ISO-8859*, cp8??, windows-125?)
CJK                       5548K  (big5, cns*, gb*, jis*, cp9??)
RFC1345 character sets    1064K
Unicode character sets    711K
---------------------------------------------

Secondly, where should the functions be placed? Initially, the iconv library was a separate library (libiconv*). However, it seems that Solaris has the functions in libc and Linux in glibc. I do not know how HP-UX does this.

And the third question is where I should place the source code for character conversion schemes in the source tree.

Of course, to respect maintainers of embedded systems and those who have to deal with only one charset, an option 'NO_ICONV' will be provided.

I would appreciate any feedback on this topic.

P.S. The sources of libiconv in its current state are available at http://www.butya.kz/~bp/inode/

--------------------------------------------

Michael C.
Wu (keichii@iteration.net) reviewed my proposal before it was posted and made some comments:

1) Does this allow for small patchsets to the character tables? For example, Unicode does not map completely to BIG-5. Some implementations map the differences directly to blank space, while others map to equivalent characters. Depending on the user's choice, one should be able to specify a small change to the charset table without disruption. I think it is possible and the author probably already knows the way :)

2) We should include all EUC and ISO charsets, even if some of them are almost never used, to conform to standards.

3) I suggest having the character tables in /usr/libdata. /usr/share should also have a directory that contains the mappings. For the kernel, perhaps we should have a src/sys/i18n and put iconv into src/sys/i18n/iconv. As to the libc code, to avoid trouble compiling and patching ports, we should follow what Linux does in libc.

4) I do not think the charset tables will be bigger than 15 MB in total.

--
Boris Popov
http://www.butya.kz/~bp/

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message