From: Konstantin Chuguev
Date: Fri, 01 Sep 2000 17:16:38 +0100
To: "Andrey A. Chernov"
Cc: Boris Popov, freebsd-arch@FreeBSD.ORG, freebsd-i18n@FreeBSD.ORG
Subject: Re: Proposal to include iconv library in the base system.

"Andrey A. Chernov" wrote:

> On Thu, Aug 24, 2000 at 05:39:39PM +0700, Boris Popov wrote:
> > FreeBSD already contains a few character conversion schemes for
> > msdosfs, nwfs, cd9660fs and syscon mapping tables. However, the usage
>
> We need XLAT converters for them, not Unicode one, as I understand Unicode
> data loaded into kernel will be too big.
>

It depends on what you mean by Unicode data. The Unicode site has a plain-text table of Unicode data, with the number of records approximately equal to 0xFFFF - . Each record corresponds to a Unicode character and can have ten or more fields, among them the canonical name of the character, information about capital and small letters, directionality and so on.
This information is intended for processing and [de]normalizing Unicode text. None of it is needed for charset conversion.

The most commonly used 8-bit charsets for filesystems are ISO-8859, Windows-125x, IBM-86x and KOI8-R. It is easy to create XLAT tables for conversion between pairs of these charsets; obviously we won't need the "full mesh" of pairs here, only tables for charsets used for the same language. Simplifying, we will need N * 2 * 256 bytes for all tables, where N is the number of charset pairs and each pair needs one 256-byte table per direction. There are two problems here:

* it is not so easy to guess the number N. For 4 charsets used for Russian we will need 6 * 2 tables (for 5 charsets - 10 * 2 tables); add other languages;
* new filesystems use Unicode encodings: UCS-2 (Windows), some may use UTF-8. These encodings are not supported by XLAT.

iconv CCS modules consist of 2 tables each. One table translates from the charset to Unicode (UCS-4), the other from UCS-4 to the charset. There are 4 table types currently supported: 7-bit, 8-bit, 14-bit and 16-bit. The table layout is hidden behind the module interface; the two functions that actually do the conversion are as follows (the names of functions and arguments here don't exactly match those in the iconv implementation):

    ucs4_t iconv_ccs_convert_to_ucs(void *module, ucs4_t charset_char);
    ucs4_t iconv_ccs_convert_from_ucs(void *module, ucs4_t ucs_char);

The internal 14-bit and 16-bit tables are two-level, not flat.

There is one more type of CCS module I'm thinking about: 32-bit tables for translation between the full range of CJK characters in, say, BIG5 or CNS11643, and the next version of the Unicode standard with these characters added.

1. charsets for right-to-left scripts; they use special control characters for changing the direction of writing; the algorithm is different from the Unicode one; more complicated logic is necessary for these charsets.

The second type of modules is CES - character encoding schemes.
Their interface is similar to the following:

    ucs4_t ccs_convert_to_ucs(void *module, unsigned char **srcstr, unsigned *srcbytelen);
    int ccs_convert_from_ucs(void *module, unsigned char **dststr, unsigned *dstbytelen, ucs4_t srcchar);

The difference from CCS is that there is no fixed-length correspondence between the UCS characters and the characters of the original charset. Each UCS character can be translated to/from 0 to N bytes of text encoded in the original character encoding scheme. The currently supported schemes are:

* _tbl_simple - used for most European charsets (for ASCII and all 8-bit charsets); it simply uses the corresponding CCS module;
* EUC family for CJK;
* ISO-2022-xx for CJK;
* UCS-4, UCS-2, UTF-16, UTF-8, UTF-7.

A new type of CES module could be one for the charsets used for Arabic and Hebrew, where a more complicated algorithm is used to convert directionality control characters from/to Unicode.

Now, all the modules are loadable and shareable. If the system is using a fair number of charsets at the same time, the amount of table data loaded into the kernel can actually be smaller than when loading all the corresponding pairs of XLAT tables.

> >
> > The questionable part is a which set of character sets should be
> > included in the base system and which should be supplied as packages.
>
> We need to include all charsets we have locale support in the base system.
>

Exactly, this is what was intended. All [UNIX] charsets supported in the FreeBSD distribution (i.e. those present in the locale directory) PLUS the charsets used in other types of filesystems (Windows, Netware?, MacOS?) for the languages supported by FreeBSD (see locale again). Otherwise there is not much need to include iconv in the kernel at the moment. Perhaps minus the CJK charsets, due to their size. I don't know if there is a need for CJK charset conversion for filesystems. All other modules can easily be installed from ports/packages.

> >
> > Secondly, where should the functions be placed?
> > Initially, the iconv
> > /usr/libdata/iconv
> I think this case is much the same as for PAM modules.
>
> What I am not understand at this moment: how iconv handles non-convertable
> characters? I don't see any way to set fill character in described
> interface.
>

According to the standard, iconv stops when it finds an illegal byte sequence in the source (input charset). If a character has no counterpart in the destination charset, the behaviour of iconv is implementation-dependent. My implementation currently translates such a character into a substitution character predefined at compile time. I don't remember, though, whether it is '_' or '?' :-)

I will try my best to produce the final version 1.0 of the library and conversion modules before Monday.

--
Konstantin Chuguev - Application Engineer
Francis House, 112 Hills Road
Cambridge CB2 1PQ, United Kingdom
D A N T E    WWW: http://www.dante.net
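
For illustration only, here is a minimal sketch in C of how a two-level "from UCS" table with a compile-time substitution character might look. It is not the code of the iconv library discussed in this message: the names ccs_table, ccs_from_ucs, ccs_from_ucs_subst, ccs_demo_init, CCS_INVALID and CCS_SUBST are all hypothetical, the choice of '?' is arbitrary, and the demo table maps only ASCII plus two Cyrillic letters to KOI8-R.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ucs4_t;

#define CCS_INVALID 0xFFFFu  /* hypothetical "no mapping" marker */
#define CCS_SUBST   '?'      /* hypothetical compile-time substitution char */

/*
 * Hypothetical two-level table: the high byte of a 16-bit UCS code selects
 * a 256-entry block, the low byte selects the entry inside that block.
 * Blocks that contain no mappings are simply left as NULL.
 */
struct ccs_table {
    const uint16_t *blocks[256];
};

/* Look up one UCS character; returns CCS_INVALID if there is no mapping. */
static ucs4_t
ccs_from_ucs(const struct ccs_table *t, ucs4_t ucs)
{
    const uint16_t *block;

    assert(t != NULL);
    if (ucs > 0xFFFF)
        return CCS_INVALID;
    block = t->blocks[ucs >> 8];
    if (block == NULL)
        return CCS_INVALID;
    return block[ucs & 0xFF];
}

/* Same lookup for an 8-bit charset, replacing unmapped characters. */
static unsigned char
ccs_from_ucs_subst(const struct ccs_table *t, ucs4_t ucs)
{
    ucs4_t c = ccs_from_ucs(t, ucs);

    return (c == CCS_INVALID) ? (unsigned char)CCS_SUBST : (unsigned char)c;
}

/* Demo data: ASCII maps to itself, plus two KOI8-R letters. */
static uint16_t demo_block00[256];
static uint16_t demo_block04[256];

static void
ccs_demo_init(struct ccs_table *t)
{
    int i;

    for (i = 0; i < 256; i++) {
        t->blocks[i] = NULL;
        demo_block00[i] = (i < 0x80) ? (uint16_t)i : CCS_INVALID;
        demo_block04[i] = CCS_INVALID;
    }
    demo_block04[0x10] = 0xE1;  /* U+0410 -> KOI8-R 0xE1 */
    demo_block04[0x30] = 0xC1;  /* U+0430 -> KOI8-R 0xC1 */
    t->blocks[0x00] = demo_block00;
    t->blocks[0x04] = demo_block04;
}
```

With this layout only the blocks that actually contain mappings take up memory, which is the usual reason for preferring a two-level table over a flat 64K-entry one. A caller would run ccs_demo_init() on a struct ccs_table once and then call ccs_from_ucs_subst() per character; a character outside the table, such as U+20AC, comes back as the substitution character.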