Date: Tue, 29 Sep 2009 08:00:45 +0000 (UTC)
From: Edwin Groothuis <edwin@FreeBSD.org>
To: src-committers@freebsd.org, svn-src-user@freebsd.org
Subject: svn commit: r197610 - user/edwin/locale
Message-ID: <200909290800.n8T80jAq078943@svn.freebsd.org>

Author: edwin
Date: Tue Sep 29 08:00:45 2009
New Revision: 197610
URL: http://svn.freebsd.org/changeset/base/197610

Log:
  This is kind of progress report / manual / background etc. Should go in
  the Wiki too.

Added:
  user/edwin/locale/README.locale

Added: user/edwin/locale/README.locale
==============================================================================
--- /dev/null 00:00:00 1970 (empty, because file is newly added)
+++ user/edwin/locale/README.locale Tue Sep 29 08:00:45 2009 (r197610)
@@ -0,0 +1,188 @@
+New approach to the FreeBSD locale database
+===========================================
+
+Background
+----------
+Over the years the FreeBSD locale database (share/colldef,
+share/monetdef, share/msgdef, share/numericdef, share/timedef) has
+accumulated a total of 165 definitions (language - country-code -
+character-set triplets). The contents of the files are often
+low-ASCII for Western European languages, but partly or fully
+high-ASCII for Eastern European and Asian languages. Without knowing
+how to display or interpret the character-sets, it is difficult for
+the general audience to make sure that the local language (language -
+country-code) definitions are displayed properly in the various
+character-sets.
+
+
+Solution
+--------
+With one low-ASCII file per definition (language - country-code) that
+holds the definitions of the characters for the fields, it would be
+possible to generate the various character-sets for that language.
+
+
+What do we need
+---------------
+- A database with all character encoding definitions. The Unicode
+  Project defines these.
+- An intermediate format which can be used to convert these encodings
+  into unique characters. The UTF-8 character-set supports this.
+- A tool to convert from the intermediate format into the various
+  character-sets. Libiconv (GPL) and bsdiconv (BSDL) can do this.
+- A Makefile which glues everything together.
+
+
+Gotchas
+-------
+- Some countries not only have multiple languages (nl_BE and
+  fr_BE for example), but some languages also come in different
+  scripts: sr_Cyrl_RS and sr_Latn_RS.
+- Duplicate detection has always been a manual thing and is tricky
+  to do initially. Right now this remains the job of the
+  maintainers of the locale data in the SCM repository.
+
+
+Examples
+--------
+
+The word for Sunday in the en_US language - country
+code would be in Unicode format:
+    <LATIN CAPITAL LETTER S><LATIN SMALL LETTER U>
+    <LATIN SMALL LETTER N><LATIN SMALL LETTER D>
+    <LATIN SMALL LETTER A><LATIN SMALL LETTER Y>
+Converted into UTF-8 this will be:
+    Sunday
+Converted into ISO-8859-1 this will be:
+    Sunday
+
+The word for Saturday in the ru_RU language -
+country code would be in Unicode format:
+    <CYRILLIC SMALL LETTER ES><CYRILLIC SMALL LETTER U>
+    <CYRILLIC SMALL LETTER BE><CYRILLIC SMALL LETTER BE>
+    <CYRILLIC SMALL LETTER O><CYRILLIC SMALL LETTER TE>
+    <CYRILLIC SMALL LETTER A>
+Converted into UTF-8 this will be:
+    <D1><81><D1><83><D0><B1><D0><B1><D0><BE><D1><82><D0><B0>
+Converted into KOI8-R this will be:
+    <D3><D5><C2><C2><CF><D4><C1>
+
+
+Careful!
+--------
+- In the timedef definitions, do not convert the %A into Unicode
+  format because the %A is a low-ASCII input for strftime(). Also
+  don't put the md_order in Unicode format because that is a low-ASCII
+  definition.
+- libiconv doesn't understand ISCII-DEV; bsdiconv calls it macdevanaga.
+- Backwards compatibility: There are a bunch of old or obsolete
+  names in the FreeBSD locale definitions (sr_YU -> sr_Cyrl_RS and
+  sr_Latn_RS, zh_HK -> zh_Hant_HK, zh_CN -> zh_Hans_CN) which still
+  might be needed.
+
+
+Current status
+--------------
+
+Finished:
+- Conversion of the current locale data into the Unicode format for
+  share/monetdef, share/msgdef, share/numericdef, share/timedef.
+- Conversion of the current Makefiles to support the new approach.
+  It also adds the file src/share/Makefile.def.inc which does
+  the magic between the definitions in the Makefile and the FreeBSD
+  bsd.*.mk. Done for share/colldef, share/monetdef, share/msgdef,
+  share/numericdef, share/timedef.
+- Regression check.
+- Conversion of the Unicode definitions to the UTF-8 character-set.
+
+Pending:
+- Checking of the data with the CLDR (Common Locale Data Repository)
+  for completeness of the current data.
+- Conversion of Makefiles for share/mklocale.
+- Conversion of the Unicode definitions to the UTF-8 character-set
+  in a C program or AWK script to make it self-hosting. Right now
+  this is a Perl script, so it can't be part of the base OS build
+  yet. This tool for now lives in src/tools/tools/locale/.
+- Import of the file UTF-8.cm (from the CLDR project) and the file
+  UnicodeData.txt (from the Unicode project) into the base operating
+  system. These files for now live in src/tools/tools/locale/.
+
+Pending third parties:
+- bsdiconv in the main system.
+
+
+SCM
+---
+
+(Currently the SCM contains all the definitions (language - country-code
+- character-set) in low and high-ASCII. To keep the SCM history, we
+will move them once to their .unicode extension and then overwrite
+them with the Unicode encoding definitions.)
+
+The .unicode files are stored in SCM and will, in the long term,
+be the only source in SCM. Right now, due to the lack of bsdiconv in
+the base operating system, we also have to store the character-map
+source (.src) files in the SCM. Once bsdiconv is in the base
+system these files can be removed and the whole database can be
+made self-hosted.
+
+
+Testing (before move to src/tools/tools/locale)
+-----------------------------------------------
+
+To test the current system, you need the following data:
+
+- A copy of the CLDR, available from http://cldr.unicode.org/.
+  Currently version 1.7.1 is used. We only use the file posix/UTF-8.cm
+  from it.
+- A copy of the Unicode database, available from http://www.unicode.org/.
+  Currently version 5.1.0 is used. We only use the file UnicodeData.txt from it.
+- A copy of svn://svn.freebsd.org/base/user/edwin/locale/.
+- A copy of bsdiconv from p4:///depot/gabor/something.
+
+Local configuration:
+
+- Add to /etc/make.conf (make sure they match your directory layout)
+    CLDRDIR=         /home/edwin/unicode/cldr/1.7.1
+    UNIDATADIR=      /home/edwin/unicode/UNIDATA/5.1.0
+    TOOLSDIR=        /home/edwin/svn/edwin/locale/cldr/tools/
+    LOCALE_DESTDIR=  /home/edwin/locale/new
+    LOCALE_SHAREOWN= edwin
+    LOCALE_SHAREGRP= edwin
+
+Test it out:
+
+- Go to the SVN directory /user/edwin/locale/share. The Makefile
+  there only includes the locale directories, so there is no need to
+  be worried about the others.
+
+- Run "FULL=1 make clean" to get rid of all generated files, even
+  the ones in the SCM. You should only have the *.unicode and the
+  Makefiles now.
+
+- Run "FULL=1 make" to recreate everything.
+
+- Run "make clean" to get rid of all data not in the SCM.
+
+- Run "make" to recreate the data not in the SCM.
+
+#
+# All targets for TARGET_CHARACTERMAP
+#
+# .unicode -> .utf-8.src -> .utf-8.out
+#                      \__ .iso8859-1.src -> .iso8859-1.out
+# <----1---><--2---><------3--------><----4----->
+#
+# 1. The files .unicode are stored in the SCM and are the source
+#    for the whole further system
+# 2. The Perl script converts the .unicode files and the Unicode
+#    CLDR database into UTF-8 code
+# 3. The UTF-8 gets converted by libiconv or bsdiconv into the specific
+#    character-map.
+# 4. Get rid of the comments.
+#
+# As long as there is no bsdiconv, the files with the extension
+# .unicode and .src must be stored in the SCM and will not be
+# generated as part of the build process.
+#
+
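
The conversion described in the Examples section of README.locale can be
tried out with a short sketch. This is not the Perl tool from the commit.
It is a minimal Python illustration that assumes Python's unicodedata
module can stand in for UnicodeData.txt and that the built-in utf-8,
iso8859-1 and koi8_r codecs can stand in for libiconv or bsdiconv. The
names names_to_text and to_hex_tokens are hypothetical helpers made up
for this illustration.

```python
import re
import unicodedata

# One <UNICODE CHARACTER NAME> token, as written in the README examples.
NAME_TOKEN = re.compile(r"<([^<>]+)>")


def names_to_text(spec: str) -> str:
    """Replace each <NAME> token by its character; other text is kept.

    Leaving non-token text alone means low-ASCII fields such as %A
    pass through unchanged.
    """
    return NAME_TOKEN.sub(lambda m: unicodedata.lookup(m.group(1)), spec)


def to_hex_tokens(text: str, charset: str) -> str:
    """Encode text in charset, showing bytes >= 0x80 as <XX>."""
    return "".join(
        chr(b) if b < 0x80 else f"<{b:02X}>" for b in text.encode(charset)
    )


# The two words from the Examples section, in the Unicode-name format.
SUNDAY_EN = (
    "<LATIN CAPITAL LETTER S><LATIN SMALL LETTER U>"
    "<LATIN SMALL LETTER N><LATIN SMALL LETTER D>"
    "<LATIN SMALL LETTER A><LATIN SMALL LETTER Y>"
)
SATURDAY_RU = (
    "<CYRILLIC SMALL LETTER ES><CYRILLIC SMALL LETTER U>"
    "<CYRILLIC SMALL LETTER BE><CYRILLIC SMALL LETTER BE>"
    "<CYRILLIC SMALL LETTER O><CYRILLIC SMALL LETTER TE>"
    "<CYRILLIC SMALL LETTER A>"
)

if __name__ == "__main__":
    for spec, charsets in ((SUNDAY_EN, ("utf-8", "iso8859-1")),
                           (SATURDAY_RU, ("utf-8", "koi8_r"))):
        text = names_to_text(spec)
        for charset in charsets:
            print(charset, to_hex_tokens(text, charset))
```

For the ru_RU word this reproduces the byte sequences listed in the
README for both UTF-8 and KOI8-R, and a string such as "%A" is left
untouched because it contains no <NAME> tokens.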