Date: Sat, 02 Aug 2014 05:22:38 +0400
From: Dmitry Selyutin <ghostman.sd@gmail.com>
To: Pedro Giffuni <pfg@FreeBSD.org>, David Chisnall <theraven@freebsd.org>, soc-status@FreeBSD.org
Subject: Report #5: Unicode support
Message-ID: <53DC3D5E.5080909@gmail.com>
In-Reply-To: <53DC3C41.7070105@gmail.com>
References: <53DC3C41.7070105@gmail.com>
Sorry, I forgot to change the subject according to the rules. I'm sending this message again so that anyone can find it more easily. Sorry for being annoying.

Hello everyone!

Here is my report on the progress made during this period. I've implemented the actual Unicode Collation Algorithm for DUCET (Default Unicode Collation Element Table). I had to rewrite the entire implementation: I wasn't satisfied with its quality or with the way I had organized the source code, so I reverted my code and started again. My previous implementation was full of hard-coded parts, which made it hard to reuse anything from it in any other project.

The entire implementation is now available in include/unicode.h and lib/libc/unicode. If the macro _UNICODE_SOURCE is defined, wcscoll() uses the new collation algorithm. struct _xlocale was modified to carry two new members, colltable and collsize, which are simply passed on to __ucscoll(). If an element is not found in the given table, or the table is NULL, __ucscoll() tries to find the element in DUCET; if it is not found there either, __ucscoll() generates a collation element for it.

I couldn't understand how the alternate should be used, though; it seems that it can be dropped, since wcscoll() doesn't have any version that supports tailoring. I've left it in for now, but I'm pretty sure that we can omit it.

I haven't had time to test wcscoll() more thoroughly (especially with the files provided by the Unicode Character Database), so this is the task I will do right now. :-) There are still several ways to improve the speed of the algorithm, but I feel that the time for that hasn't come yet. style(9) issues (if any) will also be handled; I'm just too tired to do it right now.

__ucscoll() just calls __ucsxfrm() and then compares the resulting strings using wcscmp() (this is the only platform-dependent part of the code; I was too lazy to write __ucslen(), so I left it as it is).
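For readers who want to see the shape of this, here is a rough sketch of the lookup order (given table, then the default table, then generated weights) and of the transform-then-wcscmp() comparison. It is not the code from lib/libc/unicode: struct coll_entry, toy_ducet, toy_xfrm() and toy_coll() are hypothetical names, the table weights are made up, and only two levels are built to keep it short. The generated weights follow the implicit-weight formula from UTS #10 with the 0xFBC0 base only. Because wcscmp() stops at a zero, the sketch assumes a level separator of 0x0001 instead of the usual 0x0000.

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

#define KEY_MAX		64	/* hypothetical fixed key size; longer keys are truncated */
#define LEVEL_SEPARATOR	0x0001	/* assumption: nonzero so that wcscmp() can be used */

/* Hypothetical entry: one code point mapped to one two-level element. */
struct coll_entry {
	uint32_t cp;
	uint16_t primary;
	uint16_t secondary;
};

/* Toy stand-in for DUCET; the weights are invented for illustration. */
static const struct coll_entry toy_ducet[] = {
	{ 'a',  0x1000, 0x0020 },
	{ 'b',  0x1001, 0x0020 },
	{ 0xE1, 0x1000, 0x0021 },	/* a with acute: same primary as 'a' */
};

static const struct coll_entry *
find_entry(const struct coll_entry *table, size_t size, uint32_t cp)
{
	size_t i;

	if (table == NULL)
		return (NULL);
	for (i = 0; i < size; i++)
		if (table[i].cp == cp)
			return (&table[i]);
	return (NULL);
}

/* Given table first, then the default table, then generated weights. */
static int
get_elements(uint32_t cp, const struct coll_entry *table, size_t size,
    uint16_t prim[2], uint16_t sec[2])
{
	const struct coll_entry *e;

	e = find_entry(table, size, cp);
	if (e == NULL)
		e = find_entry(toy_ducet,
		    sizeof(toy_ducet) / sizeof(toy_ducet[0]), cp);
	if (e != NULL) {
		prim[0] = e->primary;
		sec[0] = e->secondary;
		return (1);
	}
	/* Implicit weights: [.AAAA.0020][.BBBB.0000], base 0xFBC0 only. */
	prim[0] = (uint16_t)(0xFBC0 + (cp >> 15));
	sec[0] = 0x0020;
	prim[1] = (uint16_t)((cp & 0x7FFF) | 0x8000);
	sec[1] = 0;
	return (2);
}

/* Build a key: all primaries, the separator, then all secondaries. */
static size_t
toy_xfrm(wchar_t *key, size_t cap, const wchar_t *s,
    const struct coll_entry *table, size_t size)
{
	uint16_t prim[2], sec[2], w;
	const wchar_t *p;
	size_t len = 0;
	int i, level, n;

	for (level = 0; level < 2; level++) {
		if (level == 1 && len + 1 < cap)
			key[len++] = LEVEL_SEPARATOR;
		for (p = s; *p != L'\0'; p++) {
			n = get_elements((uint32_t)*p, table, size, prim, sec);
			for (i = 0; i < n; i++) {
				w = (level == 0) ? prim[i] : sec[i];
				/* Zero weights are left out of the key. */
				if (w != 0 && len + 1 < cap)
					key[len++] = (wchar_t)w;
			}
		}
	}
	key[len] = L'\0';
	return (len);
}

/* Transform both strings, then compare the keys with wcscmp(). */
static int
toy_coll(const wchar_t *a, const wchar_t *b,
    const struct coll_entry *table, size_t size)
{
	wchar_t ka[KEY_MAX], kb[KEY_MAX];

	toy_xfrm(ka, KEY_MAX, a, table, size);
	toy_xfrm(kb, KEY_MAX, b, table, size);
	return (wcscmp(ka, kb));
}
```

With a NULL table, toy_coll(L"a", L"b", NULL, 0) is negative; passing a one-entry table that gives 'b' a primary weight below 0x1000 reverses that order, which is the role colltable and collsize play here. A character that is in neither table, such as 'z' in this toy setup, gets generated weights and sorts after everything listed.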
This collation algorithm supports three levels; the last one, IIRC, is usually the character itself if not defined, so I decided to omit it (especially since I'm not sure how variable weights should be handled). Any thoughts?

-- 
With best regards,
Dmitry Selyutin