Date: Sat, 02 Aug 2014 05:17:53 +0400
From: Dmitry Selyutin <ghostman.sd@gmail.com>
To: Pedro Giffuni <pfg@FreeBSD.org>, David Chisnall <theraven@freebsd.org>, soc-status@FreeBSD.org
Subject: Report #5
Message-ID: <53DC3C41.7070105@gmail.com>
Hello everyone!

Here is my report on the progress made since the last one. I've implemented the actual Unicode Collation Algorithm for DUCET (Default Unicode Collation Element Table). I had to rewrite the entire implementation: I wasn't satisfied with its quality or with the way I had organized my source code, so I reverted my code and started again. My previous implementation was full of hard-coded parts, which made it hard to reuse anything from it in any other project.

The entire implementation is now available in include/unicode.h and lib/libc/unicode. If the macro _UNICODE_SOURCE is defined, wcscoll() uses the new collation algorithm. struct _xlocale was modified to carry two new members, colltable and collsize, which are simply passed to __ucscoll(). If an element is not found in the given table, or the table is NULL, __ucscoll() tries to find the element in DUCET; if it is not found there either, __ucscoll() generates a collation element for it.

I couldn't understand how the alternate setting should be used, though; it seems that it can be dropped, since wcscoll() doesn't have any version that supports tailoring. I left it in for now, but I'm pretty sure that we can omit it.

I haven't had time to test wcscoll() more thoroughly (especially with the files provided by the Unicode Character Database), so this is the task I will do right now. :-) There are still several ways to improve the speed of the algorithm, but I feel that the time for that hasn't come yet. style(9) issues (if any) will also be handled; I'm just too tired to do it right now.

__ucscoll() just calls __ucsxfrm() and then compares the resulting strings using wcscmp(). This is the only platform-dependent part of the code: I was too lazy to write __ucslen(), so I left it as it is. The collation algorithm supports three levels; the last one, IIRC, is usually the character itself if not defined, so I decided to omit it (especially since I'm not sure how variable weights should be handled).

Any thoughts?

-- 
With best regards,
Dmitry Selyutin
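To make the lookup order and the transform-then-compare approach described in this report a bit more concrete, here is a rough sketch. It is not the code from lib/libc/unicode: struct coll_entry, toy_ducet, find_entry(), get_elements(), toy_xfrm() and toy_coll() are hypothetical names, the table weights are invented, and the real DUCET is far larger. The implicit weights follow the formula from UTS #10, simplified to always use the base 0xFBC0. Since the keys are compared with wcscmp(), the sketch assumes a level separator of 1 instead of 0.

```c
#include <stddef.h>
#include <wchar.h>

#define LEVELS		3	/* primary, secondary, tertiary */
#define LEVEL_SEP	1	/* assumed to be lower than every real weight */
#define KEY_MAX		256	/* longer keys are silently truncated here */

/* Hypothetical table entry: one code point mapped to one collation element. */
struct coll_entry {
	wchar_t		code;
	unsigned short	weight[LEVELS];
};

/* Toy stand-in for DUCET; these weights are invented for illustration. */
static const struct coll_entry toy_ducet[] = {
	{ L'a', { 0x100, 0x20, 0x02 } },
	{ L'A', { 0x100, 0x20, 0x08 } },
	{ L'b', { 0x101, 0x20, 0x02 } },
};
#define TOY_DUCET_SIZE	(sizeof(toy_ducet) / sizeof(toy_ducet[0]))

/* Linear search; a NULL table simply yields no match. */
static const struct coll_entry *
find_entry(const struct coll_entry *table, size_t size, wchar_t code)
{
	size_t i;

	if (table == NULL)
		return (NULL);
	for (i = 0; i < size; i++)
		if (table[i].code == code)
			return (&table[i]);
	return (NULL);
}

/*
 * Lookup order: given table, then the default table, then implicit weights
 * derived from the code point.  Returns the number of elements in out.
 */
static size_t
get_elements(const struct coll_entry *table, size_t size, wchar_t code,
    struct coll_entry out[2])
{
	const struct coll_entry *e;
	unsigned long cp = (unsigned long)code;

	e = find_entry(table, size, code);
	if (e == NULL)
		e = find_entry(toy_ducet, TOY_DUCET_SIZE, code);
	if (e != NULL) {
		out[0] = *e;
		return (1);
	}
	/* Implicit weights: [AAAA.0020.0002][BBBB.0000.0000]. */
	out[0].code = out[1].code = code;
	out[0].weight[0] = (unsigned short)(0xFBC0 + (cp >> 15));
	out[0].weight[1] = 0x20;
	out[0].weight[2] = 0x02;
	out[1].weight[0] = (unsigned short)((cp & 0x7FFF) | 0x8000);
	out[1].weight[1] = 0;
	out[1].weight[2] = 0;
	return (2);
}

/* Build a sort key level by level; returns the untruncated key length. */
static size_t
toy_xfrm(wchar_t *key, size_t keysize, const wchar_t *src,
    const struct coll_entry *table, size_t size)
{
	struct coll_entry ce[2];
	const wchar_t *p;
	size_t i, n, len = 0;
	int level;

	for (level = 0; level < LEVELS; level++) {
		if (level > 0) {
			if (len + 1 < keysize)
				key[len] = LEVEL_SEP;
			len++;
		}
		for (p = src; *p != L'\0'; p++) {
			n = get_elements(table, size, *p, ce);
			for (i = 0; i < n; i++) {
				if (ce[i].weight[level] == 0)
					continue;	/* zero weights are skipped */
				if (len + 1 < keysize)
					key[len] = (wchar_t)ce[i].weight[level];
				len++;
			}
		}
	}
	if (keysize > 0)
		key[len < keysize ? len : keysize - 1] = L'\0';
	return (len);
}

/* Compare two strings by transforming both and comparing the keys. */
static int
toy_coll(const wchar_t *a, const wchar_t *b,
    const struct coll_entry *table, size_t size)
{
	wchar_t ka[KEY_MAX], kb[KEY_MAX];

	toy_xfrm(ka, KEY_MAX, a, table, size);
	toy_xfrm(kb, KEY_MAX, b, table, size);
	return (wcscmp(ka, kb));
}
```

With a NULL table, toy_coll(L"a", L"A", NULL, 0) is negative because the two strings differ only at the third level, while toy_coll(L"A", L"b", NULL, 0) is negative because a primary difference outranks it. Passing a one-entry table that gives L'b' a smaller primary weight than L'a' reverses the order of "a" and "b", which is the role colltable and collsize appear to play in the description above.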