From owner-soc-status@FreeBSD.ORG  Sat Aug  2 01:18:34 2014
Return-Path: <owner-soc-status@FreeBSD.ORG>
Delivered-To: soc-status@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 7086D800;
 Sat,  2 Aug 2014 01:18:34 +0000 (UTC)
Received: from mail-lb0-x230.google.com (mail-lb0-x230.google.com
 [IPv6:2a00:1450:4010:c04::230])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 9694A222B;
 Sat,  2 Aug 2014 01:18:33 +0000 (UTC)
Received: by mail-lb0-f176.google.com with SMTP id u10so3754206lbd.35
 for <multiple recipients>; Fri, 01 Aug 2014 18:18:31 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=from:message-id:date:user-agent:mime-version:to:subject
 :content-type:content-transfer-encoding;
 bh=gsGnxL68OqIdVNT0qb4RPvblkZ2jX9GUX/VTYmrYPJ0=;
 b=dGuddY9DdL0XrMGRHsjcq6yYuldCtsU+UYTyKKf3X3oL1FRFk5Jm+0JPZStpO8Ycbg
 Pzus9bRuKgOK1HJQEJMgLHU+/dVlbRcKbbsr8CV1LHUHC8nlWZfVE6wKiGcbYbLLY8fp
 bdWau9nmD67Jl1SRT//DJsz75qRymCW8RG4BeRUSSiOumdyrT8c8djSW/fxzIrfI9K5k
 lU7KcQThj0W3L54q2qMGqCLs6nkfIex0eZT5iwNJz4LPGZTaYdPqOxagPcAiXHb1wULM
 eT65vYDXF0pfgfZdNvsSEvppLYcIk8HOv4UgIR3KjyrlfOBYFVgHgNLxx/KCHx3BxCWD
 AxHA==
X-Received: by 10.152.184.234 with SMTP id ex10mr9966836lac.53.1406942311160; 
 Fri, 01 Aug 2014 18:18:31 -0700 (PDT)
Received: from openSUSE.linux ([176.100.246.237])
 by mx.google.com with ESMTPSA id x10sm5618990lal.13.2014.08.01.18.18.30
 for <multiple recipients>
 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
 Fri, 01 Aug 2014 18:18:30 -0700 (PDT)
From: Dmitry Selyutin <ghostman.sd@gmail.com>
X-Google-Original-From: Dmitry Selyutin <ghostmansd@gmail.com>
Message-ID: <53DC3C41.7070105@gmail.com>
Date: Sat, 02 Aug 2014 05:17:53 +0400
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:24.0) Gecko/20100101 Thunderbird/24.6.0
MIME-Version: 1.0
To: Pedro Giffuni <pfg@FreeBSD.org>, David Chisnall <theraven@freebsd.org>,
 soc-status@FreeBSD.org
Subject: Report #5
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-BeenThere: soc-status@freebsd.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: Summer of Code Status Reports and Discussion <soc-status.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/soc-status>,
 <mailto:soc-status-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/soc-status/>
List-Post: <mailto:soc-status@freebsd.org>
List-Help: <mailto:soc-status-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/soc-status>,
 <mailto:soc-status-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 02 Aug 2014 01:18:34 -0000

Hello everyone!

Here is my report on the progress made during this period. I've
implemented the actual Unicode Collation Algorithm for DUCET (Default
Unicode Collation Element Table). I had to rewrite the entire
implementation: I wasn't satisfied with its quality or with the way I
had organized the source code, so I reverted my code and started again.
My previous implementation was full of hard-coded parts, which made it
hard to reuse anything from it in any other project. Now the entire
implementation is available in include/unicode.h and lib/libc/unicode.
If the macro _UNICODE_SOURCE is defined, wcscoll() uses the new
collation algorithm. struct _xlocale now has two new members, colltable
and collsize, which are simply passed through to __ucscoll(). If an
element is not found in the given table, or the table is NULL,
__ucscoll() tries to find the element in DUCET; if it is not found there
either, __ucscoll() generates a collation element for it.
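
To make the lookup order above concrete, here is a minimal sketch. It is
not the code from lib/libc/unicode: struct coll_entry, toy_ducet,
table_find() and coll_lookup() are hypothetical names, and the table
weights are invented for illustration. The fallback branch follows the
implicit-weight formula from UTS #10, simplified to the 0xFBC0 base that
applies to non-ideographic code points.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical table entry: one code point, one three-level element. */
struct coll_entry {
	uint32_t cp;
	uint16_t weight[3];
};

/* Toy stand-in for DUCET, sorted by code point; weights are invented. */
static const struct coll_entry toy_ducet[] = {
	{ 0x0061, { 0x1000, 0x0020, 0x0002 } },
	{ 0x0062, { 0x1001, 0x0020, 0x0002 } },
};

static int
entry_cmp(const void *key, const void *elem)
{
	uint32_t cp = *(const uint32_t *)key;
	const struct coll_entry *e = elem;

	return ((cp > e->cp) - (cp < e->cp));
}

/* Binary search; a NULL or empty table simply yields "not found". */
static const struct coll_entry *
table_find(const struct coll_entry *table, size_t size, uint32_t cp)
{
	if (table == NULL || size == 0)
		return (NULL);
	return (bsearch(&cp, table, size, sizeof(*table), entry_cmp));
}

/*
 * Write the collation element(s) for cp into out and return their count.
 * Order: caller-supplied table, then the default table, then implicit
 * weights generated from the code point.
 */
static size_t
coll_lookup(const struct coll_entry *colltable, size_t collsize,
    uint32_t cp, struct coll_entry out[2])
{
	const struct coll_entry *e;

	e = table_find(colltable, collsize, cp);
	if (e == NULL)
		e = table_find(toy_ducet,
		    sizeof(toy_ducet) / sizeof(toy_ducet[0]), cp);
	if (e != NULL) {
		out[0] = *e;
		return (1);
	}

	/* Implicit weights; 0xFBC0 covers the non-ideograph case only. */
	out[0].cp = cp;
	out[0].weight[0] = (uint16_t)(0xFBC0 + (cp >> 15));
	out[0].weight[1] = 0x0020;
	out[0].weight[2] = 0x0002;
	out[1].cp = cp;
	out[1].weight[0] = (uint16_t)((cp & 0x7FFF) | 0x8000);
	out[1].weight[1] = 0x0000;
	out[1].weight[2] = 0x0000;
	return (2);
}
```

A caller with no tailoring would pass NULL and 0 for the first two
arguments, which is the "table is NULL" case described above.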

I couldn't understand how the alternate should be used, though; it seems
that it can be dropped, since wcscoll() doesn't have any version that
supports tailoring. I've left it in for now, but I'm pretty sure that we
can omit it.

I haven't had time to test wcscoll() more thoroughly (especially with
the files provided by the Unicode Character Database), so that is the
task I will do right now. :-) There are still several ways to improve
the speed of the algorithm, but I feel that the time for that hasn't
come yet. style(9) issues, if any, will also be handled; I'm just too
tired to do it right now.

__ucscoll() just uses __ucsxfrm(), then compares the resulting strings
using wcscmp() (this is the only platform-dependent part of the code; I
was too lazy to write __ucslen(), so I left it as it is). This collation
algorithm supports three levels; the fourth and last one, IIRC, is
usually the character itself if not defined, so I decided to omit it
(especially since I'm not sure how variable weights should be handled).
Any thoughts?
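
The transform-then-compare approach above can be sketched as follows.
This is only an illustration, not the actual __ucsxfrm()/__ucscoll():
toy_weight(), toy_xfrm() and toy_coll() are hypothetical, the weights
cover ASCII letters only, and the level separator is assumed to be 1
rather than 0 so that wcscmp() does not stop at it (which in turn
assumes every real weight in the key is at least 2).

```c
#include <stddef.h>
#include <wchar.h>

#define TOY_LEVELS	3
#define TOY_SEP		1	/* non-zero: wcscmp() must not stop here */

/* Invented weights: primary ignores ASCII case, tertiary records it. */
static wchar_t
toy_weight(wchar_t c, int level)
{
	int upper = (c >= L'A' && c <= L'Z');

	switch (level) {
	case 0:
		return (0x100 + (upper ? c - L'A' + L'a' : c));
	case 1:
		return (0x20);
	default:
		return (upper ? 0x08 : 0x02);
	}
}

/*
 * Build a sort key: all level-1 weights, a separator, all level-2
 * weights, and so on. Like wcsxfrm(), writes at most n wide characters
 * (including the terminator) and returns the full key length.
 */
static size_t
toy_xfrm(wchar_t *dst, const wchar_t *src, size_t n)
{
	size_t len = wcslen(src), pos = 0;

	for (int level = 0; level < TOY_LEVELS; level++) {
		if (level > 0) {
			if (pos + 1 < n)
				dst[pos] = TOY_SEP;
			pos++;
		}
		for (size_t i = 0; i < len; i++) {
			if (pos + 1 < n)
				dst[pos] = toy_weight(src[i], level);
			pos++;
		}
	}
	if (n > 0)
		dst[pos < n ? pos : n - 1] = L'\0';
	return (pos);
}

/* Compare two strings by comparing their sort keys with wcscmp(). */
static int
toy_coll(const wchar_t *a, const wchar_t *b)
{
	wchar_t ka[64], kb[64];	/* fixed buffers: long keys get truncated */

	toy_xfrm(ka, a, 64);
	toy_xfrm(kb, b, 64);
	return (wcscmp(ka, kb));
}
```

Because the primary weights of the whole string come first in the key, a
difference in letters always outranks a difference in case, and the
separator being smaller than any weight makes a prefix sort before the
longer string.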

-- 
With best regards,
Dmitry Selyutin