From owner-soc-status@FreeBSD.ORG  Sat Aug  2 01:23:19 2014
From: Dmitry Selyutin <ghostman.sd@gmail.com>
X-Google-Original-From: Dmitry Selyutin <ghostmansd@gmail.com>
Message-ID: <53DC3D5E.5080909@gmail.com>
Date: Sat, 02 Aug 2014 05:22:38 +0400
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:24.0) Gecko/20100101 Thunderbird/24.6.0
MIME-Version: 1.0
To: Pedro Giffuni <pfg@FreeBSD.org>, David Chisnall <theraven@freebsd.org>,
 soc-status@FreeBSD.org
Subject: Report #5: Unicode support
References: <53DC3C41.7070105@gmail.com>
In-Reply-To: <53DC3C41.7070105@gmail.com>
X-Forwarded-Message-Id: <53DC3C41.7070105@gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit

Sorry, I forgot to change the subject line according to the rules.
I'm sending this message again so that anyone can find it more easily.
Sorry for being annoying.


Hello everyone!

Here is my report on the progress made since my last report. I've
implemented the actual Unicode Collation Algorithm for DUCET (Default
Unicode Collation Element Table). I had to rewrite the entire
implementation: I wasn't satisfied with its quality or with the way I
had organized the source code, so I reverted my code and started again.
My previous implementation was full of hard-coded parts, which made it
hard to reuse anything from it in any other project. Now the entire
implementation is available in include/unicode.h and lib/libc/unicode.
If the macro _UNICODE_SOURCE is defined, wcscoll() uses the new
collation algorithm. struct _xlocale was modified to hold two new
members, colltable and collsize, which are simply passed to
__ucscoll(). If an element is not found in the given table, or the table
is NULL, __ucscoll() tries to find the element in DUCET; if it is not
found there either, __ucscoll() generates a collation element for it.
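
A minimal sketch of the lookup order described above (locale table
first, then DUCET, then a generated weight). It is not the code from
lib/libc/unicode: struct coll_entry, demo_ducet, coll_find() and
coll_weight() are hypothetical names, the two DUCET weights are made up,
and each element is reduced to a single primary weight. The generated
weight follows the implicit-weight formula from UTS #10, using only the
base for non-ideographic code points and packing both parts into one
32-bit value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: one code point mapped to one primary weight. */
struct coll_entry {
	uint32_t code;
	uint32_t weight;
};

/* Tiny stand-in for DUCET; the weights are invented for illustration. */
static const struct coll_entry demo_ducet[] = {
	{ 0x0061, 0x1000 },	/* 'a' */
	{ 0x0062, 0x1010 },	/* 'b' */
};

/* Linear search; a NULL table simply contains nothing. */
static const struct coll_entry *
coll_find(const struct coll_entry *table, size_t size, uint32_t code)
{
	size_t i;

	if (table == NULL)
		return (NULL);
	for (i = 0; i < size; i++)
		if (table[i].code == code)
			return (&table[i]);
	return (NULL);
}

/* Look in the locale table, then in DUCET, then generate a weight. */
static uint32_t
coll_weight(const struct coll_entry *colltable, size_t collsize,
    uint32_t code)
{
	const struct coll_entry *e;

	assert(code <= 0x10FFFF);	/* not a Unicode code point otherwise */
	e = coll_find(colltable, collsize, code);
	if (e == NULL)
		e = coll_find(demo_ducet,
		    sizeof(demo_ducet) / sizeof(demo_ducet[0]), code);
	if (e != NULL)
		return (e->weight);
	/* Implicit weight: AAAA = 0xFBC0 + (cp >> 15), BBBB = low 15 bits. */
	return (((0xFBC0u + (code >> 15)) << 16) |
	    ((code & 0x7FFFu) | 0x8000u));
}
```

With this ordering, a locale table only needs the entries it wants to
override; everything else falls through to DUCET or to the generated
weight.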

I couldn't understand how the "alternate" setting should be used,
though; it seems that it can be dropped, since wcscoll() doesn't have
any version that supports tailoring. I left it in for now, but I'm
pretty sure that we can omit it.

I haven't had time to test wcscoll() more thoroughly (especially with
the files provided by the Unicode Character Database), so this is the
task I will work on right now. :-) There are still several ways to
improve the speed of the algorithm, but I feel that the time for that
hasn't come yet. style(9) issues (if any) will also be handled; I'm just
too tired to do it right now.
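
A rough sketch of how such a test could be driven, assuming a test file
with one string per line written as space-separated hexadecimal code
points (optionally followed by ';' or '#' and a comment), sorted so that
each line must collate no lower than the previous one. The helper names
demo_parse_line() and demo_in_order() are hypothetical, and the code
assumes that wchar_t can hold any code point.

```c
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Parse one line of space-separated hexadecimal code points, stopping
 * at the first character that is not part of a hex number (';', '#' or
 * end of line).  Returns the number of code points stored, not counting
 * the terminator, or (size_t)-1 if the buffer is too small.
 */
static size_t
demo_parse_line(const char *line, wchar_t *out, size_t size)
{
	char *end;
	unsigned long cp;
	size_t n;

	if (size == 0)
		return ((size_t)-1);
	n = 0;
	for (;;) {
		cp = strtoul(line, &end, 16);
		if (end == line)	/* no more hex digits */
			break;
		if (n + 1 >= size)	/* keep room for the terminator */
			return ((size_t)-1);
		out[n++] = (wchar_t)cp;	/* assumes a 32-bit wchar_t */
		line = end;
	}
	out[n] = L'\0';
	return (n);
}

/* Nonzero if cur does not collate before prev in the current locale. */
static int
demo_in_order(const wchar_t *prev, const wchar_t *cur)
{
	return (wcscoll(prev, cur) <= 0);
}
```

A driver would read the file line by line, parse each line, and report
every pair of neighbouring lines for which demo_in_order() returns zero.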

__ucscoll() just uses __ucsxfrm(), then compares the transformed strings
using wcscmp() (this is the only platform-dependent part of the code; I
was too lazy to write __ucslen(), so I left it as it is). The collation
algorithm supports three levels; the last one, IIRC, is usually the
character itself if not defined, so I decided to omit it (especially
since I'm not sure how variable weights should be handled). Any
thoughts?
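
To make the transform-then-compare idea concrete, here is a toy sketch.
It is not __ucsxfrm() or __ucscoll(): demo_xfrm(), demo_coll() and the
weights are hypothetical. The toy key has two levels: the lowercased
character, then a marker for case (in the real algorithm case is a
tertiary difference; it is used here only because it is easy to compute
without a table). The level separator has to be nonzero, because
wcscmp() stops at L'\0'; the sketch assumes no input character is
U+0001 or lower.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

#define DEMO_KEY_MAX	64	/* hypothetical fixed key size */
#define DEMO_LEVEL_SEP	L'\x01'	/* nonzero: wcscmp() stops at L'\0' */

/*
 * Toy two-level transform.  Level 1 is the lowercased character, level
 * 2 is 2 for lower case and 3 for upper case.  Returns the key length,
 * or (size_t)-1 if the key does not fit into the buffer.
 */
static size_t
demo_xfrm(wchar_t *key, size_t size, const wchar_t *src)
{
	size_t i, len, n;

	assert(key != NULL && src != NULL);
	len = wcslen(src);
	if (2 * len + 2 > size)
		return ((size_t)-1);
	n = 0;
	for (i = 0; i < len; i++)
		key[n++] = (wchar_t)towlower((wint_t)src[i]);
	key[n++] = DEMO_LEVEL_SEP;
	for (i = 0; i < len; i++)
		key[n++] = iswupper((wint_t)src[i]) ? L'\x03' : L'\x02';
	key[n] = L'\0';
	return (n);
}

/* Compare two strings by comparing their keys with wcscmp(). */
static int
demo_coll(const wchar_t *a, const wchar_t *b)
{
	wchar_t ka[DEMO_KEY_MAX], kb[DEMO_KEY_MAX];

	if (demo_xfrm(ka, DEMO_KEY_MAX, a) == (size_t)-1 ||
	    demo_xfrm(kb, DEMO_KEY_MAX, b) == (size_t)-1)
		return (wcscmp(a, b));	/* fallback for over-long input */
	return (wcscmp(ka, kb));
}
```

Because the separator is lower than any weight, a string that is a
prefix of another sorts first, and second-level differences only matter
when all first-level weights are equal.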

-- 
With best regards,
Dmitry Selyutin