From: Dmitry Selyutin
Date: Sat, 02 Aug 2014 05:22:38 +0400
Message-ID: <53DC3D5E.5080909@gmail.com>
To: Pedro Giffuni, David Chisnall, soc-status@FreeBSD.org
Subject: Report #5: Unicode support
References: <53DC3C41.7070105@gmail.com>
In-Reply-To: <53DC3C41.7070105@gmail.com>

Sorry, I forgot to change the subject line according to the rules. I am sending this message again so that anyone can find it more easily. Sorry for being annoying.

Hello everyone!

Here is my report on the progress achieved during this period. I've implemented the actual Unicode Collation Algorithm for DUCET (Default Unicode Collation Element Table). I had to rewrite the entire implementation: I wasn't satisfied with its quality or with the way I had organized my source code, so I reverted my code and started again. My previous implementation was full of hard-coded parts, which made it hard to reuse anything from it in any other project.

Now the entire implementation is available in include/unicode.h and lib/libc/unicode. If the macro _UNICODE_SOURCE is defined, wcscoll() uses the new collation algorithm. struct _xlocale was modified to carry two new members, colltable and collsize, which are simply passed to __ucscoll(). If an element is not found in the given table, or the table is NULL, __ucscoll() tries to find the element in DUCET; if it is not found there either, __ucscoll() generates a collation element for it.

I couldn't understand how the alternate should be used, though; it seems that it can be dropped, since wcscoll() doesn't have any version that supports tailoring. I left it in for now, but I'm pretty sure that we can omit it.
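
The lookup order described above (caller's table, then DUCET, then generated weights) can be illustrated with a minimal sketch. This is not the code from lib/libc/unicode: `struct coll_entry`, `toy_ducet`, `find_entry()` and `lookup_weights()` are hypothetical names, the table weights are invented, and only primary weights are modelled. The generated case follows the implicit-weight formula from UTS #10 for code points outside the CJK ranges (base 0xFBC0).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table row: a code point and up to two primary weights. */
struct coll_entry {
	uint32_t code;
	uint16_t primary[2];
	size_t   nprimary;
};

/* Toy stand-in for DUCET, sorted by code point; the weights are invented. */
static const struct coll_entry toy_ducet[] = {
	{ 0x0061, { 0x1000, 0 }, 1 },	/* a */
	{ 0x0062, { 0x1010, 0 }, 1 },	/* b */
};
#define TOY_DUCET_SIZE (sizeof(toy_ducet) / sizeof(toy_ducet[0]))

/* Binary search over a table sorted by code point; a NULL table never matches. */
const struct coll_entry *
find_entry(const struct coll_entry *table, size_t size, uint32_t code)
{
	size_t lo = 0, hi = size;

	if (table == NULL)
		return (NULL);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (table[mid].code == code)
			return (&table[mid]);
		if (table[mid].code < code)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (NULL);
}

/* Lookup order: the caller's table, then the default table, then implicit weights. */
struct coll_entry
lookup_weights(const struct coll_entry *colltable, size_t collsize, uint32_t code)
{
	const struct coll_entry *e;
	struct coll_entry implicit;

	assert(code <= 0x10FFFF);
	e = find_entry(colltable, collsize, code);
	if (e == NULL)
		e = find_entry(toy_ducet, TOY_DUCET_SIZE, code);
	if (e != NULL)
		return (*e);

	/* Not in any table: derive two primary weights from the code point itself. */
	implicit.code = code;
	implicit.primary[0] = (uint16_t)(0xFBC0 + (code >> 15));
	implicit.primary[1] = (uint16_t)((code & 0x7FFF) | 0x8000);
	implicit.nprimary = 2;
	return (implicit);
}
```

With a NULL table, `lookup_weights(NULL, 0, 0x0061)` falls through to the toy default table, while a code point found in neither table (for example U+E000) gets two derived primary weights.
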

I haven't had time to test wcscoll() more thoroughly (especially using the files provided by the Unicode Character Database), so this is the task that I will do right now. :-) There are still several ways to improve the speed of the algorithm, but I feel that the time for that hasn't come yet. style(9) issues, if any, will also be handled; I'm just too tired to do it right now.

__ucscoll() just calls __ucsxfrm() and then compares the resulting strings using wcscmp() (this is the only platform-dependent part of the code; I was too lazy to write __ucslen(), so I left it as it is). This collation algorithm supports three levels; the last, IIRC, is usually the character itself if not defined, so I decided to omit it (especially since I'm not sure how variable weights should be handled).

Any thoughts?

-- 
With best regards,
Dmitry Selyutin
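
The "transform, then compare with wcscmp()" approach mentioned in the report can be sketched as follows. This is only an illustration under stated assumptions, not the actual __ucscoll() or __ucsxfrm(): `toy_xfrm()`, `toy_coll()`, `primary_of()` and `secondary_of()` are hypothetical, the weights are invented, and the toy table treats letter case as its second level purely for brevity (DUCET records case at the third level). The level separator is 1 rather than 0 so that wcscmp() does not stop at it.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

#define KEY_MAX      64	/* fixed key buffer size, enough for this sketch */
#define LEVEL_SEP    1	/* nonzero separator, lower than every weight */
#define FIRST_WEIGHT 2	/* smallest weight the toy table hands out */

/* Invented primary weight: letters compare case-insensitively. */
wchar_t
primary_of(wchar_t c)
{
	if (c >= L'a' && c <= L'z')
		return ((wchar_t)(FIRST_WEIGHT + (c - L'a')));
	if (c >= L'A' && c <= L'Z')
		return ((wchar_t)(FIRST_WEIGHT + (c - L'A')));
	return ((wchar_t)(0x100 + c));
}

/* Invented second-level weight: lowercase sorts before uppercase. */
wchar_t
secondary_of(wchar_t c)
{
	return ((c >= L'A' && c <= L'Z') ? FIRST_WEIGHT + 1 : FIRST_WEIGHT);
}

/* Build a sort key: all primary weights, a separator, all second-level weights. */
void
toy_xfrm(wchar_t *key, const wchar_t *s)
{
	size_t len = wcslen(s), i, k = 0;

	assert(2 * len + 2 <= KEY_MAX);
	for (i = 0; i < len; i++)
		key[k++] = primary_of(s[i]);
	key[k++] = LEVEL_SEP;
	for (i = 0; i < len; i++)
		key[k++] = secondary_of(s[i]);
	key[k] = L'\0';
}

/* Collate by transforming both strings and comparing the keys with wcscmp(). */
int
toy_coll(const wchar_t *a, const wchar_t *b)
{
	wchar_t ka[KEY_MAX], kb[KEY_MAX];

	toy_xfrm(ka, a);
	toy_xfrm(kb, b);
	return (wcscmp(ka, kb));
}
```

Because every primary weight precedes every second-level weight in the key, a plain wcscmp() on the keys only consults the second level when all primary weights are equal. For instance, `toy_coll(L"a", L"B")` is negative, whereas wcscmp() on the raw strings is positive on an ASCII-compatible wchar_t encoding.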