From owner-soc-status@FreeBSD.ORG  Sun Jun 15 22:25:55 2014
Return-Path: <owner-soc-status@FreeBSD.ORG>
Delivered-To: soc-status@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 976768D1;
 Sun, 15 Jun 2014 22:25:55 +0000 (UTC)
Received: from mail-la0-x22f.google.com (mail-la0-x22f.google.com
 [IPv6:2a00:1450:4010:c03::22f])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id BA74028C5;
 Sun, 15 Jun 2014 22:25:54 +0000 (UTC)
Received: by mail-la0-f47.google.com with SMTP id pn19so2517791lab.20
 for <multiple recipients>; Sun, 15 Jun 2014 15:25:51 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=from:message-id:date:user-agent:mime-version:to:subject
 :content-type:content-transfer-encoding;
 bh=UBfy54sjULm2Qr3zhE3/80dlOhtcVn1HjvdAme62oUI=;
 b=HK542bke5auIvS6CSTClw6TTEGhcQ0Fkd45QHIFuUvnqu6IH5TEFjVXZgQMpvwkjJp
 84NGyXpT7coohXgB3Ufn/2y3O+LVgjmKIkGOGKQca5Pnz+xs2hnzRNeG1Cnq810+h0R7
 mLvbJ5bPDj+nIrPsDFRao52gBZ7vFmNp1RBMdj0PyXviGwSXuaaS7EDQdgVq/88QmCag
 cag3VJxeFv7gaGglf7rTk/q07r4WVJDXGmxOAluova6OHPbPobUnnYI0fDe1uBLRG+BN
 zBkkqKlUKmfAaGsNiLD9Aw5W4AIgqC189g2pph2DRpB+h6KTgMHYZ4wSqNSoftMgU53q
 CQow==
X-Received: by 10.112.42.2 with SMTP id j2mr63667lbl.90.1402871151222;
 Sun, 15 Jun 2014 15:25:51 -0700 (PDT)
Received: from openSUSE.linux ([176.100.246.237])
 by mx.google.com with ESMTPSA id ui5sm7093599lbb.32.2014.06.15.15.25.50
 for <multiple recipients>
 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
 Sun, 15 Jun 2014 15:25:50 -0700 (PDT)
From: Dmitry Selyutin <ghostman.sd@gmail.com>
X-Google-Original-From: Dmitry Selyutin <ghostmansd@gmail.com>
Message-ID: <539E1D53.6030103@gmail.com>
Date: Mon, 16 Jun 2014 02:25:23 +0400
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:24.0) Gecko/20100101 Thunderbird/24.5.0
MIME-Version: 1.0
To: soc-status@FreeBSD.org, Pedro Giffuni <pfg@FreeBSD.org>, 
 David Chisnall <theraven@freebsd.org>
Subject: Report #1: Unicode support
X-Enigmail-Version: 1.6
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-BeenThere: soc-status@freebsd.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: Summer of Code Status Reports and Discussion <soc-status.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/soc-status>,
 <mailto:soc-status-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/soc-status/>
List-Post: <mailto:soc-status@freebsd.org>
List-Help: <mailto:soc-status-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/soc-status>,
 <mailto:soc-status-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 15 Jun 2014 22:25:55 -0000

This is a report on progress in improving Unicode support in FreeBSD.

During the early period, I studied Unicode Technical Standard #10,
which describes the Unicode Collation Algorithm.
I tried to use the patch proposed by Konrad Jankowski, but the attempt
was rather unsuccessful, since the patch predates the xlocale support
implemented by David Chisnall. The initial plan was to port collation
support from Apple's libc, but we rejected this idea because of poor
code quality. Moreover, using Apple's libc would have broken the entire
xlocale support.

Having lost a significant amount of time on Apple's libc and Konrad's
patch, we decided to implement collation from scratch according to the
Unicode Collation Algorithm.

One of the requirements for collation is that the string be normalized
before the actual collation is performed. The C standard library lacks
such a feature, so I started to implement it. This work is almost
finished; FreeBSD's libc will have __strnorm_l(), __strnorm() and
__wcsnorm() functions. They have man pages and can already be used to
normalize ASCII, Latin-1 and Hangul strings. The last part is to
implement normalization of the remaining characters, which is usually
done with a database lookup (Unicode data is typically stored in
arrays, where each array covers a single Unicode plane).
These functions are designed so that they could later be included in
the POSIX standard under the names strnorm(), strnorm_l() and
wcsnorm(). If the _LIBC_UNICODE_ADDENDA macro is defined, they are
already available under these names.
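
To illustrate why Hangul can be normalized without a database lookup:
precomposed Hangul syllables decompose arithmetically, following the
Hangul syllable algorithm in the Unicode Standard. Below is a minimal
sketch in C. The function name hangul_decompose() and the macro names
are hypothetical, chosen for illustration only; this is not the code of
__strnorm_l() or its relatives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constants of the Hangul syllable algorithm (Unicode Standard). */
#define HANGUL_SBASE	0xAC00u	/* first precomposed syllable */
#define HANGUL_LBASE	0x1100u	/* first leading consonant */
#define HANGUL_VBASE	0x1161u	/* first vowel */
#define HANGUL_TBASE	0x11A7u	/* one before the first trailing consonant */
#define HANGUL_VCOUNT	21u
#define HANGUL_TCOUNT	28u
#define HANGUL_NCOUNT	(HANGUL_VCOUNT * HANGUL_TCOUNT)	/* 588 */
#define HANGUL_SCOUNT	(19u * HANGUL_NCOUNT)		/* 11172 */

/*
 * Hypothetical helper: decompose one precomposed Hangul syllable into
 * its conjoining jamo. Returns the number of code points written to
 * out (2 or 3), or 0 if cp is not a precomposed Hangul syllable.
 */
size_t
hangul_decompose(uint32_t cp, uint32_t out[3])
{
	uint32_t sindex, tindex;

	assert(out != NULL);
	if (cp < HANGUL_SBASE || cp >= HANGUL_SBASE + HANGUL_SCOUNT)
		return (0);

	sindex = cp - HANGUL_SBASE;
	out[0] = HANGUL_LBASE + sindex / HANGUL_NCOUNT;
	out[1] = HANGUL_VBASE + (sindex % HANGUL_NCOUNT) / HANGUL_TCOUNT;

	/* A trailing index of 0 means the syllable has no final consonant. */
	tindex = sindex % HANGUL_TCOUNT;
	if (tindex == 0)
		return (2);
	out[2] = HANGUL_TBASE + tindex;
	return (3);
}
```

For example, U+D4DB decomposes to U+1111 U+1171 U+11B6, and U+AC00
decomposes to U+1100 U+1161. Characters outside the Hangul syllable
block have no such arithmetic rule, which is why they need the table
lookup mentioned above.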

The Unicode Standard is a bit difficult to follow: it sometimes focuses
on details while paying little attention to the main idea. However, I'm
planning to finish the normalization algorithm in a day or two and then
implement the collation algorithm.

We lost a significant amount of time trying to port Konrad's patch and
Apple's libc collation algorithm. Now we focus on the Unicode Standard
directly, which seems to be a better decision. The first step is to
implement the collation algorithm in the canonical way, then to focus
on improving and testing it.

I'd also like to thank my mentors, Pedro and David, who were (and are)
so kind as to give me advice throughout my work. It's particularly
valuable since our task is not as simple as it may seem. :-)


-- 
With best regards,
Dmitry Selyutin