Message-ID: <539E42BB.4060801@freebsd.org>
Date: Sun, 15 Jun 2014 20:04:59 -0500
From: Pedro Giffuni
To: Dmitry Selyutin, soc-status@FreeBSD.org
Subject: Re: Report #1: Unicode support
In-Reply-To: <539E1D53.6030103@gmail.com>

Hi Dmitry and list;

On 06/15/14 17:25, Dmitry Selyutin wrote:
> This is a report on progress in improving Unicode support in FreeBSD.
>
> During the early period, I've been studying Unicode Technical Standard,
> which describes how to implement Unicode Collation Algorithm.
> I've tried to use the patch proposed by Konrad Jankowski, but it was a
> rather unsuccessful attempt, since this patch predates xlocale support,
> implemented by David Chisnall. The initial plan was to port collation
> support from Apple's libc library, but we rejected this idea because of
> poor code quality. Moreover, if we decided to use Apple's libc, we would
> have broken the entire xlocale support.
>
> Having lost a significant amount of time on Apple's libc and Konrad's
> patch, we've decided to implement collation from scratch according to
> Unicode Normalization Algorithm.

I wouldn't call it exactly a "waste of time". I think it was important to
rescue the work done by the previous Summer of Code, plus Dmitry had to
learn his way around FreeBSD's libc. Of course the code has changed a lot
and the approach was not really successful, but it was useful nevertheless.
An important part of this work will be testing, and Konrad did set up a
set of tests.

> One of the requirements for collation is the normalization of the string
> before performing actual collation. C Standard Library lacks such
> feature, so I started to implement it. This work is almost finished; the
> FreeBSD's libc will have __strnorm_l(), __strnorm() and __wcsnorm()
> functions. They have man pages and can be already used to normalize
> ASCII, Latin-1 and Hangul strings. The last part is to implement
> normalization of the other characters, which is usually done using
> database lookup (usually Unicode data is stored in arrays, where each
> array denotes single Multilingual Plane).
> These functions are designed in the way that may allow to include them
> in POSIX standard later under strnorm(), strnorm_l() and wcsnorm()
> names. If _LIBC_UNICODE_ADDENDA macro is defined, they will be already
> available under these names.
>
> Unicode Standard is a bit difficult: sometimes Unicode Standard focuses
> on details, paying little attention to the main part.
> However, I'm planning to finish the normalization algorithm in a day or
> two and then implement a collation algorithm.
>
> We lost significant time trying to port Konrad's patch and Apple's
> libc collation algorithm. Now we focus on the Unicode Standard directly;
> that seems to be a better decision. The first step is to implement the
> collation algorithm in the canonical way, then to focus on its
> improvements and testing.
>
> I'd also like to thank my mentors, Pedro and David, who were (and are)
> so kind as to give me advice throughout my work. It's particularly
> valuable since our task is not as simple as it may seem to be. :-)
>

It is indeed a difficult task and Dmitry will be very busy these days ;).

Pedro.
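
For reference, the Hangul case mentioned in the report is the one part of
canonical decomposition that needs no database lookup: the Unicode Standard
(section 3.12) defines it arithmetically. A minimal sketch in C follows. It
is not taken from Dmitry's code, and hangul_decompose() is a hypothetical
name used only for illustration; it is not one of the __strnorm*() functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Constants from the Unicode Standard, section 3.12. */
#define HANGUL_SBASE	0xAC00u	/* first precomposed syllable */
#define HANGUL_LBASE	0x1100u	/* first leading consonant */
#define HANGUL_VBASE	0x1161u	/* first vowel */
#define HANGUL_TBASE	0x11A7u	/* one before the first trailing consonant */
#define HANGUL_LCOUNT	19u
#define HANGUL_VCOUNT	21u
#define HANGUL_TCOUNT	28u
#define HANGUL_NCOUNT	(HANGUL_VCOUNT * HANGUL_TCOUNT)	/* 588 */
#define HANGUL_SCOUNT	(HANGUL_LCOUNT * HANGUL_NCOUNT)	/* 11172 */

/*
 * Hypothetical helper (assumption, not a libc function).
 * Writes the canonical decomposition of the precomposed Hangul syllable
 * cp into out[] and returns the number of code points written (2 or 3).
 * Returns 0 and leaves out[] untouched if cp is not such a syllable.
 */
size_t
hangul_decompose(uint32_t cp, uint32_t out[3])
{
	uint32_t sindex, tindex;

	if (cp < HANGUL_SBASE || cp >= HANGUL_SBASE + HANGUL_SCOUNT)
		return (0);

	sindex = cp - HANGUL_SBASE;
	out[0] = HANGUL_LBASE + sindex / HANGUL_NCOUNT;
	out[1] = HANGUL_VBASE + (sindex % HANGUL_NCOUNT) / HANGUL_TCOUNT;

	/* A trailing-consonant index of 0 means the syllable has none. */
	tindex = sindex % HANGUL_TCOUNT;
	if (tindex == 0)
		return (2);

	out[2] = HANGUL_TBASE + tindex;
	return (3);
}
```

Called with U+D4DB, this yields U+1111 U+1171 U+11B6, which is the example
given in the Unicode Standard. Any code point outside U+AC00..U+D7A3 returns
0 and would have to go through the table lookup described in the report.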