From: "G. Adam Stanislav"
To: freebsd-hackers@freebsd.org
Date: Tue, 11 May 1999 15:16:12 -0500
Subject: 10646
Message-ID: <19990511151612.B271@whizkidtech.net>

Hi, all,

Just to keep you posted on the progress of the wctype routines:

I put a program called 10646 on the web page. It reads the Unicode data file and produces a file which can then be fed to mklocale(1) to produce a Unicode locale. If renamed to (or linked as) ees, it produces a much smaller file, containing only the Extended European Subset of Unicode/ISO 10646.

Because ISO 10646 explicitly permits the use of subsets of the standard, the output of the 10646/ees utility contains comments at the beginning and the end of each block of characters. The comment is always either /* BeginBlockName */ or /* EndBlockName */, where "BlockName" is whatever Unicode calls that block (but with blanks cut out). The idea is to make it possible to run the file through sed and delete any unwanted blocks before inputting it to mklocale.

I would appreciate comments and suggestions.

The file is 10646.tar.gz, downloadable (or fetchable) from http://www.whizkidtech.net/i18n/wc/.

You will also need the current version of the Unicode database, which is at ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData-Latest.txt - be warned, it is a fairly big file.
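
To illustrate the sed step mentioned above, here is a minimal sketch of deleting one block by its marker comments. It assumes each marker sits on a line of its own and is spelled exactly like /* BeginCyrillic */ and /* EndCyrillic */. The helper name drop_block and the block name Cyrillic are only examples, not part of the 10646 utility.

```shell
# drop_block NAME (hypothetical helper): read text on stdin and delete every
# line from the "/* BeginNAME */" marker through the "/* EndNAME */" marker,
# markers included. Everything outside that range is passed through unchanged.
drop_block() {
    sed "/\/\* Begin$1 \*\//,/\/\* End$1 \*\//d"
}

# Example with made-up input lines (the real 10646 output will look different):
printf '%s\n' \
    'line before' \
    '/* BeginCyrillic */' \
    'line inside the block' \
    '/* EndCyrillic */' \
    'line after' | drop_block Cyrillic
```

Under those assumptions, the filter could sit between the two programs, for example: 10646 < UnicodeData-Latest.txt | drop_block Cyrillic | mklocale > LC_CTYPE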

If you want, you can do it all in one step:

    10646 < UnicodeData-Latest.txt | mklocale > LC_CTYPE

By the way (as I mention on the page), our utf2 program is misnamed. There is no such thing as UTF-2. There are UTF-7, UTF-8 and UTF-16 - the number meaning the number of BITS in the encoding. There are also UCS-2 and UCS-4 - the number meaning the number of BYTES (or "octets"). We should really rename utf2 to utf8, because that's what it is. UTF-2 implies a 2-bit encoding...

Adam