From owner-freebsd-hackers  Tue May 11 13:17: 3 1999
Delivered-To: freebsd-hackers@freebsd.org
Received: from whizkidtech.net (r42.bfm.org [208.18.213.138])
	by hub.freebsd.org (Postfix) with ESMTP id 7AFA314BFD
	for <freebsd-hackers@freebsd.org>; Tue, 11 May 1999 13:16:48 -0700 (PDT)
	(envelope-from adam@whizkidtech.net)
Received: (from adam@localhost)
	by whizkidtech.net (8.9.2/8.9.2) id PAA00296
	for freebsd-hackers@freebsd.org; Tue, 11 May 1999 15:16:43 -0500 (CDT)
	(envelope-from adam)
Date: Tue, 11 May 1999 15:16:12 -0500
From: "G. Adam Stanislav" <adam@whizkidtech.net>
To: freebsd-hackers@freebsd.org
Subject: 10646
Message-ID: <19990511151612.B271@whizkidtech.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.95.3i
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Hi, all,

Just to keep you posted on the progress of the wctype routines:

I put a program called 10646 on the web page. It will read the Unicode data
file and produce a file which then can be fed to mklocale(1) to produce
a Unicode locale.

If renamed (or linked as) ees, it will produce a much smaller file, containing
only the Extended European Subset of Unicode/ISO 10646.
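
For anyone unfamiliar with the "renamed (or linked as)" trick, here is a minimal
sketch of how one file can act differently depending on the name it is run
under. The script name demo10646 and the messages it prints are made up for
illustration; this is a stand-in, not the real 10646 program, which may pick
its mode differently.

```shell
# Hypothetical stand-in script, NOT the real 10646 program: it only reports
# which mode the invoked name would select.
cat > demo10646 <<'EOF'
#!/bin/sh
case "$(basename "$0")" in
    ees) echo "subset: Extended European Subset" ;;
    *)   echo "full: all of Unicode/ISO 10646" ;;
esac
EOF
chmod +x demo10646

# Give the same file a second name via a symbolic link.
ln -sf demo10646 ees

./demo10646   # invoked under its original name
./ees         # invoked under the linked name
```

With the real program, the equivalent step would presumably be something like
"ln -s 10646 ees", after which ees is used exactly as 10646 is.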

Because ISO 10646 explicitly permits the use of subsets of the standard,
the output of the 10646/ees utility contains comments at the beginning and the
end of each block of characters. The comment is always either
/* BeginBlockName */ or /* EndBlockName */, where "BlockName" is whatever
Unicode calls that block (but with blanks removed). The idea is to make it
possible to run the file through sed and delete any unwanted blocks before
feeding it to mklocale.
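
As a rough sketch of that sed step, the following deletes one block, markers
included, using a sed address range. The file names sample.src and
filtered.src and the placeholder lines between the markers are assumptions for
illustration only (they are not real mklocale source), and the exact spacing
inside the comments is taken from the description above, so adjust the
patterns if the actual output differs.

```shell
# Stand-in for 10646 output. Only the Begin/End comments follow the format
# described above; the lines between them are placeholders.
cat > sample.src <<'EOF'
/* BeginBasicLatin */
placeholder for Basic Latin entries
/* EndBasicLatin */
/* BeginCyrillic */
placeholder for Cyrillic entries
/* EndCyrillic */
/* BeginArmenian */
placeholder for Armenian entries
/* EndArmenian */
EOF

# Delete everything from the Begin marker through the End marker of the
# Cyrillic block. The slashes and asterisks of the comment are escaped
# because they are special in a sed pattern.
sed '/\/\* BeginCyrillic \*\//,/\/\* EndCyrillic \*\//d' sample.src > filtered.src

cat filtered.src
```

To drop several blocks at once, repeat the range with additional -e options,
one per block.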

I would appreciate comments and suggestions. The file is 10646.tar.gz,
downloadable (or fetchable) from http://www.whizkidtech.net/i18n/wc/.

You will also need the current version of the Unicode database, which is at
ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData-Latest.txt - be warned, it
is a fairly big file.

If you want, you can do it all in one step:

	10646 < UnicodeData-Latest.txt | mklocale > LC_CTYPE

By the way (as I mention on the page), our utf2 program is misnamed. There is
no such thing as UTF-2. There are UTF-7, UTF-8, and UTF-16, where the number is
the number of BITS in each unit of the encoding. There are also UCS-2 and
UCS-4, where the number is the number of BYTES (or "octets").

We should really rename utf2 to utf8, because that's what it is. UTF-2 implies
2-bit encoding...

Adam

