From owner-freebsd-hackers  Tue Jan 20 22:07:23 1998
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id WAA16506
          for hackers-outgoing; Tue, 20 Jan 1998 22:07:23 -0800 (PST)
          (envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from phobos.illtel.denver.co.us (abelits@phobos.illtel.denver.co.us [207.33.75.1])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id WAA16493
          for <hackers@FreeBSD.ORG>; Tue, 20 Jan 1998 22:07:13 -0800 (PST)
          (envelope-from abelits@phobos.illtel.denver.co.us)
Received: from localhost (abelits@localhost) by phobos.illtel.denver.co.us (8.8.8/8.6.9) with SMTP id WAA10137; Tue, 20 Jan 1998 22:10:48 -0800
Date: Tue, 20 Jan 1998 22:10:47 -0800 (PST)
From: Alex Belits <abelits@phobos.illtel.denver.co.us>
To: "Louis A. Mamakos" <louie@TransSys.COM>
cc: Terry Lambert <tlambert@primenet.com>, daniel_sobral@voga.com.br,
        hackers@FreeBSD.ORG
Subject: Re: Wide characters on tcp connections 
In-Reply-To: <199801210420.XAA23356@whizzo.TransSys.COM>
Message-ID: <Pine.LNX.3.96.980120204200.9621A-100000@phobos.illtel.denver.co.us>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk

On Tue, 20 Jan 1998, Louis A. Mamakos wrote:

> If I had to choose, I'd use UTF-8 encodings in big-endian byte order.  This
> is, I believe, what the IETF has chosen when dealing with multi-byte
> characters which are embedded within other protocols.

  The IETF "chose" UTF-8 (and Unicode) after every nation, with or
without a multibyte alphabet, had rejected Unicode as a standard, while a
certain well-known company decided to build an "internationalization
standard" on Unicode (and still failed to implement it properly even in
its own equally well-known OS).

  UTF-8 got a lot of support in Western Europe and the US; however, it
should be mentioned that when converted to Unicode and then to UTF-8,
ASCII text is the same as before encoding, and iso8859-1 (Latin1) has a
trivial back conversion, but other languages look, umm... too
unstructured to their native speakers, to say the least.
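
  A rough sketch of the byte-level side of this, in Python. The sample
strings are arbitrary illustrations rather than anything from this
thread, and KOI8-R is used only as one example of a local charset.

```python
# Sample strings are arbitrary; non-ASCII letters are written as escapes.
ascii_text = "plain text"
latin1_text = "caf\u00e9"                                # Latin1 word with e-acute
russian_text = "\u043f\u0440\u0438\u0432\u0435\u0442"    # six Cyrillic letters

# ASCII text is byte-for-byte identical after UTF-8 encoding.
ascii_unchanged = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Each Latin1 byte value equals the Unicode code point of its character,
# which is why converting back and forth is trivial.
latin1_bytes = latin1_text.encode("latin-1")
latin1_trivial = list(latin1_bytes) == [ord(c) for c in latin1_text]

# A local Cyrillic charset stores one byte per letter; UTF-8 needs two.
koi8_len = len(russian_text.encode("koi8_r"))
utf8_len = len(russian_text.encode("utf-8"))

print(ascii_unchanged, latin1_trivial, koi8_len, utf8_len)
```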

  There are a number of issues of a linguistic, technical and political
nature that were ignored when Unicode was designed; in other words,
everything done in local standards was thrown away, and all the
characters known at the time (and considered worthy of inclusion) were
simply listed in some order resembling their alphabets. The UTF-8
encoding is blatantly US/European-centric -- that can be justified
(it's supposed to be used for everything, and most of "everything" is
ASCII text), but it's ridiculous for other languages, and I haven't even
started talking about regexps and text processing over variable-length
characters, or the constant encoding/decoding into fixed-length Unicode
that UTF-8 makes necessary for everything but the "word processing" that
some people confuse with the use of computers.
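
  To make the variable-length point concrete, here is a minimal Python
sketch. The helper name utf8_char_lengths and the sample string are
made up for illustration, and the sketch assumes well-formed input and
sequences of at most four bytes.

```python
def utf8_char_lengths(data):
    """Return the byte length of each character, read from its lead byte.

    Hypothetical helper: assumes well-formed UTF-8, does no validation
    of continuation bytes.
    """
    lengths = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:               # 0xxxxxxx: one byte (ASCII)
            n = 1
        elif lead >> 5 == 0b110:      # 110xxxxx: two bytes
            n = 2
        elif lead >> 4 == 0b1110:     # 1110xxxx: three bytes
            n = 3
        elif lead >> 3 == 0b11110:    # 11110xxx: four bytes
            n = 4
        else:
            raise ValueError("unexpected lead byte at offset %d" % i)
        lengths.append(n)
        i += n
    return lengths


# One ASCII letter, one Cyrillic letter, one euro sign.
sample = "a\u043a\u20ac"
sample_utf8 = sample.encode("utf-8")
lengths = utf8_char_lengths(sample_utf8)

# Decoding to a fixed-width form makes "the third character" a plain
# offset calculation instead of a scan from the start of the string.
fixed = sample.encode("utf-32-be")
third = fixed[8:12].decode("utf-32-be")

print(lengths, len(sample_utf8), len(fixed))
```

  Finding the Nth character in the UTF-8 bytes means walking every lead
byte before it, while the fixed-width copy can be indexed directly at
the cost of the conversion and the extra space.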

  Currently no one uses Unicode for anything serious in non-European
languages, and since MIME has no problems with charset labeling, people
continue to use local charsets that reflect the local language's
structure far better than Unicode does. However, it looks like this "yet
another Esperanto" is going to be the next way of making more money by
selling "new" software with a "standards compliance" sticker, without
actually providing any language support and without compatibility with
anything currently in use.

--
Alex