From owner-freebsd-hackers  Tue Apr  4 12: 7:58 2000
Delivered-To: freebsd-hackers@freebsd.org
Received: from phobos.illtel.denver.co.us (dsl-206.169.4.82.wenet.com [206.169.4.82])
	by hub.freebsd.org (Postfix) with ESMTP id 82C7137B56E
	for <freebsd-hackers@FreeBSD.ORG>; Tue,  4 Apr 2000 12:07:54 -0700 (PDT)
	(envelope-from abelits@phobos.illtel.denver.co.us)
Received: from localhost (abelits@localhost)
	by phobos.illtel.denver.co.us (8.9.3/8.9.3) with ESMTP id MAA09937;
	Tue, 4 Apr 2000 12:08:39 -0700
Date: Tue, 4 Apr 2000 12:08:39 -0700 (PDT)
From: Alex Belits <abelits@phobos.illtel.denver.co.us>
To: "G. Adam Stanislav" <adam@whizkidtech.net>
Cc: freebsd-hackers@FreeBSD.ORG
Subject: Re: Unicode on FreeBSD
In-Reply-To: <3.0.6.32.20000404100544.00882db0@mail85.pair.com>
Message-ID: <Pine.LNX.4.20.0004041104290.6811-100000@phobos.illtel.denver.co.us>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Tue, 4 Apr 2000, G. Adam Stanislav wrote:

> At 22:51 03-04-2000 -0700, Alex Belits wrote:
> >  I agree that Unicode created a good list of glyphs, and it can be
> >useful for fonts and conversion tables, but it's completely inappropriate
> >as the base of format used in real-life applications for storage and
> >communications.
> 
> Oh, I think it's great for communications. I design web sites. It is good
> to have a single character representation supported by Internet standards.
> Saves a lot of work. Before UTF-8 became widely accepted, a typical Slovak
> web page started by a menu of choices of which encoding your browser
> supported. You had to have 3 - 4 versions of each page. A major pain! Now
> you only need one.

  This is a problem, however Unicode is not the only solution -- actually
it's the worst of all solutions -- it solves a simple problem only to
create a lot of complex ones.

> 
> Or even when designing English pages in a typographically correct way
> (opening and closing quotes, and things like that), it was a pain before
> UTF-8 because while ISO-8859-1 is the assumed default, Microsoft, in its
> infinite wisdom created a slight modification of ISO-8859-1 which they
> called ANSI, and which the uninitiated commonly believed to be the same as
> ISO-8859-1. As a result, there are a myriad of web pages out there that use
> the Microsoft encoding, and there are those that use true ISO-8859-1. So
> many browsers assume that you are using the MS "standard." It's a real mess.

  Misrepresentation of one popular encoding in software of one company
doesn't mean that it should be replaced with another, much more complex
one, by everyone else.

> 
> So, in all my recent pages I use UTF-8, and the problem is solved.
> 
> >> Unicode Consortium
> >> has no power to force Unicode on anyone. It just happens that it was widely
> >> accepted.
> >
> >  So far only one company has actually "accepted" it -- Microsoft. Everyone
> >else (except Java/Sun) just happened to be dependent on them. Java and
> >Plan9 are special cases because both are essentially endless storages of
> >ivory-tower design idiosyncrasy and arbitrary decisions made by a handful of
> >people.
> 
> I was not talking about companies. I was talking about people with genuine
> i18n needs.

  People with genuine i18n needs such as linguists, or people with genuine
i18n needs such as non-English users? Linguists don't see Unicode as being
sufficient, and everyone else uses local encodings/charsets. I agree that
local encodings are very limiting in the form they exist in now, however
they, not Unicode, are the standards used in real life. If some
encapsulation format (not as limited as ISO 2022 and not as restrictive as
MIME multipart) were created to support multiple
charsets/encodings/languages in one document in labeled chunks, the same
problem would be solved with minimal changes in existing software and
minimal document conversion effort. This solution would be far superior to
Unicode, and even for "web" use it could be made compatible with the
charset support in existing browsers.
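
  A minimal sketch of what such labeled chunks could look like, in
Python. The Chunk structure, its field names and the render() helper are
hypothetical illustrations, not an existing format or library. Each chunk
keeps its bytes in the charset it was written in, and text without a
label is passed through without any guessing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    """One labeled piece of a document (hypothetical structure)."""
    data: bytes                      # raw bytes, stored exactly as received
    charset: Optional[str] = None    # e.g. "koi8_r"; None means "unknown"
    language: Optional[str] = None   # e.g. "ru"; None means "unknown"


def render(chunks: List[Chunk]) -> List[object]:
    """Decode labeled chunks for display; pass unlabeled bytes through."""
    out = []
    for chunk in chunks:
        if chunk.charset is None:
            out.append(chunk.data)   # no label: no guessing, leave it alone
        else:
            out.append(chunk.data.decode(chunk.charset))
    return out


document = [
    Chunk(b"Hello, ", charset="ascii", language="en"),
    Chunk("мир".encode("koi8_r"), charset="koi8_r", language="ru"),
    Chunk(b"\xff\xfe\x00"),          # unattributed bytes stay untouched
]
rendered = render(document)
```

  The stored document is never converted; decoding happens only at the
point of display, and only for chunks that carry a charset label.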

[skipped without much disagreement]

> Again, it's not about "adoption" of Unicode, it's about supporting Unicode
> for those who need it. Going Unicode-only would not be wise, but I don't
> see anyone here suggesting that.

  After looking at what happened to IETF documents, XML and Perl I can
only come to the conclusion that Unicode, once included in some system that
didn't have a multiple-charset document support infrastructure before,
starts requiring more and more sacrifices to be supported decently, until
the support of other encodings becomes impossible or significantly more
difficult than the support of Unicode. I am not against the support of any
charset, encoding or language used in the real world, Unicode included.
However, after seeing how Unicode "support" efforts quickly turn into
"adoption" all across the library/protocol/application layers, I
believe that only if some decent charset/encoding/language labeling
infrastructure is developed will it be possible to contain
charsets and prevent their "leaking" to the application level.

  Leaking of ASCII (the infamous 7-bit restriction that was present for no
understandable reason in a lot of protocols and utilities) was a painful
enough experience already, and it looks like it's fixed in most software
by now. Leaking of local charsets (especially ISO 8859-1 and its
modifications) was bad, however it was mostly prevented by locale support
(even though it is clumsy and unusable in multilingual documents). Leaking
of Unicode and UTF-8 can start something even worse because it's already
evident that many applications written to support the UTF-8 character
format have a hardcoded assumption of this format in their I/O and parsing
routines that otherwise are supposed to be either charset-blind or to use
external, charset-dependent routines to determine character boundaries.
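
  To illustrate the difference, here is a small Python sketch of a
boundary routine that takes the charset as a parameter instead of
hardcoding UTF-8. The function name char_boundaries() is made up for
this example; it relies only on the incremental decoders from the
standard codecs module, and it ignores complications such as stateful
encodings.

```python
import codecs
from typing import List


def char_boundaries(data: bytes, charset: str) -> List[int]:
    """Return byte offsets at which a character ends, per the given charset."""
    decoder = codecs.getincrementaldecoder(charset)()
    ends = []
    for offset in range(len(data)):
        # Feed one byte at a time; the decoder emits text only when a
        # complete character has been seen.
        if decoder.decode(data[offset:offset + 1]):
            ends.append(offset + 1)
    return ends


data = b"\xd0\x96a"
utf8_ends = char_boundaries(data, "utf-8")     # first character is 2 bytes
koi8_ends = char_boundaries(data, "koi8_r")    # every byte is a character
```

  The same three bytes have two character boundaries under one label and
three under the other, which is why a parser that assumes one format
cannot be reused safely for anything else.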

  I don't want to be misunderstood as an opponent of all things Unicode
-- as I have said, its support is useful. However I oppose:

1. The point of view that Unicode is the only possible or the best
possible way to handle multilingual documents.

2. The point of view that support of Unicode should be made at the expense
of compatibility with everything else, or by the introduction of some
unsafe guesswork such as applying a UTF-8 validity check to determine
if a chunk of data is in UTF-8 or not.
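
  A short Python sketch of why the validity check in point 2 is unsafe.
The helper looks_like_utf8() is a hypothetical stand-in for that kind of
guesswork, and the sample is simply a short Russian word stored in
KOI8-R whose bytes also happen to form a valid UTF-8 sequence.

```python
def looks_like_utf8(data: bytes) -> bool:
    """The guesswork in question: 'it decodes as UTF-8, so it is UTF-8'."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A two-letter Russian word, stored by its author in KOI8-R.
koi8_text = "её".encode("koi8_r")
guess = looks_like_utf8(koi8_text)         # the check accepts it as UTF-8
as_written = koi8_text.decode("koi8_r")    # what the author meant
as_guessed = koi8_text.decode("utf-8")     # what the guessing reader shows
```

  The check passes, yet the decoded result is not what was written. Only
an explicit label can tell the two cases apart.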

  I see the "support" or "adoption" of Unicode as a threat only if it is
based on those ideas, and I think that the development of a
charset/encoding/language labeling or encapsulation format and handling
routines, even if it is not "blessed" by IETF or TOG, will provide a
means of safe, compatible and relatively easy handling of multilingual
documents, including ones that are completely or partly in Unicode.

  Unicode documents themselves suffer from the lack of language-labeling
information, and there is a (currently unused, however "standardized") way
to label _language_ (not charset, subset or encoding) within Unicode
text. It's not used because it contradicts the idea of "easy",
completely stateless and non-encapsulated Unicode text, so supporting it is
almost completely impossible in existing Unicode support
infrastructures. Instead language labeling is pushed up into XML (or other
format) parsers and applications, thus making it application-dependent and
ultimately unreliable. I think that if some more reasonable labeling
(encapsulation, metadata or attribute handling -- whatever it
ends up being called) system is created for text "documents", it can solve
this problem by just assigning charset, encoding and language to pieces of
text, and leaving "unknown" or unattributed text alone, not allowing
language-specific or charset-dependent routines to touch it. In a system
like this Unicode will be labeled as Unicode, UTF-8 will be labeled as
UTF-8, and Russian language will be labeled as Russian language
independently, thus making it possible to build a language support
infrastructure that in most places can use existing formats safely, as
languages will be clearly marked where known, no guesswork will be applied,
and no conversion to Unicode (or anything else) will be required.
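
  As a rough Python sketch of language-specific routines driven purely by
labels: the handler table, the shout() function and the use of Turkish
uppercasing as the example are all assumptions made for illustration.
Text is shown as Python strings for brevity; the same dispatch would
apply to labeled byte chunks.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (text, language label or None); the labels are hypothetical examples.
Piece = Tuple[str, Optional[str]]


def upper_turkish(text: str) -> str:
    """Turkish uppercasing: dotted i maps to dotted capital I (U+0130)."""
    return text.replace("i", "\u0130").upper()


HANDLERS: Dict[str, Callable[[str], str]] = {
    "en": str.upper,
    "tr": upper_turkish,
}


def shout(pieces: List[Piece]) -> List[str]:
    """Uppercase each piece with the routine for its labeled language."""
    out = []
    for text, language in pieces:
        handler = HANDLERS.get(language) if language else None
        # Unlabeled or unsupported language: leave the text untouched.
        out.append(handler(text) if handler else text)
    return out


result = shout([("title", "en"), ("bilgi", "tr"), ("misc", None)])
```

  The same letter gets a different treatment depending on the language
label, and the piece without a label is returned exactly as it came in.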

-- 
Alex

----------------------------------------------------------------------
 Excellent.. now give users the option to cut your hair you hippie!
                                                  -- Anonymous Coward

