Date:      Tue, 4 Apr 2000 11:03:58 -0700 (PDT)
From:      Alex Belits <abelits@phobos.illtel.denver.co.us>
To:        "G. Adam Stanislav" <adam@whizkidtech.net>
Cc:        MikeM <mike_bsdlists@yahoo.com>, freebsd-hackers@FreeBSD.ORG
Subject:   Re: Unicode on FreeBSD
Message-ID:  <Pine.LNX.4.20.0004041102320.6811-100000@phobos.illtel.denver.co.us>


On Mon, 3 Apr 2000, G. Adam Stanislav wrote:

> At 20:59 03-04-2000 -0700, Alex Belits wrote:
> >  I feel perfectly fine with "multilingual" documents that contain English
> >and Russian text without Unicode.
> 
> Those are bilingual, not multilingual. I once had to create a document in
> English, Slovak, and Sanskrit (using Devanagari alphabet). There is only
> one standard that makes it possible: Unicode. Too bad UTF-8 did not exist
> at the time, and I had to use graphics.

  There is another format that does the same thing better -- MIME
multipart documents. Too bad development in that direction stopped
after a certain stupid decision made by some people at the IETF.

> >> Everyone who wants to
> >> follow a single international standard as opposed to a slew of mutually
> >> exclusive local standards. Anyone who thinks globally.
> 
> >  "Globally" in this case means following self-proclaimed unificators from
> >Unicode Consortium.
> 
> I don't know what you mean by "unificators." Why self proclaimed? Those
> were people with a need for which they found a solution.

  With a need to find an excuse to break backward compatibility with
everything and sell more software -- just like the ITU.

  I agree that Unicode created a good list of glyphs, and it can be
useful for fonts and conversion tables, but it's completely inappropriate
as the basis of a format used in real-life applications for storage and
communications.
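
  To be concrete about the legitimate use: a conversion table is just a
mapping from the bytes of an existing charset to Unicode values, e.g.
the fragment below (a sketch only -- three KOI8-R entries quoted from
memory, nowhere near a complete table):

/* Sketch: fragment of an 8-bit-charset -> Unicode conversion table.
 * KOI8-R values from memory -- illustrative, not normative. */
#include <stdio.h>

struct cvt { unsigned char byte; unsigned int ucs; };

static const struct cvt koi8r_frag[] = {
    { 0xC1, 0x0430 },   /* CYRILLIC SMALL LETTER A  */
    { 0xC2, 0x0431 },   /* CYRILLIC SMALL LETTER BE */
    { 0xD7, 0x0432 },   /* CYRILLIC SMALL LETTER VE */
};

int main(void)
{
    unsigned i;
    for (i = 0; i < sizeof koi8r_frag / sizeof koi8r_frag[0]; i++)
        printf("KOI8-R 0x%02X -> U+%04X\n",
               koi8r_frag[i].byte, koi8r_frag[i].ucs);
    return 0;
}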

> Unicode Consortium
> has no power to force Unicode on anyone. It just happens that it was widely
> accepted.

  So far only one company has actually "accepted" it -- Microsoft.
Everyone else (except Java/Sun) just happens to be dependent on them.
Java and Plan9 are special cases because both are essentially endless
stores of ivory-tower design idiosyncrasy and arbitrary decisions made
by a handful of people.

> You're free to create your own system, or ignore it all together.
> But just because you see no need for Unicode does not mean you should be
> upset when people are willing to work on Unicode support in FreeBSD.

  I have just asked who will benefit from it. No one answered "I will" --
everyone who works on Unicode support believes that it will benefit
someone else.

> 
> >> Anyone who has anything to do with the Internet must deal with UTF-8:
> >> "Protocols MUST be able to use the UTF-8 charset, which consists of the ISO
> >> 10646 coded character set combined with the UTF-8 character encoding
> >> scheme, as defined in [10646] Annex R (published in Amendment 2), for all
> >> text." <RFC 2277>
> 
> >  This is not approved by ANYONE but a bunch of "unificators". It never
> >was widely discussed, and affected people never had a chance to give any
> >input. This is the same kind of "standard documents" that ITU issues by
> >dozens.
> 
> Affected in what way? Many ways of encoding Unicode were proposed,
> developed, and used. Most of them are history by now. UTF-8 is the best way
> to encode Unicode to this day. Don't like it? Design a better one.

  I am not talking about Unicode representations and encodings, but about
Unicode itself. I agree that UTF-8 is the only way to marry Unicode with
text and Unix; however, I don't see much point in doing that.
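
  To be fair, the encoding scheme itself is simple -- a sketch of the
published algorithm (RFC 2279 these days), restricted here to code
points below U+10000; the four-byte form extends the same pattern:

/* Sketch of the UTF-8 encoding scheme for code points up to U+FFFF.
 * Structural only -- no range or surrogate checks. */
#include <stdio.h>

static int utf8_encode(unsigned int ucs, unsigned char *out)
{
    if (ucs < 0x80) {                 /* 1 byte: 0xxxxxxx (plain ASCII) */
        out[0] = ucs;
        return 1;
    }
    if (ucs < 0x800) {                /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = 0xC0 | (ucs >> 6);
        out[1] = 0x80 | (ucs & 0x3F);
        return 2;
    }
    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = 0xE0 | (ucs >> 12);
    out[1] = 0x80 | ((ucs >> 6) & 0x3F);
    out[2] = 0x80 | (ucs & 0x3F);
    return 3;
}

int main(void)
{
    unsigned char buf[3];
    int i, n = utf8_encode(0x0430, buf);  /* U+0430, Cyrillic small 'a' */
    for (i = 0; i < n; i++)
        printf("0x%02X ", buf[i]);        /* prints 0xD0 0xB0 */
    printf("\n");
    return 0;
}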

> 
> >> >-- I am Russian.
> >> 
> >> So?
> >
> >  So I don't want UTF-8 to be forced on me.
> 
> Who's forcing it on you?

  The IETF. All recent RFCs are littered with references to UTF-8 in all
places where reasonable standards would say "8-bit clean" with no
explicit low-level semantics attached.

> > Charset definitions in MIME
> >headers exist for a reason. If we want to make something usable we can
> >create a format that can encapsulate existing charsets instead of banning
> >them altogether and replacing with "unified" stuff where cut(1) and
> >dd(1) can produce the output that will be declared "illegal" to be
> >processed as text because it can not be a valid UTF-8 sequence.
> 
> You are worried about nothing. No one in this discussion has said anything
> about making anything but Unicode and UTF-8 "illegal." Supporting Unicode
> does not mean stopping support for everything else.

  I have spent enough time with "unicoders" to become convinced that the
depth of the changes they demand in protocols and libraries makes it a
game of "everything or nothing" -- partial implementations become unsafe
because the design of libraries and protocols hinges on the idea that
only one charset/encoding may exist, so no way to declare the charset
and encoding is left.
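
  And the cut(1)/dd(1) point from my earlier message is trivial to
demonstrate. A sketch -- the bytes are simply the UTF-8 encoding of
Cyrillic "da":

/* Sketch: what a byte-counting tool such as dd(1) does to UTF-8.
 * "da" is 0xD0 0xB4 0xD0 0xB0 in UTF-8; cutting it after three
 * bytes (the moral equivalent of `dd bs=1 count=3`) leaves a
 * dangling lead byte -- "invalid" output from a valid operation. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const unsigned char text[] = { 0xD0, 0xB4, 0xD0, 0xB0, 0 };
    unsigned char cut[4];

    memcpy(cut, text, 3);    /* a byte-exact cut, as dd would make */
    cut[3] = 0;

    /* cut[2] is 0xD0 -- a lead byte promising a continuation byte
     * that is no longer there. */
    printf("last byte: 0x%02X (dangling UTF-8 lead byte)\n", cut[2]);
    return 0;
}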

> >  One of the most basic strengths of Unix is the ease with which text can
> >be manipulated, and how "non-text" data can be processed using the same
> >tools without any complex "this is text and this is not"
> >application-specific procedures.
> 
> Nothing complex about it. UTF-8 uses a very simple algorithm which makes it
> very simple to distinguish text from non-text.

  This is the problem. There is no "text" and "non-text" -- there is
"valid UTF-8" and everything else. Software designed in the "Unix style"
can't do heuristics and guess that if the data has some property (such
as passing a UTF-8 validity test) it is really some particular kind of
data and should be treated in some different manner. It's irresponsible
to assume that everything that "looks like UTF-8" is text and everything
else is "binary", unless all the program does is display the data to the
user. Worse still is the situation where the UTF-8 validity test is
applied to an endless stream (such as data arriving on stdin) -- for how
long must the data contain only valid UTF-8 sequences before it is
considered "text"? And what should the program do if somewhere in the
middle of the 65537th megabyte a "non-text" sequence of bytes is found?
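
  To make the dilemma concrete, a minimal incremental check of UTF-8
structure over stdin (a sketch; it doesn't even reject overlong forms).
Note the asymmetry: it can say "no" at any byte, but it can never say
"yes" before EOF -- and an endless stream has no EOF:

/* Sketch: incremental UTF-8 structure check over stdin. */
#include <stdio.h>

int main(void)
{
    int c, cont = 0;          /* continuation bytes still expected */
    unsigned long pos = 0;

    while ((c = getchar()) != EOF) {
        pos++;
        if (cont > 0) {
            if ((c & 0xC0) != 0x80) goto bad;  /* wanted 10xxxxxx */
            cont--;
        } else if (c < 0x80) {
            ;                                  /* ASCII, fine */
        } else if ((c & 0xE0) == 0xC0) {
            cont = 1;                          /* 110xxxxx */
        } else if ((c & 0xF0) == 0xE0) {
            cont = 2;                          /* 1110xxxx */
        } else if ((c & 0xF8) == 0xF0) {
            cont = 3;                          /* 11110xxx */
        } else {
            goto bad;                          /* stray continuation etc. */
        }
    }
    printf("valid so far -- but only \"so far\"\n");
    return 0;
bad:
    printf("byte %lu: not valid UTF-8 -- so what was it until now?\n", pos);
    return 1;
}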

> >UTF-8 turns "text" into something that
> >gives us a dilemma -- to redesign everything to treat "text" as the stream
> >of UTF-8 encoded Unicode (and make it impossible to combine text and
> >"non-text" without a lot of pain), or to leave tools as they are and deal
> >with "invalid" output from perfectly valid operations.
> 
> You don't have to treat everything as the stream of UTF-8 encoded Unicode.
> Again, supporting Unicode does not mean EVERYTHING must be Unicode. That
> would not make sense, at least not now. It may in the future. Unicode is
> here to stay.

  So was Microsoft. Almost every "is here to stay" I have heard in the
last seven years was about Microsoft and its standards. I hope people
are now slowly starting to realize that this particular monster is no
more immortal than the others before it.

> >In
> >Windows/Office/... which lives and feeds on complex and unparseable formats,
> >this problem couldn't appear or even be thought of -- "text" doesn't exist
> >as text at all, and the less stuff looks like something usable outside of
> >the strict "object" environment, the better (they now don't even encode it
> >in UTF-8, and use bare 16-bit Unicode). In a Unixlike system it's a
> >violation of some very basic rules.
> 
> What does Windows have to do with Unicode? Windows support for Unicode
> sucks royally. Except for NT, Windows' Unicode support is virtually
> non-existent.

  It takes a lot of ingenuity to screw up the very basic ideas that were
put into a system's design, but as we know Microsoft programmers are
very skilled at that. If you look at Microsoft APIs, filesystems and
recent document formats, the use of Unicode is at the very heart of them
(and, being an amateurish conspiracy theorist, I consider it to be one
of their means of interface obfuscation).

> When did it stop Unix programmers from doing something Microsoft cannot
> handle? Unix already handles Unicode better than anything under Windows.
> For example, Lynx handles Unicode quite well, and it does it on text-only
> displays that have no way of supporting a multitude of fonts.

  Unix handles all encodings well precisely because it is currently
encoding-independent, and adding support for any of them is a relatively
small effort. However, lacking the _infrastructure_ to carry
charset/encoding/language information along with the text (MIME was a
good start, but it was insufficient and its development stopped too
soon, so it's horribly outdated now), it can become a "battleground of
formats" just like everything else is now. It will be very sad if, as a
result, that flexibility is lost and some bloated "standard" emerges
just because a bunch of people were able to organize their "standard
committee" aggressively enough to silence everyone else, as has already
happened at the IETF.

  I believe the design of such an infrastructure is a much more
important and practical task than the "adoption of Unicode" (which I
regard as being about as practical as converting /etc/passwd and the
output of ifconfig to XML, adding embedded-object support to the login
prompt, or rewriting init in Java).

-- 
Alex



