Date:      Tue, 04 Apr 2000 10:05:44 -0500
From:      "G. Adam Stanislav" <adam@whizkidtech.net>
To:        Alex Belits <abelits@phobos.illtel.denver.co.us>
Cc:        freebsd-hackers@FreeBSD.ORG
Subject:   Re: Unicode on FreeBSD
Message-ID:  <3.0.6.32.20000404100544.00882db0@mail85.pair.com>
In-Reply-To: <Pine.LNX.4.10.10004032159530.890-100000@mercury>
References:  <3.0.6.32.20000403233641.008e6590@mail85.pair.com>

At 22:51 03-04-2000 -0700, Alex Belits wrote:
>  I agree that Unicode created a good list of glyphs, and it can be
>useful for fonts and conversion tables, but it's completely inappropriate
>as the base of format used in real-life applications for storage and
>communications.

Oh, I think it's great for communications. I design web sites. It is good
to have a single character representation supported by Internet standards.
Saves a lot of work. Before UTF-8 became widely accepted, a typical Slovak
web page started with a menu asking which encoding your browser supported.
You had to have 3 - 4 versions of each page. A major pain! Now you only
need one.

Or even when designing English pages in a typographically correct way
(opening and closing quotes, and things like that), it was a pain before
UTF-8 because, while ISO-8859-1 is the assumed default, Microsoft, in its
infinite wisdom, created a slight modification of ISO-8859-1 which they
called ANSI, and which the uninitiated commonly believed to be the same as
ISO-8859-1. As a result, there are a myriad of web pages out there that use
the Microsoft encoding, and there are those that use true ISO-8859-1. So,
many browsers assume that you are using the MS "standard." It's a real mess.
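
To make the mismatch concrete, here is a minimal sketch in C. It assumes the
"ANSI" encoding above is what is now usually labelled Windows-1252, and it
covers only the four curly-quote bytes; the function names are invented for
this illustration and the table is deliberately incomplete.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ISO-8859-1: every byte maps to the code point with the same value,
 * so 0x80-0x9F decode to C1 control characters. */
uint32_t latin1_to_unicode(unsigned char b)
{
    return b;
}

/* Windows-1252 ("ANSI"), partial: only the curly quotes are mapped.
 * The real code page reassigns most of 0x80-0x9F; this sketch does not. */
uint32_t cp1252_to_unicode(unsigned char b)
{
    switch (b) {
    case 0x91: return 0x2018; /* left single quotation mark  */
    case 0x92: return 0x2019; /* right single quotation mark */
    case 0x93: return 0x201C; /* left double quotation mark  */
    case 0x94: return 0x201D; /* right double quotation mark */
    default:   return b;      /* other bytes are not handled here */
    }
}

/* Count bytes in 0x80-0x9F, the range where the two encodings disagree.
 * A non-zero result hints that a page labelled ISO-8859-1 may really be
 * using the Microsoft encoding. */
size_t count_c1_range_bytes(const unsigned char *s, size_t n)
{
    size_t i, hits = 0;

    assert(s != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (s[i] >= 0x80 && s[i] <= 0x9F)
            hits++;
    }
    return hits;
}
```

The same byte 0x93 comes out as an opening double quote under one decoder
and as an invisible control character under the other, which is the mess
described above.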

So, in all my recent pages I use UTF-8, and the problem is solved.

>> Unicode Consortium
>> has no power to force Unicode on anyone. It just happens that it was widely
>> accepted.
>
>  So far only one company has actually "accepted" it -- Microsoft. Everyone
>else (except Java/Sun) just happened to be dependent on them. Java and
>Plan9 are special cases because both are essentially endless storages of
>ivory-tower design idiosyncrasy and arbitrary decisions made by a handful
>of people.

I was not talking about companies. I was talking about people with genuine
i18n needs. When I started working on Unicode support for FreeBSD (work
I unfortunately had to interrupt due to serious health problems), I
subscribed to the Unicode mailing list. People on the list come from
different backgrounds, mostly Unix actually. The most active ones who make
serious proposals for additions to Unicode are Unix people.

>  I have just asked, who will benefit from it. No one answered "I will" --
>everyone who makes Unicode support believes that it will benefit someone
>else.

I thought I did. OK, let me restate: I will! I actually do already because
I did some work and it is in the ports.

>  I am not talking about Unicode representations and encodings but about
>Unicode itself. I agree that UTF-8 is the only way to marry Unicode with
>text and Unix, however I don't see much point in doing that.

Well, that's fine. You don't need it. I do. UTF-8 has many nice advantages
for a Unix programmer, which is probably why it became so widely accepted.
For example, standard C string functions work with UTF-8: strcmp, strcpy,
and other str* functions work without modification. The only possible
limitation is that strlen will give you the number of bytes rather than the
number of characters, but that is probably the intended meaning anyway
(e.g., if you need to see how much memory you need to store a UTF-8 string,
its strlen + 1 will still work as intended).
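
A small sketch of the bytes-versus-characters distinction, assuming the
input is already valid UTF-8 (no validation is attempted). The helper name
utf8_codepoints is invented for this example and is not a libc function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count code points in a NUL-terminated string assumed to be valid
 * UTF-8: every byte that is not a continuation byte (10xxxxxx) starts
 * a new code point. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "caf\xC3\xA9" (the word "café" in UTF-8), strlen reports 5
bytes while utf8_codepoints reports 4 characters. A buffer of strlen + 1
bytes is still the right size for a copy, and strcmp and strcpy handle the
string unchanged because no byte of a multi-byte sequence is ever zero.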

UTF-8 also transparently supports both the original Unicode which is 16
bits wide and the new ISO 10646 which is 31 bits wide.
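
As an illustration of how one scheme covers both ranges, here is a sketch
of an encoder for the original one-to-six-byte UTF-8 layout, which reaches
31 bits. The function name is invented for this example. Note that the
later RFC 3629 restricts UTF-8 to four bytes (up to U+10FFFF); this sketch
follows the older, wider form discussed here and does not reject
surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point (up to 31 bits) into out[], which must hold at
 * least 6 bytes. Returns the number of bytes written, or 0 if cp does
 * not fit in 31 bits. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    /* Largest code point that fits in 1, 2, 3, 4 and 5 bytes. */
    static const uint32_t limit[] = {
        0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF
    };
    /* Lead-byte prefixes for sequences of 2 to 6 bytes. */
    static const unsigned char lead[] = { 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    size_t len = 1;
    size_t i;

    assert(out != NULL);
    if (cp > 0x7FFFFFFF)
        return 0;
    while (len < 6 && cp > limit[len - 1])
        len++;
    if (len == 1) {
        out[0] = (unsigned char)cp;   /* plain ASCII, unchanged */
        return 1;
    }
    /* Fill continuation bytes from the end, 6 bits at a time. */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    out[0] = (unsigned char)(lead[len - 2] | cp);
    return len;
}
```

Everything in the 16-bit range needs at most three bytes; the longer
sequences only appear for code points beyond it, so text limited to the
16-bit repertoire encodes identically under either definition.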


>> >  So I don't want UTF-8 to be forced on me.
>> 
>> Who's forcing it on you?
>
>  IETF. All recent RFCs are littered with references to UTF-8 in all
>places where reasonable standards would have "8-bit clean" with no
>explicit low-level semantics attached.

All they say is that UTF-8 must be supported by all protocols. They don't
say other encodings must not or should not be used. If you need to keep
the eighth bit clear, use UTF-7. You can still use MIME.

Personally, I see no problem designing a file that can mix text data with
other data. The "control" characters still exist in Unicode, so it is easy
to use a control character followed by the size of the data to delimit the
start of binary data.
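
One way such framing could look, purely as a sketch: the marker byte (DLE,
0x10), the 4-byte big-endian length field and both function names are
assumptions made up for this example, not part of any standard or existing
file format. It also assumes the marker never appears in the text portion
itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BIN_MARK 0x10  /* DLE; chosen arbitrarily for this sketch */

/* Write marker, 4-byte big-endian length, then the payload into out.
 * Returns bytes written, or 0 if out (capacity cap) is too small. */
size_t put_binary(unsigned char *out, size_t cap,
                  const void *data, uint32_t len)
{
    assert(out != NULL && (data != NULL || len == 0));
    if (cap < 5 || cap - 5 < len)
        return 0;
    out[0] = BIN_MARK;
    out[1] = (unsigned char)(len >> 24);
    out[2] = (unsigned char)(len >> 16);
    out[3] = (unsigned char)(len >> 8);
    out[4] = (unsigned char)len;
    if (len > 0)
        memcpy(out + 5, data, len);
    return (size_t)len + 5;
}

/* If in[0..n) starts with a complete binary section, store the payload
 * length in *len and return the total section size; otherwise 0. */
size_t get_binary(const unsigned char *in, size_t n, uint32_t *len)
{
    uint32_t l;

    assert(in != NULL && len != NULL);
    if (n < 5 || in[0] != BIN_MARK)
        return 0;
    l = ((uint32_t)in[1] << 24) | ((uint32_t)in[2] << 16) |
        ((uint32_t)in[3] << 8) | (uint32_t)in[4];
    if (n - 5 < l)
        return 0;
    *len = l;
    return (size_t)l + 5;
}
```

A reader processing the file as text would call get_binary whenever it
meets the marker and skip the returned number of bytes, so the payload is
never interpreted as UTF-8.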

>  I have spent enough time with "unicoders" to become convinced that the
>depth of changes they demand in protocols and libraries is enough to make
>it a game of "everything or nothing" -- partial implementations become
>unsafe because the design of libraries and protocols hinges on the idea
>that only one charset/encoding may exist, so no ways to provide charset
>and encoding are left.

I have not encountered that attitude. I have seen people who see the
advantages of Unicode to the point they do not use anything else in their
work, but I do not see them trying to force everyone else to go Unicode only.

>  This is the problem. There is no "text" and "non-text" -- there is
>"valid UTF-8" and everything else. Software designed in "unix style"
>can't do heuristics and guess that if the data has some properties (such
>as passing UTF-8 validity test) it is really some particular kind of data
>and should be treated in some different manner.

It does not need to.

>> Again, supporting Unicode does not mean EVERYTHING must be Unicode. That
>> would not make sense, at least not now. It may in the future. Unicode is
>> here to stay.
>
>  So was Microsoft. Almost all mentions of "is here to stay" that I
>have heard in the last seven years were about Microsoft and its standards.

I never said MS was here to stay. I personally do believe Unicode is. Not
because it is perfect, mind you. There are many design flaws in Unicode. In
a way, Unicode was a quick hack. For example, in the old ASCII you could
easily convert a lower case letter to upper case by modifying a single bit.
You can't do that in Unicode. But Unicode is so widespread by now that
trying to create an alternative would most likely cause more problems than
it would solve. So, it is here to stay, for better or worse. Perhaps not
forever, but for a reasonable period of time.
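
The ASCII trick mentioned above, as a minimal sketch (the function name is
invented for this example): upper and lower case letters differ only in
bit 5, value 0x20.

```c
#include <assert.h>

/* Flip the case of an ASCII letter by toggling bit 5 (0x20).
 * Anything that is not A-Z or a-z is returned unchanged. */
unsigned char ascii_toggle_case(unsigned char c)
{
    /* Guard: this only holds on an ASCII-compatible character set. */
    assert(('a' ^ 'A') == 0x20);

    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
        return (unsigned char)(c ^ 0x20);
    return c;
}
```

No single bit does the same job across Unicode. For example, U+0101 and
U+0100 (a with macron) differ in bit 0 rather than bit 5, and the upper
case of U+00FF (y with diaeresis) is U+0178, so general case mapping needs
a table.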

>  It takes a lot of ingenuity to screw up the very basic idea that was put
>into the system design, however as we know Microsoft programmers are very
>skilled at that. If you look at Microsoft APIs, filesystems and recent
>document formats, the use of Unicode is in the very heart of them (and
>being an amateurish conspiracy theorist I consider it to be one of their
>means of interface obfuscation).

Yes, it's in their API but it still sucks. The main problem is that the
Unicode API only really works on NT, so a programmer who wants to support
both NT and the 95/98/2000 variety really cannot use it (not to mention it
is limited to 16 bits, so it does not support ISO 10646).

>  Unix handles all encodings well precisely because currently it's
>encodings-independent, and adding the support for any of them is a
>relatively small effort.

Then there should be no problem adding Unicode support, right? :)

>  I believe, the design of such infrastructure is much more important and
>practical task than "adoption of Unicode" (that I regard as being just as
>practical as conversion of /etc/passwd and output of ifconfig into XML,
>adding embedded objects support in login prompt or rewriting init in
>java).

Again, it's not about "adoption" of Unicode, it's about supporting Unicode
for those who need it. Going Unicode-only would not be wise, but I don't
see anyone here suggesting that.

Cheers,
Adam
-----------------------------------------------------------
"I think, therefore I am."
                    - Seventeenth Century Philosophy

"I publish what I think, therefore I have."
                    - Twenty-First Century Action

Details at http://www.OnlinePublisher.net/





