Date: Tue, 4 Apr 2000 12:08:39 -0700 (PDT)
From: Alex Belits
To: "G. Adam Stanislav"
Cc: freebsd-hackers@FreeBSD.ORG
Subject: Re: Unicode on FreeBSD
In-Reply-To: <3.0.6.32.20000404100544.00882db0@mail85.pair.com>

On Tue, 4 Apr 2000, G. Adam Stanislav wrote:

> At 22:51 03-04-2000 -0700, Alex Belits wrote:
> > I agree that Unicode created a good list of glyphs, and it can be
> >useful for fonts and conversion tables, but it's completely inappropriate
> >as the base of format used in real-life applications for storage and
> >communications.
>
> Oh, I think it's great for communications. I design web sites. It is good
> to have a single character representation supported by Internet standards.
> Saves a lot of work. Before UTF-8 became widely accepted, a typical Slovak
> web page started by a menu of choices of which encoding your browser
> supported. You had to have 3 - 4 versions of each page. A major pain! Now
> you only need one.

This is a problem; however, Unicode is not the only solution -- actually
it's the worst of all solutions -- it solves a simple problem only to create
a lot of complex ones.
>
> Or even when designing English pages in a typographically correct way
> (opening and closing quotes, and things like that), it was a pain before
> UTF-8 because while ISO-8859-1 is the assumed default, Microsoft, in its
> infinite wisdom created a slight modification of ISO-8859-1 which they
> called ANSI, and which the uninitiated commonly believed to be the same as
> ISO-8859-1. As a result, there are a myriad of web pages out there that use
> the Microsoft encoding, and there are those that use true ISO-8859-1. So
> many browsers assume that you are using the MS "standard." It's a real mess.

Misrepresentation of one popular encoding in the software of one company
doesn't mean that it should be replaced with another, much more complex
one, by everyone else.

>
> So, in all my recent pages I use UTF-8, and the problem is solved.
>
> >> Unicode Consortium
> >> has no power to force Unicode on anyone. It just happens that it was widely
> >> accepted.
> >
> > So far only by one company actually "accepted" it -- Microsoft. Everyone
> >else (except Java/Sun) just happened to be depended on them. Java and
> >Plan9 are special cases because both are essentially endless storages of
> >ivory-tower design idiosyncrasy and arbitrary decisions made by handful of
> >people.
>
> I was not talking about companies. I was talking about people with genuine
> i18n needs.

People with genuine i18n needs such as linguists, or people with genuine
i18n needs such as non-English users? Linguists don't see Unicode as being
sufficient, and everyone else uses local encodings/charsets. I agree that
local encodings are very limiting in the form they exist in now; however
they, not Unicode, are the standards used in real life.
If some encapsulation format (not as limited as ISO 2022 and not as
restrictive as MIME multipart) were created to support multiple
charsets/encodings/languages in one document in labeled chunks, the same
problem would be solved with minimal changes to existing software and
minimal document conversion effort. This solution would be far superior to
Unicode, and even for "web" use it could be made compatible with the
charset support in existing browsers.

[skipped without much of disagreement]

> Again, it's not about "adoption" of Unicode, it's about supporting Unicode
> for those who need it. Going Unicode-only would not be wise, but I don't
> see anyone here suggesting that.

After looking at what happened to IETF documents, XML and perl, I can only
come to the conclusion that Unicode, once included in some system that
didn't previously have a multiple-charset document support infrastructure,
starts requiring more and more sacrifices to be supported decently, until
the support of other encodings becomes impossible or significantly more
difficult than the support of Unicode.

I am not against the support of any charset, encoding or language used in
the real world, Unicode included. However, after seeing how Unicode
"support" efforts quickly turn into "adoption" all across the
libraries/protocols/applications layers, I believe that only if some decent
charset/encoding/language labeling infrastructure is developed will it be
possible to contain charsets and prevent their "leaking" to the application
level.

Leaking of ASCII (the infamous 7-bit restriction that was present for no
understandable reason in a lot of protocols and utilities) was a painful
enough experience already, and it looks like it's fixed in most places by
now. Leaking of local charsets (especially ISO 8859-1 and its
modifications) was bad; however it was mostly prevented by locale support
(even though that is clumsy and unusable in multilingual documents).
Leaking of Unicode and UTF-8 can start something even worse, because it's
already evident that many applications written to support the UTF-8
character format have the assumption of this format hardcoded in their I/O
and parsing routines, which otherwise are supposed to be either
charset-blind or to use external, charset-dependent routines to determine
character boundaries.

I don't want to be misunderstood as an opponent of all things Unicode --
as I have said, its support is useful. However I oppose:

1. The point of view that Unicode is the only possible or the best
   possible way to handle multilingual documents.

2. The point of view that support of Unicode should be made at the expense
   of compatibility with everything else, or by the introduction of some
   unsafe guesswork such as applying a UTF-8 validity check to determine
   whether a chunk of data is in UTF-8 or not.

I see the "support" or "adoption" of Unicode as a threat only if it is
based on those ideas, and I think that the development of a
charset/encoding/language labeling or encapsulation format and handling
routines, even if it is not "blessed" by IETF or TOG, will provide a means
of safe, compatible and relatively easy handling of multilingual documents,
including ones that are completely or in part in Unicode.

Unicode documents themselves suffer from the lack of language-labeling
information, and there is a (currently unused, however "standardized") way
to label _language_ (not charset, subset or encoding) within Unicode text.
It's not used because it contradicts the idea of "easy", completely
stateless and non-encapsulated Unicode text, so its support is almost
completely impossible in existing Unicode support infrastructures. Instead
language labeling is pushed up into XML (or other format) parsers and
applications, thus making it application-dependent and ultimately
unreliable.
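To make the guesswork objected to in point 2 concrete, here is a minimal
Python sketch. It is not taken from any real application: the function name
looks_like_utf8 and the sample byte strings are made up for illustration. It
only shows that a UTF-8 validity check cannot recover what the sender meant:
the same two bytes are legal both as UTF-8 and as ISO 8859-1, and plain
ASCII passes the check without saying anything about the charset.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Hypothetical 'validity check' heuristic: True if data decodes as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Two bytes that are legal in both charsets, with different meanings.
ambiguous = b"\xc3\xa9"
as_utf8 = ambiguous.decode("utf-8")         # one character, U+00E9
as_latin1 = ambiguous.decode("iso-8859-1")  # two characters, U+00C3 U+00A9

# Typical ISO 8859-1 text is rejected, which makes the check look reliable...
typical_latin1 = b"caf\xe9"

# ...but plain ASCII passes, although it is equally valid in any
# ASCII-compatible charset, so a "yes" tells us nothing about the label.
plain_ascii = b"plain text"

print(looks_like_utf8(ambiguous), len(as_utf8), len(as_latin1))
print(looks_like_utf8(typical_latin1), looks_like_utf8(plain_ascii))
```

A positive answer from such a check is therefore only a guess about the
charset; an explicit label on the chunk of data would make the guess
unnecessary.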
I think that if some more reasonable labeling system (encapsulation,
metadata or attribute handling -- whatever it ends up being called) is
created for text "documents", it can solve this problem by just assigning
charset, encoding and language to pieces of text, and leaving "unknown" or
unattributed text alone, not allowing language-specific or
charset-dependent routines to touch it. In a system like this Unicode will
be labeled as Unicode, UTF-8 will be labeled as UTF-8, and Russian language
will be labeled as Russian language independently. This allows building a
language support infrastructure that in most places can use existing
formats safely: languages will be clearly marked where known, no guesswork
will be applied, and no conversion to Unicode (or anything else) will be
required.

--
Alex

----------------------------------------------------------------------
 Excellent.. now give users the option to cut your hair you hippie!
                                                  -- Anonymous Coward