From owner-freebsd-hackers Thu Jun 11 18:21:30 1998
Return-Path:
Received: (from majordom@localhost)
	by hub.freebsd.org (8.8.8/8.8.8) id SAA24717
	for freebsd-hackers-outgoing; Thu, 11 Jun 1998 18:21:30 -0700 (PDT)
	(envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from smtp02.primenet.com (daemon@smtp02.primenet.com [206.165.6.132])
	by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id SAA24451
	for ; Thu, 11 Jun 1998 18:20:06 -0700 (PDT)
	(envelope-from tlambert@usr09.primenet.com)
Received: (from daemon@localhost)
	by smtp02.primenet.com (8.8.8/8.8.8) id SAA01296;
	Thu, 11 Jun 1998 18:19:36 -0700 (MST)
Received: from usr09.primenet.com(206.165.6.209)
	via SMTP by smtp02.primenet.com, id smtpd001280;
	Thu Jun 11 18:19:34 1998
Received: (from tlambert@localhost)
	by usr09.primenet.com (8.8.5/8.8.5) id SAA06619;
	Thu, 11 Jun 1998 18:19:28 -0700 (MST)
From: Terry Lambert
Message-Id: <199806120119.SAA06619@usr09.primenet.com>
Subject: Re: internationalization
To: kline@tao.thought.org (Gary Kline)
Date: Fri, 12 Jun 1998 01:19:28 +0000 (GMT)
Cc: joy@urc.ac.ru, itojun@itojun.org, tlambert@primenet.com, hackers@FreeBSD.ORG
In-Reply-To: <199806112234.PAA12768@tao.thought.org> from "Gary Kline" at Jun 11, 98 03:34:13 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> Let me pose the same question, a bit more broadly.
> Why cannot we support _both_ the ISO and Unicode
> paradigms?  Are these absolutely incompatible systems?
> Is there some kind of ``religious-war''?  Or is it
> simply too difficult?

ISO 10646 code page 0 *is* Unicode, by definition.

The religious aspects have to do with the old trade-offs the various
programmers are already used to, the new trade-offs the various
programmers would have to start putting up with, and the various
language bigotries people bring to the table.

Major premise: everyone is going to have to put up with a non-8-bit
wchar_t internally in their applications.  This is called the "raw"
or "process" representation.

----------------------------------------------------------------------
----------------------------------------------------------------------
Are most of your files in ASCII?
----------------------------------------------------------------------

Then you want UTF7/UTF8/ISO2022 encoding, so you don't have to
change them.

Unless you plan to export your software.

Let the non-English speaking world deal with the incompatibilities
and storage bloat problems.  You'll deal with it in your software
when Japan and Europe "get their act together" and standardize on
IBM-PC derived hardware so that your software won't have to be
ported to run.

Besides, C code is in the "C" locale, and that's US-ASCII already.
GCC supports trigraphs, right?

----------------------------------------------------------------------
Are most of your files in ISO8859-X and/or KOI-8X?
----------------------------------------------------------------------

Then you don't want UTF7/UTF8, because if you get them, some
characters that currently take up one byte will take up between
one and three bytes (one if they are US ASCII, more if they are
in the 0x80-0xff range).

You also don't want ISO2022, because instead of simply choosing a
locale for all your data, you will have to deal with character set
shift processing.

You could put up with UTF2, because you could do kernel magic to
expand existing text files on existing filesystems by setting a
per FS attribute that tells how to get the data in and out of
Unicode representation.  You still need a "magic doohickey" that
tells the filesystem to do this for text files, but not for other
files.

----------------------------------------------------------------------
Are most of your files in ISO2022-jp (JIS-208/JIS-212)?
----------------------------------------------------------------------

Then you don't want UTF7/UTF8/UTF2 encoding, because you don't
want to have to convert your data.

You don't want Unicode because it means you'll have to deal with
the sorting problem all over again because Unicode's collation
sequence isn't the JIS-208/JIS-212 collation sequence.

You don't care about all the crap that goes with multibyte
encoding, because you've already dealt with all the bugs that
causes in all your existing software.

You don't care about the storage bloat, because you already need
as many bytes as the bloat will cause to store the characters in
the character sets you use, so it doesn't matter to you that the
code produced in other countries will bloat up and start evidencing
bugs it didn't use to have before they tried to localize into your
locale.

----------------------------------------------------------------------
----------------------------------------------------------------------

What this boils down to is language bigotry, and whose language
you prefer.

Generally, the preference is driven by either personal or economic
interests (like competitive advantage to your own locale from
having your locale's preferred method chosen).

The short-sighted approach is to make the decision based on your
own personal bigotry.

The longer-sighted approach is to make the decision which has the
best workarounds for backward compatibility and in-place conversion,
and the least impact in the future, based on the assumption that the
software market is going to normalize all over the world at some
point in the future, and you just may be around still and have to
deal with it.  Like the Y2K problem.

Of course, this totally ignores the fact that Microsoft owns the
world at the present time, and they've already made the correct
long term decision on the assumption that they will be around
forever and have to deal with it... another decision based on
economic interests, in fact.

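The storage trade-offs in the three cases above can be checked with a
short sketch.  This is only an illustration under stated assumptions:
the sample strings are made up, Python's "utf-16-be" codec stands in
for a fixed 16-bit Unicode storage format, and "iso2022_jp" stands in
for the ISO2022-jp files discussed above.

```python
# Illustrative only: the sample strings and the choice of codecs are
# assumptions, not anything taken from the original discussion.
SAMPLES = {
    "US-ASCII": ("hello", "ascii"),
    "ISO8859-1": ("café", "iso8859-1"),
    "KOI8-R": ("привет", "koi8-r"),
    "ISO2022-JP": ("日本語", "iso2022_jp"),
}


def storage_sizes(text, legacy_codec):
    """Return byte counts for text in its legacy encoding, UTF-8, and 16-bit Unicode."""
    return {
        "legacy": len(text.encode(legacy_codec)),
        "utf-8": len(text.encode("utf-8")),
        # Big-endian without a byte order mark: 2 bytes per character
        # for the characters used here.
        "utf-16": len(text.encode("utf-16-be")),
    }


for name, (text, codec) in SAMPLES.items():
    print(name, storage_sizes(text, codec))
```

On these samples the ASCII text is unchanged under UTF-8, the Cyrillic
text doubles in size, and the Japanese text needs at least as many
bytes in ISO2022-jp (escape sequences included) as in either Unicode
form.
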
If the aliens land, and we end up needing more than 2^16 characters
in our wchar_t space, well, we can deal with that problem when it
happens.

					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message