From owner-freebsd-hackers Thu May 4 12:15:26 1995
Return-Path: hackers-owner
Received: (from majordom@localhost) by freefall.cdrom.com (8.6.10/8.6.6) id MAA19782 for hackers-outgoing; Thu, 4 May 1995 12:15:26 -0700
Received: from cs.weber.edu (cs.weber.edu [137.190.16.16]) by freefall.cdrom.com (8.6.10/8.6.6) with SMTP id MAA19776 ; Thu, 4 May 1995 12:15:22 -0700
Received: by cs.weber.edu (4.1/SMI-4.1.1) id AA09917; Thu, 4 May 95 13:08:40 MDT
From: terry@cs.weber.edu (Terry Lambert)
Message-Id: <9505041908.AA09917@cs.weber.edu>
Subject: Re: Can someone explain the various forms of Japanese text encoding?
To: jkh@time.cdrom.com (Jordan K. Hubbard)
Date: Thu, 4 May 95 13:08:40 MDT
Cc: ache@FreeBSD.org, hackers@FreeBSD.org
In-Reply-To: <16984.799556850@time.cdrom.com> from "Jordan K. Hubbard" at May 3, 95 08:07:30 pm
X-Mailer: ELM [version 2.4dev PL52]
Sender: hackers-owner@FreeBSD.org
Precedence: bulk

> So far I've seen "romanji", which appears to be a romanized form of
> Japanese, JIS (which is?) and "EUC" (which is?). I'd like to support
> the "most standard" type for sysinstall, but I'm a little unclear as
> to just exactly what that might be. Romanji looks like the easiest to
> display, but it's probably also the least palatable to the native
> Japanese speaker. Given that I also have *no* Japanese fonts for
> syscons, I'm also somewhat limited in that dept. anyway. There is a
> format I can display with the ISO8859-1 font, according to Satoshi,
> though I'm still a little unclear on how it works.

Romaji (the "romanji" above) is the use of Latin letters and
Romanization rules to provide a phonetic spelling for Japanese. It is
generally useful for Gaijin (foreigners/aliens) trying to get a
speaking vocabulary, or for Japanese trying to get a familiarity with
Latin lettering and basic English letter pronunciation.

JIS is "Japanese Industrial Standard".
Most typically, it refers to the JIS X 0208 (JIS-208) character
encoding standard, which contains many code points for common Japanese
ideograms (English is alphabetic, Kana is phonetic, and Kanji is
ideographic). Ideograms represent words of one or more syllables (a
phonetic alphabet is sometimes called a "syllabary" because each of
its symbols stands for a single syllable; Kanji is not a "syllabary"
since a single ideogram can represent multiple syllables).

JIS can also refer to the JIS X 0212 (JIS-212) standard, which
supplements the 208 standard with symbols not in 208.

EUC is a runic (variable-length, multibyte) character encoding method.
In general, I hate runic encoding because it destroys your ability to
have meaningful file sizes and drastically reduces the usability of
fixed field length storage and input mechanisms. For instance, most
English forms, such as those used in standardized testing, have blanks
for things like your name, etc., with the blanks separated on a per
character basis. Fixed field input on computers typically associates
a screen length and a buffer length, which presumes a 1:1
correspondence between the encoding and the internal (process) coding.
It's understandable that this breaks down when you could end up with
5 bytes for a single symbol being displayed.

The same problem occurs when you go to store the data in a file...
fixed fields cannot be safely used. It smacks of a conspiracy between
the internationalizers and the guy who wrote the VMS record oriented
file system. ;-).

The common encodings for JIS are EUC and shift-JIS, both runic
encodings. The EUC encoding is actually built on ISO 2022. This is
the encoding scheme recognized by XPG/4.

> I would welcome any suggestions or additional information! I'm not
> exactly an expert in I18N issues, though I get the feeling that I'm
> going to know a lot more than I planned about this by the time I'm
> done!
> :)

I18N (internationalization) generally refers to 8-bit clean encoding
used with ISO 8859-x fonts. These are all 7-bit US ASCII in the lower
half, with the characters in the 0x80 and 0x90 columns (0x80-0x9f)
treated as control codes: each is equivalent to an escape character
followed by the character minus 64 (0x40). The remaining 96
characters in the upper half (0xa0-0xff) depend on which 8859 standard
is used. The 8859 standards are also called the Latin character sets
-- that is, 8859-1 is frequently seen referred to as Latin-1.

I18N encoding is used by XPG/3 (which can't handle non-8-bit encoded
languages).

There's a FAQ on this whole internationalization issue that is
frequently posted on comp.std.internat, comp.software.international,
and other standards related groups. It is available for download from
the rtfm FTP sites at MIT and in the UK.

Terry Lambert
terry@cs.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.