From owner-freebsd-hackers  Thu May  4 12:15:26 1995
Return-Path: hackers-owner
Received: (from majordom@localhost)
          by freefall.cdrom.com (8.6.10/8.6.6) id MAA19782
          for hackers-outgoing; Thu, 4 May 1995 12:15:26 -0700
Received: from cs.weber.edu (cs.weber.edu [137.190.16.16])
          by freefall.cdrom.com (8.6.10/8.6.6) with SMTP id MAA19776
          ; Thu, 4 May 1995 12:15:22 -0700
Received: by cs.weber.edu (4.1/SMI-4.1.1)
	id AA09917; Thu, 4 May 95 13:08:40 MDT
From: terry@cs.weber.edu (Terry Lambert)
Message-Id: <9505041908.AA09917@cs.weber.edu>
Subject: Re: Can someone explain the various forms of Japanese text encoding?
To: jkh@time.cdrom.com (Jordan K. Hubbard)
Date: Thu, 4 May 95 13:08:40 MDT
Cc: ache@FreeBSD.org, hackers@FreeBSD.org
In-Reply-To: <16984.799556850@time.cdrom.com> from "Jordan K. Hubbard" at May 3, 95 08:07:30 pm
X-Mailer: ELM [version 2.4dev PL52]
Sender: hackers-owner@FreeBSD.org
Precedence: bulk

> So far I've seen "romanji", which appears to be a romanized form of
> Japanese, JIS (which is?) and "EUC" (which is?).  I'd like to support
> the "most standard" type for sysinstall, but I'm a little unclear as
> to just exactly what that might be.  Romanji looks like the easiest to
> display, but it's probably also the least palatable to the native
> Japanese speaker.  Given that I also have *no* Japanese fonts for
> syscons, I'm also somewhat limited in that dept. anyway.  There is a
> format I can display with the ISO8859-1 font, according to Satoshi,
> though I'm still a little unclear on how it works.

Romaji (often misspelled "Romanji") is the use of Latin letters and
Romanization rules to provide a phonetic spelling for Japanese.  It is
generally useful for Gaijin (foreigners/aliens) trying to get a speaking
vocabulary, or for Japanese trying to get a familiarity with Latin
lettering and basic English letter pronunciation.

JIS is "Japanese Industrial Standard".  Most typically, it refers to
the JIS X 0208 character set standard (often written JIS-208), which
contains code points for the common Japanese ideograms (English is
alphabetic, Kana is phonetic, and Kanji is ideographic).  An ideogram
represents a word or part of a word and may be read as one or more
syllables (a phonetic script like Kana is called a "syllabary" because
each symbol stands for a single syllable; Kanji is not a "syllabary"
since a single ideogram can represent multiple syllables).

JIS can also refer to the JIS X 0212 standard (JIS-212), which is a
supplement to the 208 standard and includes characters not in 208.

EUC is a runic character encoding method; that is, a single character
may occupy a variable number of bytes.  In general, I hate runic
encoding because it destroys your ability to have meaningful file
sizes and drastically reduces the usability of fixed field length
storage and input mechanisms.  For instance, most English forms, such
as those used in standardized testing, have blanks for things like
your name, etc., with the blanks separated on a per character basis.
Fixed field input on computers typically associates a screen length
with a buffer length, which presumes a 1:1 correspondence between the
encoding and the internal (process) coding.  It's understandable that
this falls apart when you could end up with 5 characters (bytes) for a
single symbol being displayed.  The same problem occurs when you go to
store the data in a file... fixed fields can not be safely used.
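
To make the fixed field problem concrete, here is a minimal Python
sketch.  It is not from the original discussion: it assumes Python 3
and its bundled euc_jp codec, and FIELD_BYTES and store_fixed() are
names made up purely for illustration.

```python
# Sketch only: shows why byte counts and character counts diverge
# under a variable-length ("runic") encoding such as EUC-JP.

FIELD_BYTES = 5  # hypothetical fixed field width, in bytes


def store_fixed(text, encoding, width=FIELD_BYTES):
    """Encode text and naively cut it to a fixed byte width."""
    return text.encode(encoding)[:width]


name = "日本語"                 # three Kanji
raw = name.encode("euc_jp")     # each of these Kanji takes two bytes

print(len(name), "characters,", len(raw), "bytes")

# A 5-byte field cuts the third character in half, so the stored
# field no longer decodes.
field = store_fixed(name, "euc_jp")
try:
    field.decode("euc_jp")
    intact = True
except UnicodeDecodeError:
    intact = False
print("field decodes cleanly:", intact)
```

The same three-character name would fit a 3-byte field in a 1:1
encoding; here neither the file size nor the field width says how
many characters are stored.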

It smacks of a conspiracy between the internationalizers and the guy
who wrote the VMS record oriented file system.  ;-).

The common encodings for JIS are EUC and Shift-JIS, both runic encodings.
The EUC encoding is actually an 8-bit application of ISO 2022.

This is the encoding scheme recognized by XPG/4.
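
As a rough illustration of how these encodings relate, the sketch
below encodes one Kanji three ways.  It assumes Python 3 and its
bundled iso2022_jp, euc_jp and shift_jis codecs, none of which are
part of the original discussion.  The point it tries to show is that
the EUC bytes are the 7-bit JIS X 0208 code with the high bit set on
each byte, while Shift-JIS uses different arithmetic.

```python
# Sketch only: one Kanji, three byte-level encodings of JIS X 0208.
ch = "日"

iso = ch.encode("iso2022_jp")   # 7-bit, switched in and out with escapes
euc = ch.encode("euc_jp")       # 8-bit, no escape sequences
sjis = ch.encode("shift_jis")   # 8-bit, different byte arithmetic

# Drop the 3-byte "ESC $ B" prefix and the 3-byte "ESC ( B" suffix
# to get the raw two-byte JIS X 0208 code.
jis = iso[3:-3]

# EUC-JP carries the same two bytes with the high bit set on each.
euc_from_jis = bytes(b | 0x80 for b in jis)

for label, data in (("ISO-2022-JP", iso), ("EUC-JP", euc), ("Shift-JIS", sjis)):
    print(label.ljust(12), data.hex())
print("EUC == JIS | 0x80:", euc_from_jis == euc)
```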

> I would welcome any suggestions or additional information!  I'm not
> exactly an expert in I18N issues, though I get the feeling that I'm
> going to know a lot more than I planned about this by the time I'm
> done! :)

I18N generally refers to 8-bit clean encoding used with ISO 8859-x
fonts, which are all 7 bit US ASCII plus an upper half.  The additional
characters in the 0x80 and 0x90 (0x80-0x9f) columns are considered
equivalent to an escape character plus the character minus 64; in
other words, control codes.  The remainder of the characters in the
upper half (96 of them, 0xa0-0xff) depend on which 8859 standard is
used.  Several of the 8859 standards are also called the Latin
character sets; for example, 8859-1 is frequently seen referred to as
Latin-1.
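
The layout of an 8-bit ISO 8859-x code can be sketched in a few lines
of Python.  This is only an illustration: classify() and c1_to_7bit()
are hypothetical helper names, and the offset of 0x40 (64) used for
the 7-bit equivalent of a 0x80-0x9f control code follows the
ISO 2022 / ECMA-48 convention rather than anything stated by a
particular implementation.

```python
# Sketch only: the four regions of an 8-bit ISO 8859-x code, and the
# 7-bit "ESC + byte" equivalent of the 0x80-0x9f control codes.

ESC = 0x1B


def classify(byte):
    """Name the region of the 8-bit code table a byte falls in."""
    if byte <= 0x1F:
        return "C0 control"
    if byte <= 0x7F:
        return "US ASCII"
    if byte <= 0x9F:
        return "C1 control"
    return "8859-x specific"


def c1_to_7bit(byte):
    """Return the two-byte 7-bit form of a C1 control code."""
    if classify(byte) != "C1 control":
        raise ValueError("not in 0x80-0x9f")
    return bytes([ESC, byte - 0x40])


upper = [b for b in range(256) if classify(b) == "8859-x specific"]
print(len(upper), "characters depend on the 8859 part in use")
print("0x9b is equivalent to", c1_to_7bit(0x9B))  # ESC followed by "["
```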

I18N encoding is used by XPG/3 (which can't handle non-8-bit encoded
languages).

There's a FAQ on this whole internationalization issue that is
frequently posted on comp.std.internat, comp.software.international,
and other standards related groups.  It is available for download from
the rtfm FTP sites at MIT and in the UK.


					Terry Lambert
					terry@cs.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.