From owner-freebsd-hackers  Mon Sep 18 13:38:04 1995
Return-Path: owner-hackers
Received: (from root@localhost)
          by freefall.freebsd.org (8.6.12/8.6.6) id NAA01209
          for hackers-outgoing; Mon, 18 Sep 1995 13:38:04 -0700
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211])
          by freefall.freebsd.org (8.6.12/8.6.6) with ESMTP id NAA01202
          for <hackers@freefall.freebsd.org>; Mon, 18 Sep 1995 13:37:59 -0700
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id NAA08492; Mon, 18 Sep 1995 13:34:25 -0700
From: Terry Lambert <terry@lambert.org>
Message-Id: <199509182034.NAA08492@phaeton.artisoft.com>
Subject: Re: Policy on printf format specifiers?
To: bakul@netcom.com (Bakul Shah)
Date: Mon, 18 Sep 1995 13:34:25 -0700 (MST)
Cc: phk@critter.tfs.com, terry@lambert.org, hackers@freefall.freebsd.org
In-Reply-To: <199509181727.KAA09594@netcom10.netcom.com> from "Bakul Shah" at Sep 18, 95 10:26:58 am
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length: 3755      
Sender: owner-hackers@FreeBSD.org
Precedence: bulk

> > As far as I recall there is still some concern about Sanskrit and 10646
> > isn't there ?
> 
> Last I looked Unicode handled Sanskrit and other Indian
> languages fine.  [Indian languages support is dear to my
> heart so I looked into it back when Unicode-1 was being
> worked on -- AFAIK there have been no changes in this area
> since then]

Sanskrit is supported.  So are Tamil, Devanagari, Hebrew, Arabic, etc.
The common factor among these is that they are ligatured languages, meaning
the glyph for a character will differ based on its location relative to
other glyphs.

For English speakers, the best explanation is "cursive writing": consider
the number of ways you can connect the cursive letter 'e' to other letters,
depending on whether it's at the start of the word, the end of the word,
or in the middle of a word before or after a character like 'd', 'f', 'r',
'z', 'p', 'n', etc.
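
The positional part of that idea can be sketched in a few lines of C.  This
is only an illustration: the enum and the helper glyph_form_at() are
hypothetical names, and the sketch assumes every letter joins to both of its
neighbours, which real scripts do not guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Positional forms a letter can take in a joined (cursive) script. */
enum glyph_form { FORM_ISOLATED, FORM_INITIAL, FORM_MEDIAL, FORM_FINAL };

/*
 * Hypothetical helper: pick the form for the character at index i of a
 * word of length len, assuming every letter joins on both sides.
 * Letters that join on one side only, and ligatures that depend on the
 * specific neighbouring letter, are deliberately ignored here.
 */
enum glyph_form
glyph_form_at(size_t i, size_t len)
{
	assert(i < len);
	if (len == 1)
		return FORM_ISOLATED;
	if (i == 0)
		return FORM_INITIAL;
	if (i == len - 1)
		return FORM_FINAL;
	return FORM_MEDIAL;
}
```

A renderer would use the returned form, together with the character code, to
index into the font; the stored text itself carries only the character code.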

> Presumably Terry wants Unicode support in the kernel so that
> one can print kernel messages in any language.

No, I want it for file names in a Unicode aware file system.

I also want it for translation layers for remote mounts to Unicode
unaware file systems (most NFS systems today).

Finally, I want it for path name parsing translation for locally Unicode
aware/unaware user space applications and underlying file systems.
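
As a rough illustration of what such a translation layer has to do, the
sketch below turns a name made of 16-bit characters into a byte string that
a Unicode-unaware file system could store.  The choice of UTF-8 as the byte
encoding and the function name name16_to_utf8() are assumptions made for the
example, not something proposed in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical translation helper: encode a 0-terminated name of 16-bit
 * code units as UTF-8 bytes.  Returns the number of bytes written
 * (excluding the terminating NUL), or -1 if the output buffer is too
 * small or the name contains a surrogate code unit (0xD800-0xDFFF),
 * which this sketch does not handle.
 */
long
name16_to_utf8(const uint16_t *src, unsigned char *dst, size_t dstlen)
{
	size_t n = 0;

	assert(src != NULL && dst != NULL);
	for (; *src != 0; src++) {
		uint16_t c = *src;

		if (c >= 0xD800 && c <= 0xDFFF)
			return -1;
		if (c < 0x80) {			/* 1 byte: plain ASCII */
			if (n + 1 >= dstlen)
				return -1;
			dst[n++] = (unsigned char)c;
		} else if (c < 0x800) {		/* 2 bytes */
			if (n + 2 >= dstlen)
				return -1;
			dst[n++] = (unsigned char)(0xC0 | (c >> 6));
			dst[n++] = (unsigned char)(0x80 | (c & 0x3F));
		} else {			/* 3 bytes */
			if (n + 3 >= dstlen)
				return -1;
			dst[n++] = (unsigned char)(0xE0 | (c >> 12));
			dst[n++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
			dst[n++] = (unsigned char)(0x80 | (c & 0x3F));
		}
	}
	if (n >= dstlen)
		return -1;
	dst[n] = '\0';
	return (long)n;
}
```

ASCII names pass through unchanged, so existing path names on the unaware
side keep their meaning; only the non-ASCII characters grow.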

> While I agree with his sentiment IMHO we have a long way to go
> before that becomes critical.  We need a filesystem that'll
> support Unicode file names,

Got one.

> common applications need support for Unicode input/output etc.

Wrote an Xterm, have a 1M(!) 14 point font.  Barely ROMable.  8-).

> Hmm....  Support for reading/writing of Unicode filenames
> may be required in the kernel.  How else can you deal with
> code like
> 
> 	sprintf(name, "%s.core", p->p_comm);
> 
> where p_comm points to a Unicode filename?

Precisely.  Also:

#ifdef DIAGNOSTIC
	printf("entering '%S' into cache\n", cnp->cn_cnp->pc_data);
#endif	/* DIAGNOSTIC */
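
For comparison, here is a user-space sketch of the "%s.core" case using wide
characters.  It relies on swprintf() and the %ls conversion as defined by
standard C (C99); the '%S' above is not that spelling, and a kernel printf
would need its own equivalent.  The function name make_core_name() is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical wide-character version of the "%s.core" example.
 * Returns 0 on success, -1 if the result does not fit in `name`
 * (namelen is counted in wide characters, not bytes).
 */
int
make_core_name(wchar_t *name, size_t namelen, const wchar_t *comm)
{
	int n;

	assert(name != NULL && comm != NULL);
	/* %ls is the standard C conversion for a wchar_t * argument. */
	n = swprintf(name, namelen, L"%ls.core", comm);
	return (n < 0) ? -1 : 0;
}
```

Unlike sprintf(), swprintf() takes the buffer length, so an over-long
command name is reported as an error rather than overrunning the buffer.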

> Bruce writes:
> > I think wchar_t's were made 32 bits so that they are the same as rune_t's.
> > I don't know if this is important.
> 
> I too think 16 bit is good enough. 10646 is a 32 bit
> standard but given that other than Unicode no other pages
> are populated and that Unicode supports all living and many
> (most?) dead languages and that except for scholars of dead
> languages (a tiny tiny percentage of people) no one else
> will benefit *even if* pages beyond Unicode are ever used,
> allowing for such extension now is IMHO a waste of space.
> rune_t can be made 16 bit, too.

No reason to not leave rune_t 32 bits so as to not throw out dead language
support altogether.  I'd like to play around with Egyptian hieroglyphics
and Linear B at some point (neither is supported by Unicode -- most dead
languages without modern descendants aren't).

> Printf support for wchar_t (and wchar_t *) should really be
> specified by the standards people.  If they haven't, may be
> they should be petitioned.

I agree on that.  But I think it is also being taken for granted that
storage encoding will be distinct from process encoding.  I think that
this is a *big* mistake, for reasons pointed out in other posts.  This
implies either a content-based byte order translation (which I feel is
an unacceptable performance penalty) or a specification of a storage
encoding byte order on the premise that this will go over the wire.

Which is what led me to propose network byte order in the first place.

None of this would prevent switching from 16 to 32 bit wchar_t's at some
future date, were it to be found to be desirable.
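
A minimal sketch of a fixed storage byte order, assuming 16-bit characters
stored high byte first (network order).  The names wc16_store() and
wc16_load() are hypothetical; the point is only that the on-disk or
on-the-wire form is the same regardless of the host CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store nchars 16-bit characters as big-endian (network order) pairs. */
void
wc16_store(unsigned char *dst, const uint16_t *src, size_t nchars)
{
	size_t i;

	assert(dst != NULL && src != NULL);
	for (i = 0; i < nchars; i++) {
		dst[2 * i] = (unsigned char)(src[i] >> 8);
		dst[2 * i + 1] = (unsigned char)(src[i] & 0xFF);
	}
}

/* Load nchars 16-bit characters from big-endian byte pairs. */
void
wc16_load(uint16_t *dst, const unsigned char *src, size_t nchars)
{
	size_t i;

	assert(dst != NULL && src != NULL);
	for (i = 0; i < nchars; i++)
		dst[i] = (uint16_t)((src[2 * i] << 8) | src[2 * i + 1]);
}
```

On a big-endian host both functions reduce to a plain copy; on a
little-endian host they are a fixed byte swap, with no need to inspect the
content to guess the order.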


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.