From owner-freebsd-hackers  Tue Jan 20 11:16:28 1998
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id LAA25798
          for hackers-outgoing; Tue, 20 Jan 1998 11:16:28 -0800 (PST)
          (envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id LAA25765
          for <hackers@FreeBSD.ORG>; Tue, 20 Jan 1998 11:16:11 -0800 (PST)
          (envelope-from tlambert@usr04.primenet.com)
Received: (from daemon@localhost)
	by smtp04.primenet.com (8.8.8/8.8.8) id MAA20151;
	Tue, 20 Jan 1998 12:16:02 -0700 (MST)
Received: from usr04.primenet.com(206.165.6.204)
 via SMTP by smtp04.primenet.com, id smtpd020090; Tue Jan 20 12:15:54 1998
Received: (from tlambert@localhost)
	by usr04.primenet.com (8.8.5/8.8.5) id MAA26214;
	Tue, 20 Jan 1998 12:15:41 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <199801201915.MAA26214@usr04.primenet.com>
Subject: Re: Wide characters on tcp connections
To: jfieber@indiana.edu
Date: Tue, 20 Jan 1998 19:15:41 +0000 (GMT)
Cc: akm@mother.sneaker.net.au, louie@TransSys.COM, daniel_sobral@voga.com.br,
        tlambert@primenet.com, hackers@FreeBSD.ORG
In-Reply-To: <Pine.BSF.3.96.980120101241.26398Z-100000@fallout.campusview.indiana.edu> from "John Fieber" at Jan 20, 98 10:40:47 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk

> > | If you're looking for a standard way to move multibyte characters, then
> > | choose any one of a number of encodings already used to store multibyte
> > | characters in files.
> > 
> > Moving them's not quite the same as storing them.... byte orders usually
> > come into play a lot more when you've got to shunt the data across a network.
> > 
> > I think Unicode defines that it is to be stored in network byte order.
> 
> Maybe this will clarify things a bit.  From _The Unicode Standard
> 2.0_, Section 3.1 Conformance Requirements: 
> 
> C1. A process shall interpret Unicode code values as 16-bit
>     quantities. 
> 
> C2. The Unicode Standard does not specify any order of bytes
>     inside a Unicode value.
>     
> C3. A process shall interpret a Unicode value that has been
>     serialized into a sequence of bytes, by most significant byte
>     first, in the absence of higher level protocols.
> 
> If you think of writing to a file as serializing, then C3
> applies.  If you think of it as dumping memory, then C2 applies. 
> I believe NT generally takes the C2 route. Terry, can you
> confirm this?  How about for IPC? 

For IPC, wide character strings are sent in native byte order with a
byte order indicator.  This is consistent with DCE RPC's NDR ("receiver
makes right"), and with the Microsoft bias toward Intel-centric
representation mechanisms.
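
To make the "native order plus indicator" idea concrete, here is a
minimal sketch.  The one-byte flag and the names pack_wide and
unpack_wide are hypothetical, chosen for illustration; this is not the
NT or DCE wire format.  It also assumes every character fits in one
16-bit code value.

```python
import struct
import sys

# Hypothetical indicator values; not taken from any real protocol.
BIG_ENDIAN = 0
LITTLE_ENDIAN = 1

def pack_wide(text, little_endian=(sys.byteorder == "little")):
    """Emit an indicator byte, then 16-bit code values in the sender's order."""
    units = [ord(ch) for ch in text]   # assumes each character fits in 16 bits
    prefix = "<" if little_endian else ">"
    flag = LITTLE_ENDIAN if little_endian else BIG_ENDIAN
    return bytes([flag]) + struct.pack(f"{prefix}{len(units)}H", *units)

def unpack_wide(data):
    """Read the indicator, then decode in whatever order the sender used."""
    prefix = "<" if data[0] == LITTLE_ENDIAN else ">"
    count = (len(data) - 1) // 2
    units = struct.unpack(f"{prefix}{count}H", data[1:])
    return "".join(chr(u) for u in units)
```

The sender never swaps; only a receiver whose order differs pays for
the conversion.  Without the indicator, C3 would have the receiver
assume most significant byte first.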

I believe the file I/O interfaces also expect Intel byte order in the
files, so that they do not have to rewrite their files for NT on
platforms with network byte order, as opposed to Intel byte order.

> Just as a footnote, UTF-8 is a big win for English text because
> it generally ends up 1 character == 1 byte, but is a big loss for
> CJK (among others) where 1 character == 3 bytes.  UTF-8 is no
> silver bullet for endian debates.
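
The size tradeoff is easy to check.  A small sketch, with a
hypothetical helper name (encoded_sizes) and sample strings of my own
choosing, comparing UTF-8 against 16-bit code values serialized most
significant byte first:

```python
def encoded_sizes(text):
    """Return (characters, UTF-8 bytes, UTF-16 bytes) for text."""
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-be")   # 16-bit code values, MSB first, no BOM
    return len(text), len(utf8), len(utf16)

english = "wide characters"
japanese = "\u65e5\u672c\u8a9e"        # three CJK ideographs

for label, text in (("English", english), ("Japanese", japanese)):
    chars, n8, n16 = encoded_sizes(text)
    print(f"{label}: {chars} chars, {n8} bytes UTF-8, {n16} bytes UTF-16")
```

The English sample is 15 bytes in UTF-8 against 30 in UTF-16; the CJK
sample is 9 bytes against 6.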

Any multibyte encoding is a loss for:

o	Fixed field storage
o	Forms input
o	Length-limited buffer technologies (like those in most
	modern computer languages in use today).
o	String length calculation

Etc.

8-(.
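
A sketch of the fixed field problem, assuming UTF-8 and a hypothetical
helper name (fit_field).  A field sized in bytes cannot simply be
sliced, because the cut may land inside a character:

```python
def fit_field(text, width, encoding="utf-8"):
    """Fit text into exactly `width` bytes without splitting a character."""
    raw = text.encode(encoding)[:width]
    # The byte slice may end mid-character; errors="ignore" drops that
    # trailing partial sequence instead of raising UnicodeDecodeError.
    kept = raw.decode(encoding, errors="ignore")
    return kept.encode(encoding).ljust(width, b" ")

text = "\u65e5\u672c\u8a9e"            # 3 characters, 9 bytes in UTF-8
print(len(text), len(text.encode("utf-8")))
print(fit_field(text, 8))              # room for only 2 of the 3 characters
```

An 8 byte field holds eight ASCII characters but only two of these, and
the character count is not the byte count, so every length check has to
say which one it means.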


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.