From owner-freebsd-hackers  Tue Jan 20 07:42:44 1998
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id HAA10797
          for hackers-outgoing; Tue, 20 Jan 1998 07:42:44 -0800 (PST)
          (envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from fallout.campusview.indiana.edu (fallout.campusview.indiana.edu [149.159.1.1])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id HAA10792
          for <hackers@FreeBSD.ORG>; Tue, 20 Jan 1998 07:42:38 -0800 (PST)
          (envelope-from jfieber@indiana.edu)
Received: from localhost (jfieber@localhost)
	by fallout.campusview.indiana.edu (8.8.7/8.8.7) with SMTP id KAA00299;
	Tue, 20 Jan 1998 10:40:47 -0500 (EST)
Date: Tue, 20 Jan 1998 10:40:47 -0500 (EST)
From: John Fieber <jfieber@indiana.edu>
Reply-To: John Fieber <jfieber@indiana.edu>
To: Andrew Kenneth Milton <akm@mother.sneaker.net.au>
cc: "Louis A. Mamakos" <louie@TransSys.COM>, daniel_sobral@voga.com.br,
        tlambert@primenet.com, hackers@FreeBSD.ORG
Subject: Re: Wide characters on tcp connections
In-Reply-To: <199801200415.PAA17887@mother.sneaker.net.au>
Message-ID: <Pine.BSF.3.96.980120101241.26398Z-100000@fallout.campusview.indiana.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk

On Tue, 20 Jan 1998, Andrew Kenneth Milton wrote:

> | If you're looking for a standard way to move multibyte characters, then
> | choose any one of a number of encodings already used to store multibyte
> | characters in files.
> 
> Moving them's not quite the same as storing them.... byte orders, usually
> come into play a lot more when you've got to shunt the data across a network.
> 
> I think Unicode defines that it is to be stored in network byte order.

Maybe this will clarify things a bit.  From _The Unicode Standard
2.0_, Section 3.1 Conformance Requirements: 

C1. A process shall interpret Unicode code values as 16-bit
    quantities. 

C2. The Unicode Standard does not specify any order of bytes
    inside a Unicode value.
    
C3. A process shall interpret a Unicode value that has been
    serialized into a sequence of bytes, by most significant byte
    first, in the absence of higher level protocols.

If you think of writing to a file as serializing, then C3
applies.  If you think of it as dumping memory, then C2 applies. 
I believe NT generally takes the C2 route.  Terry, can you
confirm this?  How about for IPC? 
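
To make the C2/C3 distinction concrete, here is a minimal sketch in
Python.  The function names are made up for illustration and come
from neither the standard nor any particular system; the only
library used is the stock struct module.  It packs the same 16-bit
code values once most significant byte first (the C3 reading) and
once in host byte order (roughly what dumping memory gives you).

```python
import struct
import sys

def serialize_msb_first(code_values):
    """Pack 16-bit code values most significant byte first (the C3 reading)."""
    return struct.pack(">%dH" % len(code_values), *code_values)

def dump_native(code_values):
    """Pack 16-bit code values in host byte order (a stand-in for a raw memory dump)."""
    return struct.pack("=%dH" % len(code_values), *code_values)

def deserialize_msb_first(data):
    """Read back a byte sequence produced by serialize_msb_first."""
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

text = [0x0041, 0x65E5]            # U+0041 and U+65E5 as 16-bit code values
wire = serialize_msb_first(text)   # same bytes on every host
dump = dump_native(text)           # depends on sys.byteorder
```

The serialized form can be read back on any machine without further
agreement; the native dump only round-trips if both ends share a
byte order or some higher level protocol says which one was used.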

Just as a footnote, UTF-8 is a big win for English text because
it generally ends up 1 character == 1 byte, but is a big loss for
CJK (among others) where 1 character == 3 bytes instead of the 2
bytes a 16-bit code value takes.  UTF-8 is no silver bullet for
endian debates.
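
The size trade-off is easy to check.  A small sketch, assuming
Python's built-in "utf-8" and "utf-16-be" codecs; encoded_sizes is
a hypothetical helper and the sample strings are arbitrary.

```python
def encoded_sizes(text):
    """Return (characters, UTF-8 bytes, 16-bit big-endian bytes) for text."""
    return (len(text),
            len(text.encode("utf-8")),
            len(text.encode("utf-16-be")))

english = "wide characters"       # ASCII only
cjk = "\u65e5\u672c\u8a9e"        # three CJK ideographs

english_sizes = encoded_sizes(english)
cjk_sizes = encoded_sizes(cjk)
```

For the ASCII sample UTF-8 takes half the bytes of the 16-bit form;
for the CJK sample it takes half again as many.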

-john