Date: Tue, 20 Jan 1998 10:40:47 -0500 (EST)
From: John Fieber <jfieber@indiana.edu>
To: Andrew Kenneth Milton <akm@mother.sneaker.net.au>
Cc: "Louis A. Mamakos" <louie@TransSys.COM>, daniel_sobral@voga.com.br, tlambert@primenet.com, hackers@FreeBSD.ORG
Subject: Re: Wide characters on tcp connections
Message-ID: <Pine.BSF.3.96.980120101241.26398Z-100000@fallout.campusview.indiana.edu>
In-Reply-To: <199801200415.PAA17887@mother.sneaker.net.au>
On Tue, 20 Jan 1998, Andrew Kenneth Milton wrote:
> | If you're looking for a standard way to move multibyte characters, then
> | choose any one of a number of encodings already used to store multibyte
> | characters in files.
>
> Moving them's not quite the same as storing them.... byte orders, usually
> come into play a lot more when you've got to shunt the data across a network.
>
> I think Unicode defines that it is to be stored in network byte order.
Maybe this will clarify things a bit. From _The Unicode Standard
2.0_, Section 3.1 Conformance Requirements:
C1. A process shall interpret Unicode code values as 16-bit
quantities.
C2. The Unicode Standard does not specify any order of bytes
inside a Unicode value.
C3. A process shall interpret a Unicode value that has been
serialized into a sequence of bytes, by most significant byte
first, in the absence of higher level protocols.
If you think of writing to a file as serializing, then C3
applies. If you think of it as dumping memory, then C2 applies.
I believe NT generally takes the C2 route. Terry, can you
confirm this? How about for IPC?
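
As a rough illustration of the C2/C3 distinction, here is a minimal
Python sketch. It is not taken from the standard; the function names
serialize_be and dump_native are made up for this example, and the two
sample code values are arbitrary.

```python
import struct
import sys

def serialize_be(code_values):
    """Serialize 16-bit code values most significant byte first (the C3 reading)."""
    return b"".join(struct.pack(">H", v) for v in code_values)

def dump_native(code_values):
    """Emit the same values in the host's own byte order, like a raw memory dump (the C2 reading)."""
    return b"".join(struct.pack("=H", v) for v in code_values)

values = [0x0041, 0x65E5]  # arbitrary sample code values: U+0041 and U+65E5

wire = serialize_be(values)   # identical bytes on every host
raw = dump_native(values)     # depends on sys.byteorder

print(sys.byteorder, wire.hex(), raw.hex())
```

On a little-endian host, raw comes out byte-swapped relative to wire,
so a receiver that assumes C3 would read 0x4100 and 0xE565 instead of
the intended values unless some higher level protocol says otherwise.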
Just as a footnote, UTF-8 is a big win for English text because
it generally ends up 1 character == 1 byte, but is a big loss for
CJK (among others) where 1 character == 3 bytes. UTF-8 is no
silver bullet for endian debates.
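
The size tradeoff is easy to check. The following Python sketch
compares encoded lengths for two sample strings chosen purely for
illustration; "utf-16-be" stands in here for 16-bit code values
serialized most significant byte first.

```python
# Sample strings are illustrative assumptions, not from the original post.
english = "wide characters"   # 15 ASCII characters
cjk = "日本語"                 # 3 CJK characters

def sizes(text):
    """Return (character count, UTF-8 byte count, 16-bit big-endian byte count)."""
    return (len(text), len(text.encode("utf-8")), len(text.encode("utf-16-be")))

print(sizes(english))  # (15, 15, 30)
print(sizes(cjk))      # (3, 9, 6)
```

For the English sample UTF-8 halves the size relative to 16-bit code
values; for the CJK sample it is half again as large.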
-john
