Date: Tue, 20 Jan 1998 10:40:47 -0500 (EST)
From: John Fieber <jfieber@indiana.edu>
To: Andrew Kenneth Milton <akm@mother.sneaker.net.au>
Cc: "Louis A. Mamakos" <louie@TransSys.COM>, daniel_sobral@voga.com.br, tlambert@primenet.com, hackers@FreeBSD.ORG
Subject: Re: Wide characters on tcp connections
Message-ID: <Pine.BSF.3.96.980120101241.26398Z-100000@fallout.campusview.indiana.edu>
In-Reply-To: <199801200415.PAA17887@mother.sneaker.net.au>
On Tue, 20 Jan 1998, Andrew Kenneth Milton wrote:

> | If you're looking for a standard way to move multibyte characters, then
> | choose any one of a number of encodings already used to store multibyte
> | characters in files.
>
> Moving them's not quite the same as storing them.... byte orders, usually
> come into play a lot more when you've got to shunt the data across a network.
>
> I think Unicode defines that it is to be stored in network byte order.

Maybe this will clarify things a bit. From _The Unicode Standard 2.0_,
Section 3.1 Conformance Requirements:

  C1. A process shall interpret Unicode code values as 16-bit
      quantities.

  C2. The Unicode Standard does not specify any order of bytes
      inside a Unicode value.

  C3. A process shall interpret a Unicode value that has been
      serialized into a sequence of bytes, by most significant
      byte first, in the absence of higher level protocols.

If you think of writing to a file as serializing, then C3 applies.
If you think of it as dumping memory, then C2 applies. I believe NT
generally takes the C2 route. Terry, can you confirm this? How
about for IPC?

Just as a footnote, UTF-8 is a big win for English text because it
generally ends up 1 character == 1 byte, but is a big loss for CJK
(among others) where 1 character == 3 bytes. UTF-8 is no silver
bullet for endian debates.

-john
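As a rough illustration of the C2/C3 distinction and the UTF-8 size
trade-off above, here is a minimal Python sketch. It is not from the
original message. The function names (serialize_msb_first,
dump_native_order) are hypothetical, and it assumes every character
fits in a single 16-bit code value, as C1 describes.

```python
import struct


def serialize_msb_first(text):
    """Serialize 16-bit code values most significant byte first (the C3 reading)."""
    out = bytearray()
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            # Assumption of this sketch: only 16-bit code values are handled.
            raise ValueError("code value does not fit in 16 bits")
        out += struct.pack(">H", code)  # ">" forces big-endian (network) order
    return bytes(out)


def dump_native_order(text):
    """Mimic a raw memory dump: 16-bit values in the host's byte order (the C2 reading)."""
    out = bytearray()
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            raise ValueError("code value does not fit in 16 bits")
        out += struct.pack("=H", code)  # "=" uses native byte order, standard size
    return bytes(out)


if __name__ == "__main__":
    english = "abc"
    cjk = "\u65e5\u672c\u8a9e"  # three CJK ideographs

    # Same bytes on every host, so safe to put on the wire.
    print(serialize_msb_first(english).hex())

    # May differ between little-endian and big-endian hosts.
    print(dump_native_order(english).hex())

    # Size comparison: UTF-8 versus two bytes per 16-bit code value.
    for label, s in (("English", english), ("CJK", cjk)):
        print(label, len(s.encode("utf-8")), len(serialize_msb_first(s)))
```

The msb-first output is identical on any host, which is what makes it
usable on a TCP connection without a higher level protocol. The native
dump only round-trips between machines that share a byte order.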