Date: Tue, 20 Jan 1998 10:40:47 -0500 (EST)
From: John Fieber
Reply-To: John Fieber
To: Andrew Kenneth Milton
cc: "Louis A. Mamakos", daniel_sobral@voga.com.br, tlambert@primenet.com, hackers@FreeBSD.ORG
Subject: Re: Wide characters on tcp connections
In-Reply-To: <199801200415.PAA17887@mother.sneaker.net.au>
Sender: owner-freebsd-hackers@FreeBSD.ORG

On Tue, 20 Jan 1998, Andrew Kenneth Milton wrote:

> | If you're looking for a standard way to move multibyte characters, then
> | choose any one of a number of encodings already used to store multibyte
> | characters in files.
>
> Moving them's not quite the same as storing them.... byte orders, usually
> come into play a lot more when you've got to shunt the data across a network.
>
> I think Unicode defines that it is to be stored in network byte order.

Maybe this will clarify things a bit.  From _The Unicode Standard 2.0_,
Section 3.1 Conformance Requirements:

  C1. A process shall interpret Unicode code values as 16-bit quantities.

  C2. The Unicode Standard does not specify any order of bytes inside a
      Unicode value.

  C3. A process shall interpret a Unicode value that has been serialized
      into a sequence of bytes, by most significant byte first, in the
      absence of higher level protocols.
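To make the C2/C3 distinction concrete, here is a minimal sketch in Python. The function names and the two sample code values are my own, chosen only for illustration; nothing here comes from the standard itself. One function serializes 16-bit code values most significant byte first, as C3 describes, and the other writes them in host byte order, which is what a raw memory dump would give you.

```python
import struct
import sys

def serialize_c3(code_values):
    """Serialize 16-bit code values most significant byte first (C3)."""
    return struct.pack(">%dH" % len(code_values), *code_values)

def deserialize_c3(data):
    """Read back a byte sequence that was written most significant byte first."""
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

def dump_native(code_values):
    """Write the values in host byte order, as a raw memory dump would."""
    return struct.pack("=%dH" % len(code_values), *code_values)

# Hypothetical sample: U+0041 ('A') and U+65E5, both 16-bit code values.
values = [0x0041, 0x65E5]

wire = serialize_c3(values)   # same bytes on every host
raw = dump_native(values)     # depends on sys.byteorder

print(sys.byteorder, wire.hex(), raw.hex())
```

On a little-endian host the two byte strings differ, so a reader that assumes C3 will misread a memory dump unless some higher level protocol tells it the order.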
If you think of writing to a file as serializing, then C3 applies.  If
you think of it as dumping memory, then C2 applies.  I believe NT
generally takes the C2 route.  Terry, can you confirm this?  How about
for IPC?

Just as a footnote, UTF-8 is a big win for English text because it
generally ends up 1 character == 1 byte, but is a big loss for CJK
(among others) where 1 character == 3 bytes, against 2 bytes in the
16-bit form.  UTF-8 has no byte order to argue about, but given that
size cost it is no silver bullet for endian debates.

-john
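P.S. A rough check of the UTF-8 size footnote, as a sketch only. The helper name and the two sample strings are my own, and I am using Python's "utf-16-be" codec as a stand-in for the 16-bit form serialized most significant byte first.

```python
def encoded_sizes(text):
    """Return (characters, UTF-8 bytes, 16-bit big-endian bytes) for text."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-be")),
    )

# Hypothetical samples: plain ASCII text and three CJK characters.
english = "wide characters"
cjk = "\u65e5\u672c\u8a9e"

for label, sample in (("english", english), ("cjk", cjk)):
    chars, utf8_bytes, utf16_bytes = encoded_sizes(sample)
    print(label, chars, utf8_bytes, utf16_bytes)
```

For the ASCII sample UTF-8 takes one byte per character and the 16-bit form takes two; for the CJK sample UTF-8 takes three bytes per character and the 16-bit form still takes two.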