From: Terry Lambert
Message-Id: <199801201915.MAA26214@usr04.primenet.com>
Subject: Re: Wide characters on tcp connections
To: jfieber@indiana.edu
Date: Tue, 20 Jan 1998 19:15:41 +0000 (GMT)
Cc: akm@mother.sneaker.net.au, louie@TransSys.COM, daniel_sobral@voga.com.br, tlambert@primenet.com, hackers@FreeBSD.ORG
In-Reply-To: from "John Fieber" at Jan 20, 98 10:40:47 am

> > | If you're looking for a standard way to move multibyte characters, then
> > | choose any one of a number of encodings already used to store multibyte
> > | characters in files.
> >
> > Moving them's not quite the same as storing them.... byte orders, usually
> > come into play a lot more when you've got to shunt the data across a network.
> >
> > I think Unicode defines that it is to be stored in network byte order.
>
> Maybe this will clarify things a bit.  From _The Unicode Standard
> 2.0_, Section 3.1 Conformance Requirements:
>
>   C1.
>       A process shall interpret Unicode code values as 16-bit
>       quantities.
>
>   C2. The Unicode Standard does not specify any order of bytes
>       inside a Unicode value.
>
>   C3. A process shall interpret a Unicode value that has been
>       serialized into a sequence of bytes, by most significant byte
>       first, in the absence of higher level protocols.
>
> If you think of writing to a file as serializing, then C3
> applies.  If you think of it as dumping memory, then C2 applies.
> I believe NT generally takes the C2 route.  Terry, can you
> confirm this?  How about for IPC?

For wide character strings in IPC, the strings are sent in native
byte order with a byte order indicator.  This is consistent with DCE
RPC's NDR, and with the Microsoft bias toward Intel-centric
representation mechanisms.

I believe the file I/O interfaces also expect Intel byte order in the
files, so that they do not have to rewrite their files for NT on
platforms that use network byte order rather than Intel byte order.

> Just as a footnote, UTF-8 is a big win for English text because
> it generally ends up 1 character == 1 byte, but is a big loss for
> CJK (among others) where 1 character == 3 bytes.  UTF-8 is no
> silver bullet for endian debates.

Any multibyte encoding is a loss for:

o	Fixed field storage
o	Forms input
o	Length-limited buffer technologies (like those in most
	modern computer languages in use today).
o	String length calculation

Etc.  8-(.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
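
Illustration: the two points discussed in this message (sending 16-bit
code values in the sender's native byte order behind a byte order
indicator, and the per-character cost of UTF-8) can be sketched in
Python roughly as follows.  This is only a sketch under stated
assumptions, not how NT or any RPC runtime does it: the function names
send_wide, recv_wide and utf8_len are hypothetical, U+FEFF is used as
the indicator, the text is assumed to contain only 16-bit code values,
and a stream without an indicator is read most significant byte first,
as in C3.

```python
import struct
import sys

BOM = 0xFEFF  # byte order indicator; shows up as 0xFFFE when read byte-swapped


def send_wide(text, byteorder=sys.byteorder):
    """Serialize 16-bit code values in the given byte order, indicator first.

    Assumption: every character fits in 16 bits; struct.pack raises
    struct.error otherwise.
    """
    fmt = "<" if byteorder == "little" else ">"
    units = [BOM] + [ord(ch) for ch in text]
    return struct.pack(f"{fmt}{len(units)}H", *units)


def recv_wide(data):
    """Decode a stream produced by send_wide on either kind of machine.

    With no indicator present, the bytes are taken most significant
    byte first (the C3 default).
    """
    n = len(data) // 2
    units = struct.unpack(f">{n}H", data[: 2 * n])
    if units and units[0] == 0xFFFE:
        # Swapped indicator: the sender was little-endian, so re-read.
        units = struct.unpack(f"<{n}H", data[: 2 * n])
    if units and units[0] == BOM:
        units = units[1:]  # drop the indicator itself
    return "".join(chr(u) for u in units)


def utf8_len(text):
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))
```

With these definitions, recv_wide(send_wide("abc", "little")) and
recv_wide(send_wide("abc", "big")) both give back "abc", while
utf8_len("hello") is 5 and utf8_len("\u65e5\u672c\u8a9e") (three CJK
characters) is 9, which is the 1 byte versus 3 bytes per character
difference mentioned in the quoted footnote.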