From owner-freebsd-hackers  Wed Jan 21 02:23:36 1998
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id CAA11306
          for hackers-outgoing; Wed, 21 Jan 1998 02:23:36 -0800 (PST)
          (envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from itesec.hsc.fr (root@itesec.hsc.fr [192.70.106.33])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id CAA11291
          for <hackers@freebsd.org>; Wed, 21 Jan 1998 02:23:22 -0800 (PST)
          (envelope-from pb@hsc.fr)
Received: from mars.hsc.fr (pb@mars.hsc.fr [192.70.106.44])
	by itesec.hsc.fr (8.8.8/8.8.5/itesec-1.10-nospam) with ESMTP id KAA29968;
	Wed, 21 Jan 1998 10:33:56 +0100 (MET)
Received: (from pb@localhost)
	by mars.hsc.fr (8.8.5/8.8.5/pb-19970301) id KAA14368;
	Wed, 21 Jan 1998 10:33:55 +0100 (MET)
Message-ID: <19980121103354.EB02816@mars.hsc.fr>
Date: Wed, 21 Jan 1998 10:33:54 +0100
From: Pierre.Beyssac@hsc.fr (Pierre Beyssac)
To: tlambert@primenet.com (Terry Lambert)
Cc: Pierre.Beyssac@hsc.fr (Pierre Beyssac), louie@TransSys.COM,
        daniel_sobral@voga.com.br, hackers@FreeBSD.ORG
Subject: Re: Wide characters on tcp connections
References: <19980120120216.OB37901@mars.hsc.fr> <199801202118.OAA27310@usr06.primenet.com>
X-Mailer: Mutt 0.59.1e
Mime-Version: 1.0
In-Reply-To: <199801202118.OAA27310@usr06.primenet.com>; from Terry Lambert on Jan 20, 1998 21:18:36 +0000
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk

According to Terry Lambert:
[ UTF-8 ]
> It will take up to 3 bytes to resync, since it can take up to 5
> bytes to represent a single 16 bit value.

I assume you mean 32 bit? I think (I don't have the draft handy) it's
a little more complicated than that because, if I remember correctly,
there are "collisions" between prefix codes and multibyte encodings.
But that's the idea.
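
To make the resync idea concrete, here is a minimal sketch. It assumes
the UTF-8 bit layout in which continuation bytes always look like
10xxxxxx, so a reader dropped at an arbitrary offset can skip forward
until it sees a byte that is not a continuation byte. The names
is_continuation and resync are made up for this illustration.

```python
def is_continuation(byte):
    """True if the byte has the form 10xxxxxx (a UTF-8 continuation byte)."""
    return (byte & 0xC0) == 0x80


def resync(buf, pos):
    """Return the first offset >= pos that is not inside a multibyte sequence.

    Hypothetical helper: it only skips continuation bytes; it does not
    validate that the sequence starting at the returned offset is well formed.
    """
    while pos < len(buf) and is_continuation(buf[pos]):
        pos += 1
    return pos


# 'a', e-acute (2 bytes), euro sign (3 bytes), 'b'
data = "a\u00e9\u20acb".encode("utf-8")

# Offset 4 is in the middle of the 3-byte euro sign; resync skips to 'b'.
start = resync(data, 4)
print(start, data[start:].decode("utf-8"))
```

The number of bytes skipped is bounded by the longest sequence the
encoding allows minus one, which is where the "up to N bytes to resync"
figure comes from.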

> This assumes you are willing to push an arbitrary number of bytes
> to get a 16 bit value to the other end of the pipe, and that you are
> willing to take the computational overhead of the conversion,

Yes, but you have to take a computational overhead anyway, even
with fixed width characters, if you are to convert to network
byte order.
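
As a rough illustration of that per-character cost, the sketch below
assumes 32 bit fixed width characters sent big-endian (network byte
order). The function names to_wire and from_wire are hypothetical, not
part of any existing protocol or library.

```python
import struct


def to_wire(text):
    """Pack each character as a 32-bit unsigned integer in network byte order."""
    return b"".join(struct.pack("!I", ord(ch)) for ch in text)


def from_wire(data):
    """Inverse of to_wire: unpack 32-bit big-endian values back into characters."""
    return "".join(chr(cp) for (cp,) in struct.iter_unpack("!I", data))


wire = to_wire("Hi")
print(wire)             # every character costs one conversion and four bytes
print(from_wire(wire))
```

On a big-endian host the conversion is a no-op, but on a little-endian
one each character has to be byte-swapped on both ends.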

> and
> that you are willing to treat your values as a stream instead of
> an external data representation of a structure (i.e., you are willing
> to give up being able to tell the other end to expect a certain number
> of bytes in a transaction).

In the case of a telnet connection or mainly ASCII transfer, this makes
sense: I certainly don't feel like I'm ready to take a fourfold
performance loss due to wider characters :-)
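
The fourfold figure is easy to check with a small sketch. It assumes
the fixed width alternative is 32 bits per character; with 16 bit
characters the ratio for ASCII text would be twofold instead. The
helper name wire_sizes is made up for this example.

```python
def wire_sizes(text):
    """Return (utf8_bytes, fixed32_bytes) needed to send the given text."""
    utf8_len = len(text.encode("utf-8"))
    fixed32_len = len(text.encode("utf-32-be"))  # 4 bytes per character, no BOM
    return utf8_len, fixed32_len


# A typical mostly-ASCII telnet line: 7 characters.
utf8_len, fixed32_len = wire_sizes("ls -l\r\n")
print(utf8_len, fixed32_len, fixed32_len // utf8_len)
```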

When putting this in a database system, you obviously don't _have_ to
use UTF-8 internally; that's purely an implementation issue.

Now I agree using UTF-8 in RPCs can be difficult, but after all isn't
the RPC layer supposed to hide exactly these kinds of things from
the application programmer?
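
One way such a layer could hide the variable width, sketched under my
own assumptions: prefix the UTF-8 payload with its length in bytes, so
the receiver still knows exactly how many bytes to expect. The names
pack_string and unpack_string are hypothetical and do not belong to any
real RPC library.

```python
import struct


def pack_string(text):
    """Encode text as a 4-byte big-endian byte count followed by UTF-8 bytes."""
    payload = text.encode("utf-8")
    return struct.pack("!I", len(payload)) + payload


def unpack_string(data):
    """Decode one length-prefixed string; return (text, remaining_bytes)."""
    (length,) = struct.unpack("!I", data[:4])
    payload = data[4:4 + length]
    return payload.decode("utf-8"), data[4 + length:]


message = pack_string("h\u00e9") + pack_string("world")
first, rest = unpack_string(message)
second, rest = unpack_string(rest)
print(first, second, rest)
```

The application only ever sees whole strings; the byte count on the
wire is the layer's business, not the caller's.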

> The people who like UTF encoding are the people who've already had
> their mail forwarded to Hell,

I'm quite sure you mean X400 :-). Don't worry about me, I'm not a
UTF-8 specialist, not a UTF-8 user and even less a UTF-8 advocate
(not to mention I hate X400).

I was just pointing out that it would be silly to reinvent the wheel
only to come up with something similar to UTF-8.
-- 
Pierre.Beyssac@hsc.fr