Date:      Wed, 21 Jan 1998 23:01:35 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        Pierre.Beyssac@hsc.fr (Pierre Beyssac)
Cc:        tlambert@primenet.com, Pierre.Beyssac@hsc.fr, louie@TransSys.COM, daniel_sobral@voga.com.br, hackers@FreeBSD.ORG
Subject:   Re: Wide characters on tcp connections
Message-ID:  <199801212301.QAA10692@usr09.primenet.com>
In-Reply-To: <19980121103354.EB02816@mars.hsc.fr> from "Pierre Beyssac" at Jan 21, 98 10:33:54 am

> [ UTF-8 ]
> > It will take up to 3 bytes to resync, since it can take up to 5
> > bytes to represent a single 16 bit value.
> 
> I assume you mean 32 bit? I think (I don't have the draft handy) it's
> a little more complicated than that, because, if I remember
> correctly, there are "collisions" between prefix codes and multibyte
> encodings. But that's the idea.

Yes, you're right; a UCS-4 value, not a UCS-2 value.  UCS-2 values
will take 1-2 bytes to resync.

0x00000000 - 0x0000007F:
0xxxxxxx

0x00000080 - 0x000007FF:
110xxxxx 10xxxxxx

0x00000800 - 0x0000FFFF:
1110xxxx 10xxxxxx 10xxxxxx

0x00010000 - 0x001FFFFF:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

0x00200000 - 0x03FFFFFF:
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

0x04000000 - 0x7FFFFFFF:
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
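
As a minimal sketch of the table above (not code from this thread), the
following C encodes one UCS-4 value and shows how a reader that lands in
the middle of a character can resync by skipping 10xxxxxx continuation
bytes.  The names utf8_encode() and utf8_resync() are hypothetical.  The
sketch follows the 1-6 byte form listed here; the later UTF-8 definition
(RFC 3629) stops at 4 bytes, and this code does not enforce that limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one UCS-4 value per the table above (hypothetical helper).
 * Writes at most 6 bytes to out; returns the number of bytes written,
 * or 0 if the value does not fit in 31 bits.
 */
size_t utf8_encode(uint32_t c, unsigned char *out)
{
    /* Lead-byte prefix, indexed by total sequence length. */
    static const unsigned char lead[] =
        { 0x00, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    size_t n, i;

    assert(out != NULL);

    if (c <= 0x7F)            n = 1;
    else if (c <= 0x7FF)      n = 2;
    else if (c <= 0xFFFF)     n = 3;
    else if (c <= 0x1FFFFF)   n = 4;
    else if (c <= 0x3FFFFFF)  n = 5;
    else if (c <= 0x7FFFFFFF) n = 6;
    else return 0;

    /* Fill continuation bytes from the end, 6 payload bits each. */
    for (i = n - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (c & 0x3F));
        c >>= 6;
    }
    /* Whatever bits remain go into the lead byte. */
    out[0] = (unsigned char)(lead[n] | c);
    return n;
}

/*
 * Return how many bytes to skip at the start of buf before the next
 * byte that can begin a character (anything that is not 10xxxxxx).
 */
size_t utf8_resync(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len && (buf[i] & 0xC0) == 0x80)
        i++;
    return i;
}
```

A value in the UCS-2 range encodes to at most three bytes, so a reader
dropped into the middle of one skips at most two continuation bytes,
which matches the 1-2 figure above.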

> > This assumes you are willing to push an arbitrary number of bytes
> > to get a 16 bit value to the other end of the pipe, and that you are
> > willing to take the computational overhead of the conversion,
> 
> Yes, but you have to take a computational overhead anyway, even
> with fixed width characters, if you are to convert to network
> byte order.

Which is why you send in host byte order; when the target machine
gets the data, it takes the hit only if its byte order differs from
yours.  That's why DCE/RPC sends in host order, which may or may
not be network order.
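
A rough sketch of that "receiver takes the hit" idea, assuming the
sender tags each message with its own byte order.  The tag values and
the names host_order() and receiver_makes_right() are hypothetical;
this is not the DCE/RPC wire format, only an illustration of the
principle for 16 bit characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte-order tag carried alongside the data. */
enum { ORDER_BIG = 0, ORDER_LITTLE = 1 };

/* Detect this machine's byte order at run time. */
static int host_order(void)
{
    const uint16_t probe = 1;

    return *(const unsigned char *)&probe ? ORDER_LITTLE : ORDER_BIG;
}

static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/*
 * Receiver side: convert n 16 bit characters in place, but only when
 * the sender's order differs from ours.  Returns the number of values
 * swapped, so 0 means the data was usable as received.
 */
size_t receiver_makes_right(uint16_t *chars, size_t n, int sender_order)
{
    size_t i;

    assert(chars != NULL || n == 0);

    if (sender_order == host_order())
        return 0;               /* same order: no conversion cost */
    for (i = 0; i < n; i++)
        chars[i] = swap16(chars[i]);
    return n;
}
```

Two machines with the same byte order never pay for a swap, whereas
always converting to network order costs both ends whenever they are
both little-endian.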

> > and
> > that you are willing to treat your values as a stream instead of
> > an external data representation of a structure (i.e., you are willing
> > to give up being able to tell the other end to expect a certain number
> > of bytes in a transaction).
> 
> In the case of a telnet connection or mainly ASCII transfer, this makes
> sense: I certainly don't feel like I'm ready to take a fourfold
> performance loss due to wider characters :-)

It's twofold if you encode as UCS-2 instead of UTF-ing it.  Also,
you don't need to content-transfer encode as UCS-2.  The point is
that you are able to round-trip the data in and out.

The biggest hit will be application data size, and data dictionary
size.

For Chinese/Japanese documents, UTF-8 is asking them to send 1.5
times as much data as they would otherwise (three bytes per character
instead of two).
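
The arithmetic behind that ratio can be checked with a small sketch.
The function name utf8_size_of_ucs2() is hypothetical; it only counts
bytes using the first three rows of the table earlier in this message
and does not handle surrogate pairs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return how many bytes the n UCS-2 values in s would occupy as UTF-8.
 * The same text stored as UCS-2 occupies 2 * n bytes.
 */
size_t utf8_size_of_ucs2(const uint16_t *s, size_t n)
{
    size_t i, bytes = 0;

    assert(s != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (s[i] <= 0x7F)
            bytes += 1;         /* ASCII: half the size of UCS-2 */
        else if (s[i] <= 0x7FF)
            bytes += 2;         /* same size as UCS-2 */
        else
            bytes += 3;         /* CJK and most of the rest: 1.5x */
    }
    return bytes;
}
```

For the three ideographs U+65E5 U+672C U+8A9E this gives 9 bytes
against 6 for UCS-2, while three ASCII letters give 3 bytes against 6,
which is the trade-off being argued about in this thread.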

Personally, I'd prefer that all applications be capable of being
localized, and I'm willing to pay the penalty for it.

You should also remember that storage encoding is not necessarily
utilization encoding.  You could, for example, store UTF-8 data in
the FS, but expose it as UCS-2, or in a given locale character set.
Clearly, NFS doesn't support a Unicode namespace; you would need
to round-trip it before exposing it; for example, an NTFS exposed
as an NFS mount.


Of course, if you do this "under the covers", a locale character
set is your best choice, and then you can internally attribute the
file as "stored in XXX locale form".  With a linear 1->2 2->1
conversion, you could still support mmap()'ing in applications,
and applications are supposed to internally use wchar_t.  So an
exposure via mmap() is sort of antithetical to UTF-8 as well.

For FTP... well... "set UTF" comes to mind.  Seems a poor choice
of a "compression algorithm", though.  ;-).


> When putting this in a database system, you obviously don't _have_ to
> use UTF-8 internally, that's purely an implementation issue.
> 
> Now I agree using UTF-8 in RPCs can be difficult, but after all isn't
> the RPC layer supposed to hide exactly these kinds of things from
> the application programmer?

Depends.  What does "MAXPATHLEN" mean?  8-) 8-).

> I was just pointing out that it would be silly to reinvent the wheel
> if that's to come up with something similar to UTF-8.

Well, it would certainly be silly to come up with something similar
to UTF-8... for example, UTF-8.  ;-).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


