Date:      Wed, 24 Dec 1997 16:53:50 +0000 (GMT)
From:      John Sullivan <johns@chiark.greenend.org.uk>
To:        Stefan Esser <se@freebsd.org>
Cc:        freebsd-current@freebsd.org, The Hermit Hacker <scrappy@hub.org>
Subject:   Re: Wine Emulator Patch...
Message-ID:  <Pine.LNX.3.96.971224152325.28787A-100000@chiark.greenend.org.uk>
In-Reply-To: <19971224110019.23782@mi.uni-koeln.de>

On Wed, 24 Dec 1997, Stefan Esser wrote:
>On 1997-12-23 17:53 -0500, The Hermit Hacker <scrappy@hub.org> wrote:
>> around this deficiency (as did I when I built it onto my computer), but I
>> got into a discussion with the developers about this in the newsgroup, and
>> have been informed that this is, in fact, wrong :(

>Well, and you trust that information ? :)

Oh I see - my opinion can't possibly be right because I expressed it on
USENET. ;)

>If you are working in an 8bit locale, then there
>is no problem.
>
>If somebody is working in a non-8bit locale, and 
>there is a problem, please let me know!

Well the problem as I see it is that the functions in question were being
used to support Windows' Unicode conversion/manipulation functions (a
16-bit encoding of a 16-bit character set). Unless you like lots of
1-character truncated strings, there most definitely *is* a problem with
using standard MBCS routines to implement these. 
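To illustrate the truncation, here is a minimal sketch, assuming a plain 16-bit unit array as a stand-in for a Windows Unicode string. The helper names ucs2_len and bytes_seen_by_strlen are made up for this example and are not Wine functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: count 16-bit units up to the terminating 0x0000. */
static size_t ucs2_len(const uint16_t *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}

/*
 * Hypothetical helper: what a byte-oriented routine sees when it is
 * handed the same buffer. For characters below 0x100 one byte of every
 * 16-bit unit is zero, so the scan stops after at most one byte.
 */
static size_t bytes_seen_by_strlen(const uint16_t *s)
{
    assert(s != NULL);
    return strlen((const char *)s);
}
```

For the units 'H', 'i', 0 the first helper reports two characters, while the byte-oriented view stops after one byte on a little-endian machine and after none on a big-endian one.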

Even without the character set difference (locale defined vs. UCS2), a
wide character string is *not* the same as an MBCS encoded string, ever,
under any (8 or >8 bit) locale. A single wide character may or may not be
the same as some ordering of the component MBCS bytes joined together, but
who knows? 

>> 	My argument was weak to start off with, in that I didn't believe
>> that anything other than Linux had this, and that putting wctype.h as part
>> of the distribution made it more Linux-only...except that other OSs
>> (Solaris, AIX, etc) do have a wctype.h file, so why are we missing it?

For the record, this is a comment from the Linux/GNU libc header:

/*
 *      ISO/IEC 9899:1990/Amendment 1:1995 7.15:
 *      Wide-character classification and mapping utilities  <wctype.h>
 */

>Perhaps, because some of the FreeBSD developers 
>already spent a lot of time with support for 
>wide character locales ? ;-)

Eh? I'm not sure I understand what you're implying here.

>> >       If you go into misc/lstr.c with vi, do a search/replace of:
>> >
>> >       tow -> to
>> >       isw -> is

>> Well, it probably will compile, but the two sets of functions are
>> *supposed* to be different.

>Supposed ???
>Don't think so. Why can't the simple is__() / to__() 
>function get it right ?

Because MBCS functions assume an array of bytes, where a single character
is represented by a possibly variable number (>=1) of bytes. Wide
character functions assume an array of fixed size elements each
representing a single character. The two sets are rarely interchangeable.
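A rough sketch of the difference, using the standard mbrlen() and wcslen() calls. The function names mb_char_count and wide_char_count are invented for illustration, and the code runs in whatever locale the program has selected (the default "C" locale unless setlocale() is called).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: count characters in a multibyte string. Each
 * character may occupy one or more bytes, so the string has to be
 * walked with mbrlen(). Returns (size_t)-1 on an invalid or
 * incomplete sequence.
 */
static size_t mb_char_count(const char *s)
{
    mbstate_t st;
    size_t left, count = 0;

    assert(s != NULL);
    memset(&st, 0, sizeof st);
    left = strlen(s);
    while (left > 0) {
        size_t n = mbrlen(s, left, &st);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;
        if (n == 0)
            break;
        s += n;
        left -= n;
        count++;
    }
    return count;
}

/*
 * Hypothetical helper: in a wide string every element is one
 * character, so the element count is the character count.
 */
static size_t wide_char_count(const wchar_t *ws)
{
    assert(ws != NULL);
    return wcslen(ws);
}
```

For plain ASCII input both give the same answer. In a multibyte locale the byte length and the character count of the first form diverge, which is why the is__()/to__() calls cannot simply be pointed at individual bytes.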

It turns out now that I don't believe the tow__() versions are entirely
suitable either, but with a 16 bit wchar_t they will do the right thing
more often than the MBCS versions.

>Sure. And just check out what FreeBSD has in
>/usr/include/ctype.h (simplified, I'm using 
>tolower here, others are similar) :

>There exist two versions of that function,
>one (compiled without XPG4) in libc, the 
>other (compiled with XPG4) in libxpg4. We
>only need to add -lxpg4 to the linker 
>command line, and the full multi-byte range
>should be supported. I'll try this on my 
>system, later today, and will then commit 
>the patch to the Wine port.

Hmm. How do you get the value to pass to tolower()? I guess if you can
assume SBCS, tolower(*str) will work. If you're working in an MBCS locale
you don't know how many bytes to extract from the string, so the only
sensible thing to do is tolower(mbtowc()), but that's wrong - given that
you know the locale is MBCS you can't necessarily assume tolower will do
the right thing - you need towlower.
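Something along these lines is what I have in mind for the MBCS case: a sketch only, using the standard mbrtowc(), towlower() and wcrtomb() calls. The name mb_lower and its return convention are assumptions made for this example, not anything taken from Wine or FreeBSD.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: lower-case a multibyte string in the current
 * locale. Each character is decoded with mbrtowc(), mapped with
 * towlower(), and re-encoded with wcrtomb().
 * Returns 0 on success, -1 on an invalid sequence or if out is too small.
 */
static int mb_lower(const char *in, char *out, size_t outsz)
{
    mbstate_t ist, ost;
    size_t left, used = 0;

    assert(in != NULL && out != NULL);
    memset(&ist, 0, sizeof ist);
    memset(&ost, 0, sizeof ost);
    left = strlen(in);
    while (left > 0) {
        wchar_t wc;
        char buf[MB_LEN_MAX];
        size_t n = mbrtowc(&wc, in, left, &ist);
        size_t m;

        if (n == (size_t)-1 || n == (size_t)-2 || n == 0)
            return -1;
        m = wcrtomb(buf, (wchar_t)towlower((wint_t)wc), &ost);
        if (m == (size_t)-1 || used + m >= outsz)
            return -1;              /* keep room for the terminator */
        memcpy(out + used, buf, m);
        used += m;
        in += n;
        left -= n;
    }
    if (used >= outsz)
        return -1;
    out[used] = '\0';
    return 0;
}
```

Note that this only covers strings in the locale's own encoding. It says nothing about UCS2 input, which is the separate problem described next.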

Passing in a UCS2 character will appear to work a lot of the time. If your
locale is latin-1, you'll get most of the 8-bit characters right. Under
any other locale you'll at least get the 7-bit characters right. There are
an awful lot of characters in various locales, though, that have >8-bit
encodings in Unicode that you'll get wrong, and not even all of latin-1's
code points are the same as UCS2. 

Do you work under a latin-1 locale by any chance? (I usually do btw.)

>Well, it's there in ctype.h ...
>I don't see, why another header is required. Even if 
>they stick with the towlower() call (most probably 
>because tolower() can't deal with wide characters),
>those definitions could have gone into ctype.h ...

Oh, no real reason as far as I can tell. But then, no system ever needs
more than 1 header file. Stick it *all* in stdio.h! The standard probably
says these functions can be found in wctype.h, so that's really where they
should go.

>FreeBSD already contains a rather complete set of 
>string functions on wide and multi-byte characters.
>See "man 3 multibyte" or "man 3 mbrune" for more
>information.

I don't dispute this. You can put mappings from tow__() to the BSD
equivalents into wctype.h and all should work fine, yesno?

>I guess we should get some of the Asian developers
>to test Wine with Chinese/Japanese/Korean versions
>of Windows ...

Good idea. When it comes down to it, there's not much point in arguing
implementations under locales (such as latin-1) which don't really present
any challenge to the conversion routines. It's important to get this right,
though, because the less frequently seen locales do exist.

John
-- 
i built it up now i take it apart climbed up real high now fall down real far
no need for me to stay the last thing left i just threw it away
i put my faith in god and my trust in you
now there's nothing more fucked up i could do
<p><a href="file:///dev/null">:-p</a>



