From owner-freebsd-hackers  Mon Oct 16 18:20:25 1995
Return-Path: owner-hackers
Received: (from root@localhost)
          by freefall.freebsd.org (8.6.12/8.6.6) id SAA02449
          for hackers-outgoing; Mon, 16 Oct 1995 18:20:25 -0700
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211])
          by freefall.freebsd.org (8.6.12/8.6.6) with ESMTP id SAA02442
          for <hackers@freefall.freebsd.org>; Mon, 16 Oct 1995 18:20:22 -0700
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id SAA25982; Mon, 16 Oct 1995 18:15:16 -0700
From: Terry Lambert <terry@lambert.org>
Message-Id: <199510170115.SAA25982@phaeton.artisoft.com>
Subject: Re: A couple problems in FreeBSD 2.1.0-950922-SNAP
To: ache@astral.msk.su (=?KOI8-R?Q?=E1=CE=C4=D2=C5=CA_=FE=C5=D2=CE=CF=D7?=)
Date: Mon, 16 Oct 1995 18:15:15 -0700 (MST)
Cc: terry@lambert.org, hackers@freefall.freebsd.org,
        joerg_wunsch@uriah.heep.sax.de, kaleb@x.org
In-Reply-To: <YlwbkWmKE0@ache.dialup.demos.ru> from "=?KOI8-R?Q?=E1=CE=C4=D2=C5=CA_=FE=C5=D2=CE=CF=D7?=" at Oct 17, 95 02:23:38 am
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length: 6415      
Sender: owner-hackers@FreeBSD.org
Precedence: bulk

> >Well, fix the C locale's undefined behaviour to be the same as the defined
> >8859-1 behaviour.  Problem solved.
> 
> It seems that you miss the point here. Most harmful are macros such
> as isprint(), islower()/isupper(), isalpha(), ispunct(), etc.
> all of them are different for various 8bit charsets, f.e.
> isalpha(8859-1) != isalpha(KOI8-R).
> If you stuck with one particular version, f.e. 8859-1, is*()
> functions will return incorrect values for any other charset used
> screwing your screen and keyboard input. I.e. 8859-1 toupper() can
> produce very strange char for KOI8-R input. Or 8859-1 checking
> input for ispunct() can allow very strange KOI8-R chars sneak in.
> Or 8859-1 isalpha() for output can print very strange chars
> for KOI8-R, etc. Don't forget, I use KOI8-R only for example,
> you can find some 8859-* font to substitute instead of this name.

I can *potentially* see ispunct() (though I can't think of any
concrete examples off the top of my head; maybe in -9?), and the
collating sequence is a problem.

But this is a problem regardless.  If the code isn't internationalized,
it isn't internationalized, and anything you do to pretend it is without
actually fixing the code is a kludge.

The correct thing to do is to call setlocale() in the source.  You could,
if you wanted a "quick fix", use setlocale(,""), per your crt0.o hack.
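To make the explicit call concrete, here is a minimal sketch.  The helper
name init_locale() is hypothetical, and passing LC_ALL with an empty
locale string is my assumption about what the "quick fix" would look
like; the fallback to "C" is just one reasonable policy.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper: adopt the locale named by the user's environment
 * (LANG, LC_*).  If that locale is not installed, fall back to the
 * portable "C" locale.  Returns the name of the locale now in effect.
 */
const char *init_locale(void)
{
    const char *name = setlocale(LC_ALL, "");

    if (name == NULL) {
        /* The environment asked for a locale we cannot provide. */
        name = setlocale(LC_ALL, "C");
    }
    assert(name != NULL);   /* the "C" locale is always available */
    return name;
}
```

A program would call init_locale() once, at the top of main(), before
any is*(), strftime() or strcoll() use.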

> >Fix the C locale, not the crt0.o.  Then, as time permits, fix the locale
> >unaware code.
> 
> What do you mean by fixing C locale exactly?

Make it act like an 8859-x locale.  That means 8859-1, with the exception
of collation sequence and (I still don't have an example for this one)
ispunct().

If you care about collation sequence, then you'll internationalize your
code.

> >As long as the characters are passed through unadulterated, there is
> >no difference for n == 1 and n != 1 in the non-setlocale() called case,
> >which is the issue.  If the damn thing wasn't being called and the
> >C locale were correctly defined for "undefined" code points, then there
> >would not be a problem.
> 
> What you mean by unaltered? They are unaltered, but they belongs
> to different classes in different charsets, real separator is
> is*() functions.

Unadulterated.  Unchanged by the interface without the knowledge of the
user or his explicit approval of the change.

The difference is that the 0x40-0x5f,0x60-0x7f changes for case conversion
are universally applicable across all 8859-x sets.  Only certain rarely
used aspects of the default locale are affected, and those would require
explicit use of setlocale() to operate correctly in any case.
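As an illustration of "unadulterated", here is a sketch of a case
conversion that touches only the ASCII letters and passes every other
byte, including the whole upper half, through unchanged.  The name
ascii_upcase() is hypothetical, and the code assumes an ASCII-based
execution character set.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: upper-case the ASCII letters a-z in place.
 * Every other byte value, including 0x80-0xff, is left exactly as it
 * was, so data in any 8859-x (or other 8 bit) set survives intact.
 */
void ascii_upcase(unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        if (buf[i] >= 'a' && buf[i] <= 'z')
            buf[i] = (unsigned char)(buf[i] - ('a' - 'A'));
    }
}
```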

> >Calling "setlocale()" for an otherwise non-internationalized program is
> >a big mistake, and just compounds the C locale mistake.  Correct the
> >right code.
> 
> BTW, when C program is known 8bit clean, what I and my users
> want from FreeBSD is proper interaction with russian language.

Then use 8859-5 character encoding.  The only deficiency re: KOI8 is
that it doesn't match existing data you already have on disk.

Or explicitly call setlocale().  If the code is in fact 8 bit clean, then
very little is left that needs to be done to make it internationalized,
at least in the XPG/3 sense (runic encoding was introduced in XPG/4).

> It means that
> 1) all is*() macros must be correct for russian charset (LC_CTYPE).

This will work for 8859-5.  Characters that are completely bogus will
fail, but they'd fail anyway.  Don't mix locales on the same storage
media or go to Unicode name storage and the problem will go away.

Or explicitly call setlocale(), as recommended in the X/Open Portability
Guide.

> 2) strftime must return national data (LC_TIME).

Explicitly call setlocale().
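A small sketch of what that means for strftime(): the output of "%A"
follows whatever LC_TIME the program has selected, and stays English
("Monday" and so on) in the "C" locale.  The wrapper name
weekday_name() is made up for this example.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical wrapper: write the full weekday name for *tm into buf,
 * using the current LC_TIME locale.  Returns the number of characters
 * written, or 0 if buf is too small.  "%A" consults only tm_wday.
 */
size_t weekday_name(const struct tm *tm, char *buf, size_t size)
{
    assert(tm != NULL && buf != NULL);
    return strftime(buf, size, "%A", tm);
}
```

A program that wants national day names would call
setlocale(LC_TIME, "") (or LC_ALL) first; without that call it always
gets the "C" locale names.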

> 3) National sorting must works (LC_COLLATE).

Explicitly call setlocale().  Your sorts probably aren't using locale
information anyway if you aren't calling setlocale(), so nothing has
really changed between your hack and the non-hack (standards conformant)
case on this one.
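To show what "using locale information" means for a sort, here is a
sketch of a qsort() comparator built on strcoll().  The function names
are hypothetical.  Code that compares with strcmp() never sees
LC_COLLATE at all, whatever the startup code did; code like this
follows the locale the program selected, and degenerates to strcmp()
ordering in the "C" locale.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Comparator for qsort(): order two strings by the LC_COLLATE rules. */
static int cmp_collate(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    return strcoll(*sa, *sb);
}

/* Hypothetical helper: sort an array of n strings in collation order. */
void sort_collated(const char **v, size_t n)
{
    assert(v != NULL || n == 0);
    if (n > 1)
        qsort(v, n, sizeof *v, cmp_collate);
}
```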

> Now all that goals are reached by 'setenv ENABLE_STARTUP_LOCALE'
> and without any program modifications. It is especially essential when
> program isn't FreeBSD native but comes from 3rd party, i.e.
> ports area. Moreover, they can be reached on any remote system
> too, includes freefall f.e.

There is an implied program modification of main (as opposed to _main).

The correct way to make a program locale sensitive is to change its code
so that it is locale sensitive.

> The same words are true for 8859-1 users too, not only for KOI8-R
> users.

KOI8 is a peculiar locale in that it doesn't follow the 8859-x rules
like it should.  Like EBCDIC, it needs to die in the long term.  On
the other hand, if you desperately need to be able to use it, even
given its implicit limitations, then you can do so, provided you use
locale-aware code.


> Maybe this functionality isn't kosher but you even can't imagine how
> it is useful.
> 
> If you know "proper way" to do things and keeps this goals non-broken too,
> I am all ears.


This whole issue is very similar to the problems that were involved in
going to an unmapped page 0, causing NULL dereferences to SIGSEGV.  In
the short term, you lost functionality because you couldn't run some
programs you used to be able to run.

In the locale case, you lose the ability to run 8 bit clean code as if
it had been properly internationalized, while making other code plain
miserable to use.

Without the implied setlocale() call in crt0.o, there is an immediate
benefit of ~1.1M of disk in static binaries (from Kaleb's numbers), and
the code that isn't internationalized becomes readily apparent.  Just
as the code that dereferenced NULL became readily apparent when page 0
was unmapped.


Setting an "undefined" equality with 8859-1 preserves 8 bit clean
operability in the majority of cases, and in the others, the only
way that they could have been able to get the functionality was to
have partially internationalized their code (you can't get at the
altered collation sequence without some knowledge of internationalization
implicit in the code).


The net effect is that more code gets internationalized correctly, which
is in everyone's best interests and increases the code portability instead
of tying the users to FreeBSD.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.