From owner-freebsd-hackers Mon Oct 16 18:20:25 1995
Return-Path: owner-hackers
Received: (from root@localhost)
          by freefall.freebsd.org (8.6.12/8.6.6) id SAA02449
          for hackers-outgoing; Mon, 16 Oct 1995 18:20:25 -0700
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211])
          by freefall.freebsd.org (8.6.12/8.6.6) with ESMTP id SAA02442
          for ; Mon, 16 Oct 1995 18:20:22 -0700
Received: (from terry@localhost)
          by phaeton.artisoft.com (8.6.11/8.6.9) id SAA25982;
          Mon, 16 Oct 1995 18:15:16 -0700
From: Terry Lambert
Message-Id: <199510170115.SAA25982@phaeton.artisoft.com>
Subject: Re: A couple problems in FreeBSD 2.1.0-950922-SNAP
To: ache@astral.msk.su (=?KOI8-R?Q?=E1=CE=C4=D2=C5=CA_=FE=C5=D2=CE=CF=D7?=)
Date: Mon, 16 Oct 1995 18:15:15 -0700 (MST)
Cc: terry@lambert.org, hackers@freefall.freebsd.org,
    joerg_wunsch@uriah.heep.sax.de, kaleb@x.org
In-Reply-To: from "=?KOI8-R?Q?=E1=CE=C4=D2=C5=CA_=FE=C5=D2=CE=CF=D7?=" at Oct 17, 95 02:23:38 am
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length: 6415
Sender: owner-hackers@FreeBSD.org
Precedence: bulk

> >Well, fix the C locale's undefined behaviour to be the same as the defined
> >8859-1 behaviour.  Problem solved.
>
> It seems that you miss the point here. Most harmful are macros such
> as isprint(), islower()/isupper(), isalpha(), ispunct(), etc.;
> all of them are different for various 8bit charsets, e.g.
> isalpha(8859-1) != isalpha(KOI8-R).
> If you stick with one particular version, e.g. 8859-1, is*()
> functions will return incorrect values for any other charset used,
> screwing your screen and keyboard input. I.e. 8859-1 toupper() can
> produce very strange char for KOI8-R input. Or 8859-1 checking
> input for ispunct() can allow very strange KOI8-R chars to sneak in.
> Or 8859-1 isalpha() for output can print very strange chars
> for KOI8-R, etc.
> Don't forget, I use KOI8-R only for example,
> you can find some 8859-* font to substitute instead of this name.

I can *potentially* see ispunct() (though I can't think of any concrete
examples off the top of my head; maybe in 8859-9?), and the collating
sequence is a problem.

But this is a problem regardless.  If the code isn't internationalized,
it isn't internationalized, and anything you do to pretend it is without
actually fixing the code is a kludge.

The correct thing to do is to call setlocale() in the source.  You could,
if you wanted a "quick fix", use setlocale(,""), per your crt0.o hack.

> >Fix the C locale, not the crt0.o.  Then, as time permits, fix the locale
> >unaware code.
>
> What do you mean by fixing C locale exactly?

Make it act like an 8859-x locale.  That means 8859-1, with the exception
of collation sequence and (I still don't have an example for this one)
ispunct().

If you care about collation sequence, then you'll internationalize your
code.

> >As long as the characters are passed through unadulterated, there is
> >no difference for n == 1 and n != 1 in the non-setlocale() called case,
> >which is the issue.  If the damn thing wasn't being called and the
> >C locale were correctly defined for "undefined" code points, then there
> >would not be a problem.
>
> What do you mean by unaltered? They are unaltered, but they belong
> to different classes in different charsets; the real separator is
> the is*() functions.

Unadulterated.  Unchanged by the interface without the knowledge of the
user or his explicit approval of the change.

The difference is that the 0x40-0x5f, 0x60-0x7f changes for case conversion
are universally applicable across all 8859-x sets.  Only certain rarely
used aspects of the default locale are affected, and those would require
explicit use of setlocale() to operate correctly in any case.

> >Calling "setlocale()" for an otherwise non-internationalized program is
> >a big mistake, and just compounds the C locale mistake.  Correct the
> >right code.
>
> BTW, when a C program is known 8bit clean, what I and my users
> want from FreeBSD is proper interaction with the Russian language.

Then use the 8859-5 character encoding.  The only deficiency re: KOI8 is
that it doesn't match existing data you already have on disk.

Or explicitly call setlocale().  If the code is in fact 8 bit clean, then
very little is left that needs to be done to make it internationalized,
at least in the XPG/3 sense (runic encoding was introduced in XPG/4).

> It means that
> 1) all is*() macros must be correct for russian charset (LC_CTYPE).

This will work for 8859-5.  Characters that are completely bogus will
fail, but they'd fail anyway.  Don't mix locales on the same storage
media, or go to Unicode name storage, and the problem will go away.

Or explicitly call setlocale(), as recommended in the X/Open Portability
Guide.

> 2) strftime must return national data (LC_TIME).

Explicitly call setlocale().

> 3) National sorting must work (LC_COLLATE).

Explicitly call setlocale().  Your sorts probably aren't using locale
information anyway if you aren't calling setlocale(), so nothing has
really changed between your hack and the non-hack (standards conformant)
case on this one.

> Now all those goals are reached by 'setenv ENABLE_STARTUP_LOCALE'
> and without any program modifications. It is especially essential when
> the program isn't FreeBSD native but comes from a 3rd party, i.e.
> the ports area. Moreover, they can be reached on any remote system
> too, including freefall, for example.

There is an implied program modification of main (as opposed to _main).

The correct way to make a program locale sensitive is to change its
code so that it is locale sensitive.

> The same words are true for 8859-1 users too, not only for KOI8-R
> users.

KOI8 is a peculiar locale in that it doesn't follow the 8859-x rules
like it should.  Like EBCDIC, it needs to die in the long term.
On the other hand, if you desperately need to be able to use it, even
given its implicit limitations, then you can do so.  If you use locale
aware code.

> Maybe this functionality isn't kosher but you can't even imagine how
> useful it is.
>
> If you know the "proper way" to do things that keeps these goals
> non-broken too, I am all ears.

This whole issue is very similar to the problems that were involved in
going to an unmapped page 0, causing NULL dereferences to SIGSEGV.  In
the short term, you lost functionality because you couldn't run some
programs you used to be able to run.

In the locale case, you lose the ability to run 8 bit clean code as if
it had been properly internationalized, while making other code plain
miserable to use.

Without the implied setlocale() call in crt0.o, there is an immediate
benefit of ~1.1M of disk in static binaries (from Kaleb's numbers),
and the code that isn't internationalized becomes readily apparent.
Just as the code that dereferenced NULL became readily apparent when
page 0 was unmapped.

Setting an "undefined" equality with 8859-1 preserves 8 bit clean
operability in the majority of cases, and in the others, the only way
that they could have been able to get the functionality was to have
partially internationalized their code (you can't get at the altered
collation sequence without some knowledge of internationalization
implicit in the code).

The net effect is that more code gets internationalized correctly, which
is in everyone's best interests and increases the code portability instead
of tying the users to FreeBSD.


                                        Terry Lambert
                                        terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.