Date:      Fri, 20 Aug 2021 09:09:03 -0600
From:      Warner Losh <imp@bsdimp.com>
To:        Helge Oldach <freebsd@oldach.net>
Cc:        Stefan Eßer <se@freebsd.org>,  FreeBSD Stable ML <stable@freebsd.org>
Subject:   Re: Confusion with grep & locale?
Message-ID:  <CANCZdfqE4Oz5_NkUJma_eqA60AiOLXFd5h1dq%2BUyOz0t3qM73g@mail.gmail.com>
In-Reply-To: <202108201417.17KEHt0w022450@nuc.oldach.net>
References:  <fbb028fa-19f4-60b2-24e9-549961c3f92f@freebsd.org> <202108201417.17KEHt0w022450@nuc.oldach.net>


On Fri, Aug 20, 2021 at 8:19 AM Helge Oldach <freebsd@oldach.net> wrote:

> Stefan Esser wrote on Fri, 20 Aug 2021 14:47:11 +0200 (CEST):
> > On 20.08.21 at 11:03, Helge Oldach wrote:
> > But POSIX makes no guarantees for locales other than POSIX or C.
>
> OK, thanks for the explanation. That clarifies a lot for me. Although
> it's not really POLA. :-)
>
> Thanks a lot also to Stefan Ehmann for the pointer to gawk oddities.
>
> > > # export LANG=en_US.ISO8859-1
> > > # (echo bla; echo Bla) | grep '[A-Z]'
> > > bla
> > > Bla
> >
> > This one is unexpected, the upper case should be a range of its own
> > and should not include any lower case letters.
> >
> > > # export LANG=en_US.UTF-8
> > > # (echo bla; echo Bla) | grep '[A-Z]'
> > > Bla
> >
> > Here I had expected the result you got with en_US.ISO8859-1 ...
>
> > Definitely a bug in the definition of the collating sequences.
> >
> > And I have just verified that de_DE.ISO8859-1 wrongly considers "ö"
> > to be within [a-z], while de_DE.UTF-8 does not (but should).
> >
> > Seems that the correct collating sequences for ISO8859-1 and UTF-8 are
> > each assigned to the other one.
>
> PR 257972 raised.
>

I've looked at that, and I don't think it's a bug: POSIX leaves the
behavior of range expressions such as [A-Z] undefined in any locale other
than POSIX (C).


> > > There is nothing special in the environment, specifically no LC_xxx nor
> > > MM_CHARSET in either case.
> >
> > LANG defines LC_COLLATE, unless overridden.
>
> Indeed. I just explicitly mentioned *no* LC_xxx to clarify that it's not
> overridden. :-)
>
> > BTW, character classes work for your examples and more:
>
> Certainly they do. But they are harder to type... :-)
>

I think that a range like A-Za-z is undefined outside the C locale, but a
character class such as [[:alpha:]] is well defined. Most shell scripts
use the 'C' locale for this very reason.
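
For illustration, here is a minimal sketch of both approaches, assuming a
POSIX sh and grep. The variable names are made up for this example. The
first pipeline pins the C locale for grep alone, so [A-Z] means exactly the
ASCII upper-case letters; the second relies on a character class instead of
a range.

```shell
# Pin the C locale for grep only, so the range [A-Z] is well defined
# (the 26 ASCII upper-case letters) regardless of what LANG is set to.
c_range=$(printf 'bla\nBla\n' | LC_ALL=C grep '[A-Z]')

# Use a character class instead of a range; this does not depend on the
# collating sequence of the current locale.
class_match=$(printf 'bla\nBla\n' | grep '[[:upper:]]')

echo "$c_range"
echo "$class_match"
```

Setting LC_ALL on the command itself keeps the rest of the script in the
user's locale; a script that wants this everywhere could instead export
LC_ALL=C near the top.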

Warner
