Date: Fri, 20 Aug 2021 09:09:03 -0600
From: Warner Losh <imp@bsdimp.com>
To: Helge Oldach <freebsd@oldach.net>
Cc: Stefan Eßer <se@freebsd.org>, FreeBSD Stable ML <stable@freebsd.org>
Subject: Re: Confusion with grep & locale?
Message-ID: <CANCZdfqE4Oz5_NkUJma_eqA60AiOLXFd5h1dq%2BUyOz0t3qM73g@mail.gmail.com>
In-Reply-To: <202108201417.17KEHt0w022450@nuc.oldach.net>
References: <fbb028fa-19f4-60b2-24e9-549961c3f92f@freebsd.org>
 <202108201417.17KEHt0w022450@nuc.oldach.net>

On Fri, Aug 20, 2021 at 8:19 AM Helge Oldach <freebsd@oldach.net> wrote:

> Stefan Esser wrote on Fri, 20 Aug 2021 14:47:11 +0200 (CEST):
> > On 20.08.21 at 11:03, Helge Oldach wrote:
> > But POSIX makes no guarantees for locales other than POSIX or C.
>
> OK, thanks for the explanation. That clarifies a lot for me. Although
> it's not really POLA. :-)
>
> Thanks a lot also to Stefan Ehmann for the pointer to gawk oddities.
>
> > > # export LANG=en_US.ISO8859-1
> > > # (echo bla; echo Bla) | grep '[A-Z]'
> > > bla
> > > Bla
> >
> > This one is unexpected, the upper case should be a range of its own
> > and should not include any lower case letters.
> >
> > > # export LANG=en_US.UTF-8
> > > # (echo bla; echo Bla) | grep '[A-Z]'
> > > Bla
> >
> > Here I had expected the result you got with en_US.ISO8859-1 ...
> >
> > Definitely a bug in the definition of the collating sequences.
> >
> > And I have just verified that de_DE.ISO8859-1 wrongly considers "ö"
> > to be within [a-z], while de_DE.UTF-8 does not (but should).
> >
> > Seems that the correct collating sequences for ISO8859-1 and UTF-8 are
> > each assigned to the other one.
>
> PR 257972 raised.
>

I've looked at that, and I don't think it's a bug, since POSIX says the
behavior of range expressions is undefined in locales other than POSIX or C.

> > > There is nothing special in the environment, specifically no LC_xxx nor
> > > MM_CHARSET in either case.
> >
> > LANG defines LC_COLLATE, unless overridden.
>
> Indeed. I just explicitly mentioned *no* LC_xxx to clarify that it's not
> overridden. :-)
>
> > BTW, character classes work for your examples and more:
>
> Certainly they do. But they are harder to type... :-)
>

I think that A-Za-z is undefined, but a character class such as [:alpha:]
is well defined. Most shell scripts use the 'C' locale for this very reason.

Warner
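
For illustration, here is a minimal sketch of the two workarounds discussed in this thread, reusing the test input from the quoted examples. It assumes a POSIX sh and grep; the variable names range_c and class_c are made up for this example. LC_ALL takes precedence over both LANG and LC_COLLATE, so setting it to C for a single command makes '[A-Z]' a plain range over the ASCII uppercase letters. The class [[:upper:]] names the set of uppercase letters directly and does not depend on collation order.

```shell
#!/bin/sh
# Force the C locale for this one pipeline so that [A-Z] covers
# only the ASCII letters 'A' through 'Z', whatever LANG is set to.
range_c=$( (echo bla; echo Bla) | LC_ALL=C grep '[A-Z]' )

# A POSIX character class asks for uppercase letters by name
# instead of relying on a collation-dependent range.
class_c=$( (echo bla; echo Bla) | LC_ALL=C grep '[[:upper:]]' )

echo "range: $range_c"
echo "class: $class_c"
```

In the C locale both commands should print only "Bla". The character class form is also the one to try under other locales, where the membership of [A-Z] is not guaranteed.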
