Date:      Thu, 20 Apr 2023 11:08:29 +0100
From:      Jamie Landeg-Jones <jamie@catflap.org>
To:        yuri@aetern.org, phk@phk.freebsd.dk, delphij@gmail.com
Cc:        current@FreeBSD.org
Subject:   Re: find(1): I18N gone wild ?
Message-ID:  <202304201008.33KA8TpX077655@donotpassgo.dyslexicfish.net>
In-Reply-To: <CAGMYy3tz6iCU_tiE6NHoVPdXOZGtP%2BfskWMrLXyev8SR=xRSqQ@mail.gmail.com>
References:  <202304172106.33HL6RUX051407@critter.freebsd.dk> <CAGMYy3tz6iCU_tiE6NHoVPdXOZGtP%2BfskWMrLXyev8SR=xRSqQ@mail.gmail.com>

Xin LI <delphij@gmail.com> wrote:

> This is expected behavior (in en_US.UTF-8 the ordering is AaBb, not ABab).
> You might want to set LC_COLLATE to C if C behavior is desirable.
>
> On Mon, Apr 17, 2023 at 2:06 PM Poul-Henning Kamp <phk@phk.freebsd.dk>
> wrote:
>
> > This surprised me:
> >
> >         # mkdir /tmp/P
> >         # cd /tmp/P
> >         # touch FOO
> >         # touch bar
> >         # env LANG=C.UTF-8 find . -name '[A-Z]*' -print
> >         ./FOO
> >         # env LANG=en_US.UTF-8 find . -name '[A-Z]*' -print
> >         ./FOO
> >         ./bar
> >
> > Really ?!

TL;DR Fix find(1) so it works as you expected. It's "legal" to do so.

Not quite expected behaviour. It used to be, but the behaviour is now
officially undefined (as mentioned in the section that Yuri quoted).

When locale-based collation first came in, there were numerous issues
like this, which led POSIX to change it to undefined. (My guess is that
it had been one way for too long for them to specifically redefine it,
so "undefined" it became.)

However, "undefined" also covers the original way of doing things,
and because so many things break unexpectedly otherwise, many
applications now treat such ranges as they did pre-locales.
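
To make the two readings of '[A-Z]' concrete, here is a rough Python
sketch. It is not how find(1) or any libc implements matching. The
collation_key() function is a hypothetical, simplified stand-in for
the AaBb... ordering that Xin LI mentioned for en_US.UTF-8, and it
only models ASCII letters.

```python
def in_range_codepoint(ch, lo, hi):
    """Traditional pre-locale reading: compare raw code points."""
    return lo <= ch <= hi


def collation_key(ch):
    # Hypothetical model of an "AaBb...Zz" ordering: sort by letter
    # first, then put the upper-case form before the lower-case one.
    return (ch.lower(), ch.islower())


def in_range_collated(ch, lo, hi):
    """Locale-style reading: compare positions in the collation order."""
    return collation_key(lo) <= collation_key(ch) <= collation_key(hi)


names = ["FOO", "bar"]

# Which names start with a character inside the range A..Z?
traditional = [n for n in names if in_range_codepoint(n[0], "A", "Z")]
collated = [n for n in names if in_range_collated(n[0], "A", "Z")]

print(traditional)  # ['FOO']
print(collated)     # ['FOO', 'bar']
```

Under this toy ordering the only ASCII letter outside A..Z is 'z',
which is why "bar" slips into the second result.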

There would therefore be nothing wrong with changing find(1) to give
the results you expected (and I hope that this becomes the de facto
standard).

For further justification, note that "awk" in base (in newer
versions at least) already gives the results you'd expect, as
"gawk" now does too.

In fact, a good summary of the situation, and of why the gawk owner
reverted the code to treat all character ranges in the traditional
pre-locale way, is here:

https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html

Let's follow suit!

Cheers, Jamie
