Date:      Fri, 21 Apr 2023 20:18:21 +0200
From:      Yuri <yuri@aetern.org>
To:        Current FreeBSD <freebsd-current@freebsd.org>
Subject:   Re: find(1): I18N gone wild ?
Message-ID:  <f3d84036-b36e-d0d7-874a-51872a4ea572@aetern.org>
In-Reply-To: <3e473603-f384-f176-e7cb-03409e16ec9c@aetern.org>
References:  <E427B1B8-22E0-47C0-BF47-0C4F1D5F962F.ref@yahoo.com> <E427B1B8-22E0-47C0-BF47-0C4F1D5F962F@yahoo.com> <3e473603-f384-f176-e7cb-03409e16ec9c@aetern.org>

Yuri wrote:
> Mark Millard wrote:
>> Dimitry Andric <dim_at_FreeBSD.org> wrote on
>> Date: Fri, 21 Apr 2023 10:38:05 UTC :
>>
>>> On 21 Apr 2023, at 12:01, Ronald Klop <ronald-lists@klop.ws> wrote:
>>>> From: Poul-Henning Kamp <phk@phk.freebsd.dk>
>>>> Date: Monday, 17 April 2023 23:06
>>>> To: current@freebsd.org
>>>> Subject: find(1): I18N gone wild ?
>>>> This surprised me:
>>>>
>>>> # mkdir /tmp/P
>>>> # cd /tmp/P
>>>> # touch FOO
>>>> # touch bar
>>>> # env LANG=C.UTF-8 find . -name '[A-Z]*' -print
>>>> ./FOO
>>>> # env LANG=en_US.UTF-8 find . -name '[A-Z]*' -print
>>>> ./FOO
>>>> ./bar
>>>>
>>>> Really ?!
>>> ...
>>>> My Mac and a Linux server give only ./FOO in both cases. Just my 2 cents.
>>>
>>> Same here. However, I have read that with Unicode, you should *never*
>>> use [A-Z] or [0-9], but character classes instead. That seems to give
>>> both files on macOS and Linux with [[:alpha:]]:
>>>
>>> $ LANG=en_US.UTF-8 find . -name '[[:alpha:]]*' -print
>>> ./BAR
>>> ./foo
>>>
>>> and only the lowercase file with [[:lower:]]:
>>>
>>> $ LANG=en_US.UTF-8 find . -name '[[:lower:]]*' -print
>>> ./foo
>>>
>>> But on FreeBSD, these don't work at all:
>>>
>>> $ LANG=en_US.UTF-8 find . -name '[[:alpha:]]*' -print
>>> <nothing>
>>>
>>> $ LANG=en_US.UTF-8 find . -name '[[:lower:]]*' -print
>>> <nothing>
>>>
>>> This is an interesting rabbit hole... :)
>>
>> FreeBSD:
>>
>>      -name pattern
>>              True if the last component of the pathname being examined matches
>>              pattern.  Special shell pattern matching characters (“[”, “]”,
>>              “*”, and “?”) may be used as part of pattern.  These characters
>>              may be matched explicitly by escaping them with a backslash
>>              (“\”).
>>
>> I conclude that [[:alpha:]] and [[:lower:]] were not
>> considered "Special shell pattern"s. "man glob"
>> indicates it is a shell specific builtin.
>>
>> macOS says similarly. Different shells, different
>> pattern notations and capabilities? Well, "man bash"
>> reports:
> [snip]
>> Seems like: pick your shell (as shown by echo $SHELL) and
>> that picks the pattern match rules used. (May be controllable
>> in the specific shell.)
> 
> No, the pattern is not passed to shell and shell used should not matter
> (pattern should be properly escaped).  The rules are here:
> 
> https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_13
> 
> ...which in turn refers to the following link for bracket expressions:
> 
> https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05
> 
> Why we don't support all of that is a different story.

A bit more on this: the first link applies to both find(1) and fnmatch(3),
and find uses fnmatch() internally (which is good). However, the function
that processes bracket expressions is called rangematch(), and ranges are
really all it handles; it ignores the other bracket expression rules, such
as character classes like [:alpha:]:

https://cgit.freebsd.org/src/tree/lib/libc/gen/fnmatch.c#n234

So to "fix" find, we just need to implement bracket expressions properly
in fnmatch().


