Date:      Fri, 21 Apr 2023 10:41:45 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        Dimitry Andric <dim@FreeBSD.org>, Current FreeBSD <freebsd-current@freebsd.org>
Subject:   Re: find(1): I18N gone wild ?
Message-ID:  <E427B1B8-22E0-47C0-BF47-0C4F1D5F962F@yahoo.com>
References:  <E427B1B8-22E0-47C0-BF47-0C4F1D5F962F.ref@yahoo.com>

Dimitry Andric <dim_at_FreeBSD.org> wrote on
Date: Fri, 21 Apr 2023 10:38:05 UTC :

> On 21 Apr 2023, at 12:01, Ronald Klop <ronald-lists@klop.ws> wrote:
> > From: Poul-Henning Kamp <phk@phk.freebsd.dk>
> > Date: Monday, 17 April 2023 23:06
> > To: current@freebsd.org
> > Subject: find(1): I18N gone wild ?
> > This surprised me:
> >
> > # mkdir /tmp/P
> > # cd /tmp/P
> > # touch FOO
> > # touch bar
> > # env LANG=C.UTF-8 find . -name '[A-Z]*' -print
> > ./FOO
> > # env LANG=en_US.UTF-8 find . -name '[A-Z]*' -print
> > ./FOO
> > ./bar
> >
> > Really ?!
> ...
> > My Mac and a Linux server only give ./FOO in both cases. Just a 2 cents remark.
>
> Same here. However, I have read that with unicode, you should *never*
> use [A-Z] or [0-9], but character classes instead. That seems to give
> both files on macOS and Linux with [[:alpha:]]:
>
> $ LANG=en_US.UTF-8 find . -name '[[:alpha:]]*' -print
> ./BAR
> ./foo
>
> and only the lowercase file with [[:lower:]]:
>
> $ LANG=en_US.UTF-8 find . -name '[[:lower:]]*' -print
> ./foo
>
> But on FreeBSD, these don't work at all:
>
> $ LANG=en_US.UTF-8 find . -name '[[:alpha:]]*' -print
> <nothing>
>
> $ LANG=en_US.UTF-8 find . -name '[[:lower:]]*' -print
> <nothing>
>
> This is an interesting rabbit hole... :)
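
For anyone repeating the comparison on their own system, a minimal
sketch of the test above follows. It assumes a POSIX sh, a mktemp(1)
that accepts a template, and that the named locales are installed (an
uninstalled locale typically falls back to C without complaint).
match_names is a hypothetical helper name, and LC_ALL is used in place
of the thread's LANG so that an inherited LC_ALL cannot mask the
setting. No expected output is given for the UTF-8 cases, since that is
exactly what differs between systems.

```shell
#!/bin/sh
# Sketch: repeat the thread's find(1) test in a scratch directory.
# match_names is a hypothetical helper, not part of find(1).
match_names() {
    # $1 = locale name, $2 = pattern handed to find -name
    dir=$(mktemp -d "${TMPDIR:-/tmp}/findtest.XXXXXX") || return 1
    touch "$dir/FOO" "$dir/bar"
    # LC_ALL overrides LANG and every LC_* category for this one command.
    # The pattern is quoted, so find receives it unexpanded.
    (cd "$dir" && env LC_ALL="$1" find . -name "$2" -print | sort)
    rm -rf "$dir"
}

match_names C '[A-Z]*'
match_names en_US.UTF-8 '[A-Z]*'
match_names en_US.UTF-8 '[[:alpha:]]*'
match_names en_US.UTF-8 '[[:lower:]]*'
```

Running the same script on FreeBSD, macOS, and Linux should make the
differences reported above easy to compare side by side.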

FreeBSD:

     -name pattern
             True if the last component of the pathname being examined matches
             pattern.  Special shell pattern matching characters (“[”, “]”,
             “*”, and “?”) may be used as part of pattern.  These characters
             may be matched explicitly by escaping them with a backslash
             (“\”).

I conclude that [[:alpha:]] and [[:lower:]] are not counted
among the "Special shell pattern matching characters" there.
"man glob" indicates that glob is a shell-specific builtin.

The macOS man page says much the same. Do different shells
have different pattern notations and capabilities? Well,
"man bash" reports:

QUOTE
      Pattern Matching

        . . .
              Within [ and ], character classes can be specified using the
              syntax [:class:], where class is one of the following classes
              defined in the POSIX standard:
              alnum alpha ascii blank cntrl digit graph lower print punct
              space upper word xdigit
              A character class matches any character belonging to that
              class.  The word character class matches letters, digits, and
              the character _.

              Within [ and ], an equivalence class can be specified using the
              syntax [=c=], which matches all characters with the same
              collation weight (as defined by the current locale) as the
              character c.

              Within [ and ], the syntax [.symbol.] matches the collating
              symbol symbol.

END QUOTE

"man zsh" does not document patterns but:

sh-3.2$ echo $SHELL
/bin/zsh
sh-3.2$ find . -name '[[:lower:]]*' -print
./bar

% ls -Tldt /bin/*sh
-r-xr-xr-x  1 root  wheel  1326688 Feb  9 01:39:53 2023 /bin/bash
-rwxr-xr-x  2 root  wheel  1153216 Feb  9 01:39:53 2023 /bin/csh
-rwxr-xr-x  1 root  wheel   307232 Feb  9 01:39:53 2023 /bin/dash
-r-xr-xr-x  1 root  wheel  2598864 Feb  9 01:39:53 2023 /bin/ksh
-rwxr-xr-x  1 root  wheel   134000 Feb  9 01:39:53 2023 /bin/sh
-rwxr-xr-x  2 root  wheel  1153216 Feb  9 01:39:53 2023 /bin/tcsh
-rwxr-xr-x  1 root  wheel  1377616 Feb  9 01:39:53 2023 /bin/zsh

But in each of these shells, even bash, $SHELL reports the same value:

% echo $SHELL
/bin/zsh


Since "find" is not part of the kernel, its behavior may
vary across Linux distributions.
Picking one . . .

openSUSE tumbleweed:

       -name pattern
              Base of file name (the path with the leading directories removed)
              matches shell pattern pattern.  Because the leading directories
              are removed, the file names considered for a match with -name
              will never include a slash, so `-name a/b' will never match
              anything (you probably need to use -path instead).  A warning is
              issued if you try to do this, unless the environment variable
              POSIXLY_CORRECT is set.  The metacharacters (`*', `?', and `[]')
              match a `.' at the start of the base name (this is a change in
              findutils-4.2.2; see section STANDARDS CONFORMANCE below).  To
              ignore a directory and the files under it, use -prune rather
              than checking every file in the tree; see an example in the
              description of that action.  Braces are not recognised as being
              special, despite the fact that some shells including Bash imbue
              braces with a special meaning in shell patterns.  The filename
              matching is performed with the use of the fnmatch(3) library
              function.  Don't forget to enclose the pattern in quotes in
              order to protect it from expansion by the shell.
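
The last sentence of that excerpt can be checked directly. Below is a
minimal sketch, assuming a POSIX sh and a mktemp(1) that accepts a
template. show_args is a hypothetical helper that stands in for find
and just prints the arguments it was handed. What the unquoted case
expands to depends on the shell and locale in use, so no output is
claimed for it.

```shell
#!/bin/sh
# show_args is a hypothetical stand-in for find: it prints each argument
# it receives inside angle brackets, one command line per output line.
show_args() {
    printf '<%s>' "$@"
    printf '\n'
}

scratch=$(mktemp -d "${TMPDIR:-/tmp}/quote.XXXXXX") || exit 1
(
    cd "$scratch" || exit 1
    touch FOO bar
    show_args -name '[A-Z]*'   # quoted: the pattern itself is passed on
    show_args -name [A-Z]*     # unquoted: the shell expands it first
)
rm -rf "$scratch"
```

In the quoted case the pattern arrives intact and the matching is left
to whatever the called program does with it.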

"man 3 fnmatch" says:

       The fnmatch() function checks whether the string argument matches
       the pattern argument, which is a shell wildcard pattern (see glob(7)).

"man 7 glob" (not shell specific) in turn has a section on
"Character classes and internationalization" that reports:

QUOTE
. . .
. . . Therefore, POSIX extended the bracket notation greatly,
       both for wildcard patterns and for regular expressions.  In the
       above we saw three types of items that can occur in a bracket
       expression: namely (i) the negation, (ii) explicit single
       characters, and (iii) ranges.  POSIX specifies ranges in an
       internationally more useful way and adds three more types:

       (iii) Ranges X-Y comprise all characters that fall between X and Y
       (inclusive) in the current collating sequence as defined by the
       LC_COLLATE category in the current locale.

       (iv) Named character classes, like

       [:alnum:]  [:alpha:]  [:blank:]  [:cntrl:]
       [:digit:]  [:graph:]  [:lower:]  [:print:]
       [:punct:]  [:space:]  [:upper:]  [:xdigit:]

       so that one can say "[[:lower:]]" instead of "[a-z]", and have
       things work in Denmark, too, where there are three letters past 'z'
       in the alphabet.  These character classes are defined by the
       LC_CTYPE category in the current locale.

       (v) Collating symbols, like "[.ch.]" or "[.a-acute.]", where the
       string between "[." and ".]" is a collating element defined for the
       current locale.  Note that this may be a multicharacter element.

       (vi) Equivalence class expressions, like "[=a=]", where the string
       between "[=" and "=]" is any collating element from its equivalence
       class, as defined for the current locale.  For example, "[[=a=]]"
       might be equivalent to "[aáàäâ]", that is, to
       "[a[.a-acute.][.a-grave.][.a-umlaut.][.a-circumflex.]]".
END QUOTE

# file /usr/bin/sh
/usr/bin/sh: symbolic link to bash


Seems like: pick your shell (as shown by echo $SHELL) and
that picks the pattern match rules used. (May be controllable
in the specific shell.)
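
One way to see what the running shell's own matcher accepts, separately
from find, is a case statement. A minimal sketch, assuming a POSIX-style
sh that implements bracket character classes; classify_first is a
hypothetical helper name:

```shell
#!/bin/sh
# classify_first is a hypothetical helper: it asks the running shell's
# own pattern matcher (via case) about the first character of a name.
classify_first() {
    case $1 in
        [[:upper:]]*) echo upper ;;
        [[:lower:]]*) echo lower ;;
        *)            echo other ;;
    esac
}

classify_first FOO
classify_first bar
classify_first 9lives
```

This exercises only the shell's matcher, so its results can be compared
against the find results shell by shell.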

===
Mark Millard
marklmi at yahoo.com



