Date:      Fri, 21 Apr 2023 12:51:55 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        Yuri <yuri@aetern.org>, Current FreeBSD <freebsd-current@freebsd.org>
Subject:   Re: find(1): I18N gone wild ?
Message-ID:  <BB0C0C86-30A5-4C32-A59C-D5B29BAA65F2@yahoo.com>
References:  <BB0C0C86-30A5-4C32-A59C-D5B29BAA65F2.ref@yahoo.com>

Yuri <yuri_at_aetern.org> wrote on
Date: Fri, 21 Apr 2023 18:18:21 UTC :

> Yuri wrote:
> > Mark Millard wrote:
> >> Dimitry Andric <dim_at_FreeBSD.org> wrote on
> >> Date: Fri, 21 Apr 2023 10:38:05 UTC :
> >>
> >>> On 21 Apr 2023, at 12:01, Ronald Klop <ronald-lists@klop.ws> wrote:
> >>>> From: Poul-Henning Kamp <phk@phk.freebsd.dk>
> >>>> Date: Monday, 17 April 2023 23:06
> >>>> To: current@freebsd.org
> >>>> Subject: find(1): I18N gone wild ?
> >>>> This surprised me:
> >>>>
> >>>> # mkdir /tmp/P
> >>>> # cd /tmp/P
> >>>> # touch FOO
> >>>> # touch bar
> >>>> # env LANG=C.UTF-8 find . -name '[A-Z]*' -print
> >>>> ./FOO
> >>>> # env LANG=en_US.UTF-8 find . -name '[A-Z]*' -print
> >>>> ./FOO
> >>>> ./bar
> >>>>
> >>>> Really ?!
> >>> ...
> >>>> My Mac and a Linux server only give ./FOO in both cases. Just a 2
> >>>> cents remark.
> >>>
> >>> Same here. However, I have read that with unicode, you should *never*
> >>> use [A-Z] or [0-9], but character classes instead. That seems to give
> >>> both files on macOS and Linux with [[:alpha:]]:
> >>>
> >>> $ LANG=en_US.UTF-8 find . -name '[[:alpha:]]*' -print
> >>> ./BAR
> >>> ./foo
> >>>
> >>> and only the lowercase file with [[:lower:]]:
> >>>
> >>> $ LANG=en_US.UTF-8 find . -name '[[:lower:]]*' -print
> >>> ./foo
> >>>
> >>> But on FreeBSD, these don't work at all:
> >>>
> >>> $ LANG=en_US.UTF-8 find . -name '[[:alpha:]]*' -print
> >>> <nothing>
> >>>
> >>> $ LANG=en_US.UTF-8 find . -name '[[:lower:]]*' -print
> >>> <nothing>
> >>>
> >>> This is an interesting rabbit hole... :)
> >>
> >> FreeBSD:
> >>
> >> -name pattern
> >> True if the last component of the pathname being examined matches
> >> pattern. Special shell pattern matching characters (“[”, “]”,
> >> “*”, and “?”) may be used as part of pattern. These characters
> >> may be matched explicitly by escaping them with a backslash
> >> (“\”).
> >>
> >> I conclude that [[:alpha:]] and [[:lower:]] were not
> >> considered "Special shell pattern"s. "man glob"
> >> indicates it is a shell specific builtin.
> >>
> >> macOS says similarly. Different shells, different
> >> pattern notations and capabilities? Well, "man bash"
> >> reports:
> > [snip]
> >> Seems like: pick your shell (as shown by echo $SHELL) and
> >> that picks the pattern match rules used. (May be controllable
> >> in the specific shell.)
> >
> > No, the pattern is not passed to shell and shell used should not matter
> > (pattern should be properly escaped). The rules are here:
> >
> > https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_13
> >
> > ...which in turn refers to the following link for bracket expressions:
> >
> > https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05
> >
> > Why we don't support all of that is different story.
>
> A bit more on this; first link applies both to find(1) and fnmatch(3),
> and find uses fnmatch() internally (which is good), but even the
> function that processes bracket expressions is called rangematch() and
> that's really all it does ignoring other bracket expression rules:
>
> https://cgit.freebsd.org/src/tree/lib/libc/gen/fnmatch.c#n234
>
> So to "fix" find we just need to implement the bracket expressions
> properly in fnmatch().

Too bad the -name documentation does not track this
but points to shell notation.
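
For anyone who wants to compare platforms, here is a minimal sketch
of the test from the start of the thread, wrapped so that the locale
and the pattern can be varied. match_names is a hypothetical helper
name, and the use of mktemp -d and of LC_ALL (rather than LANG) to
force the locale are my assumptions, not something from the thread.
If the named locale is not installed, the utilities typically end up
in the C locale, so the output is only meaningful for locales that
exist on the system.

```shell
#!/bin/sh
# Sketch only: list the names that find(1) matches for a -name pattern
# under a given locale. match_names is a hypothetical helper.
match_names() {
    # $1 = value for LC_ALL, $2 = -name pattern
    dir=$(mktemp -d) || return 1
    (
        cd "$dir" || exit 1
        touch FOO bar
        # LC_ALL overrides LANG and the other LC_* variables for find.
        # sort is pinned to the C locale so the listing order is stable.
        LC_ALL=$1 find . -name "$2" -print | LC_ALL=C sort
    )
    rm -rf "$dir"
}

# Compare a range expression with character class expressions.
for loc in C en_US.UTF-8; do
    for pat in '[A-Z]*' '[[:upper:]]*' '[[:alpha:]]*'; do
        printf '%s %s:\n' "$loc" "$pat"
        match_names "$loc" "$pat"
    done
done
```

The results are expected to differ between systems, which is the
point of the exercise.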


The following confirms that even IEEE Std 1003.1-2001, the
standard that FreeBSD's find is documented to be based on,
already specified the notations that you reference.


FreeBSD's man page reports:

STANDARDS
     The find utility syntax is a superset of the syntax specified by the IEEE
     Std 1003.1-2001 (“POSIX.1”) standard.

     All the single character options except -H and -L as well as -amin,
     -anewer, -cmin, -cnewer, -delete, -empty, -fstype, -iname, -inum,
     -iregex, -ls, -maxdepth, -mindepth, -mmin, -not, -path, -print0, -regex,
     -sparse and all of the -B* birthtime related primaries are extensions to
     IEEE Std 1003.1-2001 (“POSIX.1”).
. . .


IEEE Std 1003.1-2001 find looks to be at:

https://pubs.opengroup.org/onlinepubs/009604499/utilities/find.html

-name pattern
     The primary shall evaluate as true if the basename of the filename
     being examined matches pattern using the pattern matching notation
     described in Pattern Matching Notation.


https://pubs.opengroup.org/onlinepubs/009604499/utilities/xcu_chap02.html#tag_02_13

[ The open bracket shall introduce a pattern bracket expression.
The description of basic regular expression bracket expressions in the
Base Definitions volume of IEEE Std 1003.1-2001, Section 9.3.5, RE
Bracket Expression shall also apply to the pattern bracket expression,


https://pubs.opengroup.org/onlinepubs/009604499/basedefs/xbd_chap09.html#tag_09_03_05

    • A character class expression shall represent the union of two sets:
        • The set of single-character collating elements whose characters
          belong to the character class, as defined in the LC_CTYPE
          category in the current locale.
        • An unspecified set of multi-character collating elements.
All character classes specified in the current locale shall be
recognized. A character class expression is expressed as a character
class name enclosed within bracket-colon ( "[:" and ":]" ) delimiters.
The following character class expressions shall be supported in all
locales:
[:alnum:] [:cntrl:] [:lower:] [:space:]
[:alpha:] [:digit:] [:print:] [:upper:]
[:blank:] [:graph:] [:punct:] [:xdigit:]
In addition, character class expressions of the form:
[:name:]
are recognized in those locales where the name keyword has been given a
charclass definition in the LC_CTYPE category.
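
Since the shell's case statement is specified to use the same
Pattern Matching Notation, the character class expressions listed
above can be tried without find at all. A minimal sketch, assuming
a POSIX sh whose case patterns implement character classes;
first_char_class is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch only: classify a name by its first character using character
# class expressions inside pattern bracket expressions.
# first_char_class is a hypothetical helper.
first_char_class() {
    case "$1" in
        [[:upper:]]*) echo upper ;;
        [[:lower:]]*) echo lower ;;
        [[:digit:]]*) echo digit ;;
        *)            echo other ;;
    esac
}

for name in FOO bar 9lives _tmp; do
    printf '%s: %s\n' "$name" "$(first_char_class "$name")"
done
```

Unlike a range such as [A-Z], the classes are defined by LC_CTYPE
membership rather than by collation order, so the result for these
names should not depend on how the locale sorts upper and lower
case.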


===
Mark Millard
marklmi at yahoo.com



