Date: Fri, 21 Apr 2023 12:51:55 -0700
From: Mark Millard <marklmi@yahoo.com>
To: Yuri <yuri@aetern.org>, Current FreeBSD <freebsd-current@freebsd.org>
Subject: Re: find(1): I18N gone wild ?
Message-ID: <BB0C0C86-30A5-4C32-A59C-D5B29BAA65F2@yahoo.com>
References: <BB0C0C86-30A5-4C32-A59C-D5B29BAA65F2.ref@yahoo.com>
Yuri <yuri_at_aetern.org> wrote on
Date: Fri, 21 Apr 2023 18:18:21 UTC :

> Yuri wrote:
> > Mark Millard wrote:
> >> Dimitry Andric <dim_at_FreeBSD.org> wrote on
> >> Date: Fri, 21 Apr 2023 10:38:05 UTC :
> >>
> >>> On 21 Apr 2023, at 12:01, Ronald Klop <ronald-lists@klop.ws> wrote:
> >>>> From: Poul-Henning Kamp <phk@phk.freebsd.dk>
> >>>> Date: Monday, 17 April 2023 23:06
> >>>> To: current@freebsd.org
> >>>> Subject: find(1): I18N gone wild ?
> >>>> This surprised me:
> >>>>
> >>>> # mkdir /tmp/P
> >>>> # cd /tmp/P
> >>>> # touch FOO
> >>>> # touch bar
> >>>> # env LANG=C.UTF-8 find . -name '[A-Z]*' -print
> >>>> ./FOO
> >>>> # env LANG=en_US.UTF-8 find . -name '[A-Z]*' -print
> >>>> ./FOO
> >>>> ./bar
> >>>>
> >>>> Really ?!
> >>> ...
> >>>> My Mac and a Linux server only give ./FOO in both cases. Just a 2 cents remark.
> >>>
> >>> Same here. However, I have read that with unicode, you should *never*
> >>> use [A-Z] or [0-9], but character classes instead. That seems to give
> >>> both files on macOS and Linux with [[:alpha:]]:
> >>>
> >>> $ LANG=en_US.UTF-8 find . -name '[[:alpha:]]*' -print
> >>> ./BAR
> >>> ./foo
> >>>
> >>> and only the lowercase file with [[:lower:]]:
> >>>
> >>> $ LANG=en_US.UTF-8 find . -name '[[:lower:]]*' -print
> >>> ./foo
> >>>
> >>> But on FreeBSD, these don't work at all:
> >>>
> >>> $ LANG=en_US.UTF-8 find . -name '[[:alpha:]]*' -print
> >>> <nothing>
> >>>
> >>> $ LANG=en_US.UTF-8 find . -name '[[:lower:]]*' -print
> >>> <nothing>
> >>>
> >>> This is an interesting rabbit hole... :)
> >>
> >> FreeBSD:
> >>
> >>      -name pattern
> >>              True if the last component of the pathname being examined matches
> >>              pattern. Special shell pattern matching characters (“[”, “]”,
> >>              “*”, and “?”) may be used as part of pattern. These characters
> >>              may be matched explicitly by escaping them with a backslash
> >>              (“\”).
> >>
> >> I conclude that [[:alpha:]] and [[:lower:]] were not
> >> considered "Special shell pattern"s. "man glob"
> >> indicates it is a shell specific builtin.
> >>
> >> macOS says similarly. Different shells, different
> >> pattern notations and capabilities? Well, "man bash"
> >> reports:
> > [snip]
> >> Seems like: pick your shell (as shown by echo $SHELL) and
> >> that picks the pattern match rules used. (May be controllable
> >> in the specific shell.)
> >
> > No, the pattern is not passed to shell and shell used should not matter
> > (pattern should be properly escaped). The rules are here:
> >
> > https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_13
> >
> > ...which in turn refers to the following link for bracket expressions:
> >
> > https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05
> >
> > Why we don't support all of that is different story.
>
> A bit more on this; first link applies both to find(1) and fnmatch(3),
> and find uses fnmatch() internally (which is good), but even the
> function that processes bracket expressions is called rangematch() and
> that's really all it does ignoring other bracket expression rules:
>
> https://cgit.freebsd.org/src/tree/lib/libc/gen/fnmatch.c#n234
>
> So to "fix" find we just need to implement the bracket expressions
> properly in fnmatch().

Too bad the -name documentation does not track this
but points to shell notation.

The following confirms that even for the IEEE Std 1003.1-2001
that FreeBSD's find is documented to be based on, the notations
that you reference were indicated. FreeBSD's man page reports:

STANDARDS
     The find utility syntax is a superset of the syntax specified by the
     IEEE Std 1003.1-2001 (“POSIX.1”) standard.

     All the single character options except -H and -L as well as -amin,
     -anewer, -cmin, -cnewer, -delete, -empty, -fstype, -iname, -inum,
     -iregex, -ls, -maxdepth, -mindepth, -mmin, -not, -path, -print0, -regex,
     -sparse and all of the -B* birthtime related primaries are extensions to
     IEEE Std 1003.1-2001 (“POSIX.1”).

. . .

IEEE Std 1003.1-2001 find looks to be at:

https://pubs.opengroup.org/onlinepubs/009604499/utilities/find.html

-name pattern
    The primary shall evaluate as true if the basename of the filename
    being examined matches pattern using the pattern matching notation
    described in Pattern Matching Notation.

https://pubs.opengroup.org/onlinepubs/009604499/utilities/xcu_chap02.html#tag_02_13

[
    The open bracket shall introduce a pattern bracket expression.

The description of basic regular expression bracket expressions in the
Base Definitions volume of IEEE Std 1003.1-2001, Section 9.3.5, RE
Bracket Expression shall also apply to the pattern bracket expression,

https://pubs.opengroup.org/onlinepubs/009604499/basedefs/xbd_chap09.html#tag_09_03_05

• A character class expression shall represent the union of two sets:
    • The set of single-character collating elements whose characters
      belong to the character class, as defined in the LC_CTYPE category
      in the current locale.
    • An unspecified set of multi-character collating elements.

All character classes specified in the current locale shall be
recognized. A character class expression is expressed as a character
class name enclosed within bracket-colon ( "[:" and ":]" ) delimiters.
The following character class expressions shall be supported in all locales:

[:alnum:]   [:cntrl:]   [:lower:]   [:space:]
[:alpha:]   [:digit:]   [:print:]   [:upper:]
[:blank:]   [:graph:]   [:punct:]   [:xdigit:]

In addition, character class expressions of the form:

[:name:]

are recognized in those locales where the name keyword has been given a
charclass definition in the LC_CTYPE category.

===
Mark Millard
marklmi at yahoo.com
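
For illustration only, here is a rough sketch of how one character class
expression inside a bracket expression could be tested, using the standard
wctype() and iswctype() functions so that the class names come from LC_CTYPE
in the current locale. This is not FreeBSD's fnmatch.c code: the helper name
class_match() and its return convention are made up for this example, and
ranges, equivalence classes and collating symbols are not handled.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper, not part of any libc: if `p` points at a character
 * class expression such as "[:alpha:]", test `ch` against that class.
 * On success, *endp (if non-NULL) is set to the position just past ":]".
 * Returns 1 on match, 0 on no match, and -1 if `p` does not start a
 * character class expression known to the current locale.
 */
static int
class_match(const char *p, wint_t ch, const char **endp)
{
    char name[32];
    const char *close;
    size_t len;
    wctype_t type;

    assert(p != NULL);
    if (p[0] != '[' || p[1] != ':')
        return -1;
    close = strstr(p + 2, ":]");
    if (close == NULL)
        return -1;
    len = (size_t)(close - (p + 2));
    if (len == 0 || len >= sizeof(name))
        return -1;
    memcpy(name, p + 2, len);
    name[len] = '\0';

    /* wctype() looks the class name up in LC_CTYPE; 0 means unknown. */
    type = wctype(name);
    if (type == (wctype_t)0)
        return -1;
    if (endp != NULL)
        *endp = close + 2;
    return iswctype(ch, type) ? 1 : 0;
}
```

A bracket-expression matcher could presumably call something like this when
it sees "[:" while scanning, and fall back to its existing range handling
when -1 comes back. For example, class_match("[:lower:]", L'f', NULL) gives 1
and class_match("[:lower:]", L'F', NULL) gives 0.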