Date: Fri, 9 Jul 2021 16:36:40 +0200
From: Stefan Esser <se@freebsd.org>
To: "Rodney W. Grimes" <freebsd-rwg@gndrsh.dnsmgr.net>, Warner Losh <imp@bsdimp.com>
Cc: "freebsd-arch@freebsd.org" <freebsd-arch@FreeBSD.org>
Subject: Re: FreeBSD awk behavior change proposal
Message-ID: <621331d0-b7bb-0365-23f7-999dd7155c19@freebsd.org>
In-Reply-To: <202107091321.169DLTZY041684@gndrsh.dnsmgr.net>
References: <202107091321.169DLTZY041684@gndrsh.dnsmgr.net>

On 09.07.21 at 15:21, Rodney W. Grimes wrote:
>> Greetings,
>>
>> I've posted https://reviews.freebsd.org/D31114 which eliminates the last
>> delta we have from upstream one-true-awk. This delta has basically been
>> rejected by upstream as being a really bad idea. Let me give some
>> background.
>>
>> In 2005, FreeBSD changed one-true-awk to honor the locale's collating order.
>> https://svnweb.freebsd.org/base/head/usr.bin/awk/b.c.diff?annotate=146322&pathrev=201988
>> This was billed as a temporary patch. It was also compatible with
>> the then-current behavior of gawk. That temporary patch has lasted 16
>> years now.
>>
>> However, IEEE Std 1003.1-2008 changed the behavior of ranges in regular
>> expressions outside of the "C" and "POSIX" locales to be undefined.
>>
>> Starting in 2011, gawk 4.0 stopped using the locale for range
>> regular expressions and used the traditional behavior only. The
>> maintainer had grown weary of answering why '[A-Z]' would sometimes
>> match lower-case letters. The details are explained here:
>> https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html
>>
>> To restore compatibility with other implementations of awk, revert this
>> patch. FreeBSD is the odd system out. Reverting also has the nice side
>> effect of eliminating the last of our differences with upstream
>> one-true-awk.
>>
>> I'd like to commit the change at least to -current. Ideally, I'd like to MFC
>> the change. I believe better compatibility with gawk and other awk
>> implementations justifies this change in behavior because the current
>> behavior is outside the mainstream enough to be considered a bug.
>>
>> I'd like to solicit input before I do this, however.
>
> My only concern is whether anything in the ports system gets
> tickled by this change. I know it's a PITA, but maybe have an exp-run
> done? I reviewed and accepted the differential, and by examination
> I do not see how this could cause an issue now, so meh, give it a long
> bake in -current and things should be ok.

While possible in theory, I do not see how the ports system could be
affected in practice.

Ports are built in a C/POSIX locale on the official builders. Using a
different locale and collating sequence on a user's system could
therefore break a port, but a non-C locale should never be a
requirement for building one.

I have checked the port Makefiles for occurrences of LANG or LC_*
outside specific command invocations (e.g. to set the locale for a
sort command). These are the results:

- ${USE_LOCALE} is used in bsd.port.mk, but the only case where a
  locale other than C or en_US.UTF-8 is specified is shells/fd, which
  has USE_LOCALE=ja (i.e. does not specify an encoding).

- ${ELIXIR_LOCALE} is used to set LANG and LC_ALL for USES=elixir.
  But ELIXIR_LOCALE is only ever set to en_US.UTF-8, AFAICT.

- print/libpaper explicitly requests LANG=C LC_ALL=C for AWK.

- The only port that requests a locale that is not en_US.UTF-8,
  en_US.ISO8859-1, or C is textproc/te-hunspell, which uses
  LANG=te_IN.utf8 LC_ALL=te_IN.utf8 to execute wordlist2hunspell.
  That applies only to this single shell script, which does not invoke
  AWK and which internally uses LC_ALL=C for sort and uniq so that
  they do not depend on an externally set locale.

All other cases where LC_* or LANG are used in port Makefiles are in
e.g. EXTRACT_CMD, TEST_ENV or in patch files. Those enforce a C or
C.UTF-8 locale (or en_US.*) and thus are not affected by the proposed
change to AWK (and often they only set the locale for a TAR file
extraction).

If an exp-run is planned for other reasons, the modified AWK could be
thrown in as a low-risk modification. But after performing a grep for
LANG and LC_* on the Makefiles and patch files, I do not see any
possible effect on the ports system.

Regards, STefan
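
For anyone who wants to repeat the Makefile check described above, here
is a minimal sketch in Python. It is not the grep that was actually run.
The function name, the set of locales treated as harmless (C, POSIX and
en_US.*), and the sample lines are all assumptions made for illustration;
the sample lines are invented and are not copied from any real port.

```python
import re

# Assumption: C, POSIX and en_US.* (with any encoding suffix) are treated
# as harmless, following the reasoning in the mail above.
HARMLESS = re.compile(r"^(C|POSIX)(\.[\w-]+)?$|^en_US(\.[\w-]+)?$")

# Matches LANG=value or LC_<NAME>=value anywhere in a line. The \b keeps
# names such as FOO_LANG from matching.
ASSIGNMENT = re.compile(r"\b(LANG|LC_[A-Z]+)=([^\s\"']+)")


def unusual_locales(lines):
    """Return (line_number, variable, value) for each non-harmless locale.

    Values that are make variables (e.g. ${SOME_LOCALE}) are reported too,
    because they need a manual look.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        for variable, value in ASSIGNMENT.findall(line):
            if not HARMLESS.match(value):
                hits.append((number, variable, value))
    return hits


# Hypothetical Makefile fragments, invented for this example.
sample = [
    "TEST_ENV=\tLC_ALL=C.UTF-8",
    "\t${SETENV} LANG=te_IN.utf8 LC_ALL=te_IN.utf8 ${SH} wordlist2hunspell",
    "EXTRACT_CMD=\t${SETENV} LC_ALL=en_US.UTF-8 ${TAR}",
]

if __name__ == "__main__":
    for number, variable, value in unusual_locales(sample):
        print(f"line {number}: {variable}={value}")
```

Every hit still has to be read by hand to see whether AWK runs under
that environment at all; the script only narrows down where to look.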