Date:      Fri, 9 Jul 2021 16:36:40 +0200
From:      Stefan Esser <se@freebsd.org>
To:        "Rodney W. Grimes" <freebsd-rwg@gndrsh.dnsmgr.net>, Warner Losh <imp@bsdimp.com>
Cc:        "freebsd-arch@freebsd.org" <freebsd-arch@FreeBSD.org>
Subject:   Re: FreeBSD awk behavior change proposal
Message-ID:  <621331d0-b7bb-0365-23f7-999dd7155c19@freebsd.org>
In-Reply-To: <202107091321.169DLTZY041684@gndrsh.dnsmgr.net>
References:  <202107091321.169DLTZY041684@gndrsh.dnsmgr.net>

On 09.07.21 at 15:21, Rodney W. Grimes wrote:
>> Greetings,
>>
>> I've posted https://reviews.freebsd.org/D31114 which eliminates the last
>> delta we have from upstream one-true-awk. This delta has basically been
>> rejected by upstream as being a really bad idea. Let me give some
>> background.
>>
>> In 2005, FreeBSD changed one-true-awk to honor the locale's collating order.
>> https://svnweb.freebsd.org/base/head/usr.bin/awk/b.c.diff?annotate=146322&pathrev=201988
>> This was billed as a temporary patch. It was also compatible with
>> the then-current behavior of gawk. That temporary patch has lasted 16
>> years now.
>>
>> However, IEEE Std 1003.1-2008 changed the behavior of ranges in regular
>> expressions outside of the "C" and "POSIX" locales to be undefined.
>>
>> Starting in 2011, gawk 4.0 stopped using the locale for the range
>> regular expressions and used the traditional behavior only. The
>> maintainer had grown weary of answering why '[A-Z]' would sometimes
>> match lower-case letters. The details are explained here:
>> https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html
>>
>> To restore compatibility with other implementations of awk, revert this
>> patch. FreeBSD is the odd system out. It also has the nice side effect
>> of eliminating the last of our differences with upstream one-true-awk.
>>
>> I'd like to commit the change at least to -current. Ideally, I'd like to MFC
>> the change. I believe better compatibility with gawk and other awk
>> implementations justifies this change in behavior because the current
>> behavior is outside the mainstream enough to be considered a bug.
>>
>> I'd like to solicit input before I do this, however.
>
> My only concern on this is whether anything in the ports system gets
> tickled by this change. I know it's a pita, but maybe have an exp-run
> done? I reviewed and accepted the differential, and by examination
> I do not see how this could cause an issue now, so meh, give it a long
> bake in -current and things should be ok.

While possible in theory, I do not see how the ports system could
be affected in practice.

Ports are built in a C/POSIX locale on the official builders. Using
a different locale and collating sequence on a user's system could
therefore break a port, but no port should ever require one.
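
To illustrate why the collating sequence matters for a range such as
[A-Z], here is a minimal Python sketch. It is not the one-true-awk or
libc code; the interleaved order "aAbB...zZ" is a hypothetical stand-in
for a real locale's collation table, and the function names are made up
for this example.

```python
import string

# Hypothetical collation order in which lower- and upper-case letters
# interleave (aAbB...zZ). Real locale tables differ in detail.
COLLATION = "".join(
    lo + up for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase)
)


def in_range_codepoint(ch, lo, hi):
    """Traditional (C locale) behavior: compare raw character codes."""
    return ord(lo) <= ord(ch) <= ord(hi)


def in_range_collation(ch, lo, hi):
    """Locale-style behavior: compare positions in the collation order."""
    if ch not in COLLATION:
        return False
    pos = COLLATION.index
    return pos(lo) <= pos(ch) <= pos(hi)


sample = "abBz"
matches_c = [c for c in sample if in_range_codepoint(c, "A", "Z")]
matches_locale = [c for c in sample if in_range_collation(c, "A", "Z")]

print(matches_c)       # ['B']
print(matches_locale)  # ['b', 'B', 'z']
```

With the assumed order, the range from "A" to "Z" swallows every
lower-case letter except "a", which is the kind of surprise the quoted
gawk documentation describes.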

I have checked the port Makefiles for occurrences of LANG or LC_*
outside specific command invocations (e.g. to set the locale for
a sort command). These are the results:

- ${USE_LOCALE} is used in bsd.port.mk, but the only case where
  a locale other than C or en_US.UTF-8 is specified is shells/fd
  which has USE_LOCALE=ja (i.e. does not specify an encoding).

- ${ELIXIR_LOCALE} is used to set LANG and LC_ALL for USES=elixir.
  But ELIXIR_LOCALE is only ever set to en_US.UTF-8, AFAICT.

- print/libpaper explicitly requests LANG=C LC_ALL=C for AWK.

- The only port that requests a locale that is not en_US.UTF-8,
  en_US.ISO8859-1, or C is textproc/te-hunspell, which uses
  LANG=te_IN.utf8 LC_ALL=te_IN.utf8 to execute wordlist2hunspell,
  but only for this single shell script, which does not invoke AWK
  and internally uses LC_ALL=C for sort and uniq so that they do
  not depend on an externally set locale.

All other cases where LC_* or LANG are used in port Makefiles are
in e.g. EXTRACT_CMD, TEST_ENV or in patch files, but those enforce
a C or C.UTF-8 locale (or en_US.*) and are thus not affected by the
proposed change to AWK (many of them only set the locale for a TAR
file extraction anyway).
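
A survey of this kind could be approximated with a small script. The
following Python sketch is not the exact command used for the results
above. The file-name filter (Makefile*, patch-*), the regular
expression, the set of "safe" locales and the names scan_tree and
is_safe are all assumptions made for illustration. Indirect settings
such as USE_LOCALE are not matched by this pattern, and values given as
make variables (e.g. ${ELIXIR_LOCALE}) are reported for manual review.

```python
import os
import re

# Matches direct assignments such as LANG=C or LC_ALL=te_IN.utf8.
ASSIGNMENT = re.compile(r"\b(LANG|LC_[A-Z]+)=(\S+)")

# Locales assumed to be unaffected by the proposed AWK change.
SAFE_EXACT = {"C", "POSIX", "C.UTF-8"}


def is_safe(value):
    """True for C/POSIX-style locales and any en_US.* locale."""
    return value in SAFE_EXACT or value.startswith("en_US.")


def scan_tree(root):
    """Return (path, line number, variable, value) for every LANG/LC_*
    assignment under root whose value is not considered safe."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if not (name.startswith("Makefile") or name.startswith("patch-")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                for lineno, line in enumerate(fh, 1):
                    for var, raw in ASSIGNMENT.findall(line):
                        value = raw.strip("\"'")
                        if not is_safe(value):
                            hits.append((path, lineno, var, value))
    return hits
```

Calling scan_tree on a ports checkout (for example scan_tree("/usr/ports"),
assuming the tree lives there) would list the assignments that deserve a
closer look. Each hit still has to be checked by hand to see whether AWK
is actually invoked under that locale.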

If an exp-run is planned for other reasons, the modified AWK
could be included as a low-risk addition.

But I do not see any possible effect on the ports system, after
performing a grep for LANG and LC_* on the Makefiles and patch
files.

Regards, STefan

