Date: Sun, 3 Nov 2019 02:23:19 +0100
From: Per Hedeland <per@hedeland.org>
To: freebsd-questions@freebsd.org
Subject: Re: sort is broken
Message-ID: <f527bfae-4615-e2ab-2ddc-e5da657c7648@hedeland.org>
In-Reply-To: <201911022329.10843.dr.klepp@gmx.at>
References: <8221.1572732697@segfault.tristatelogic.com> <201911022329.10843.dr.klepp@gmx.at>

On 2019-11-02 23:29, Dr. Nikolaus Klepp wrote:
> Anno domini 2019 Sat, 02 Nov 15:11:37 -0700
> Ronald F. Guilmette scripsit:
>> In message <eec0b13b-b5d6-7e51-6241-8e1898150315@queldor.net>, you wrote:
>>
>>> On 11/2/19 5:14 PM, Ronald F. Guilmette wrote:
>>>> Not a question, just an expression of grief and deep dismay.
>>>>
>>>> It is a sad day when even very fundamental tools, used in billions
>>>> of scripts, such as /usr/bin/sort turn up broken.
>>>>
>>>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=241679
>>>>
>>>> Regards,
>>>> rfg
>>>>
>>>
>>> root@q4:/ # sort a
>>> zürich.email
>>> root@q4:/ # sort < a
>>> zürich.email
>>> root@q4:/ # uname -a
>>> FreeBSD q4.queldor.net 12.0-RELEASE-p3 FreeBSD 12.0-RELEASE-p3 GENERIC
>>> amd64
>>> root@q4:/ # cat a
>>> zürich.email
>>> root@q4:/ #
>>>
>>> Seems to be fine on my 12.0
>>
>> Well, I guess it's just me then...
>>
>> % uname -a
>> FreeBSD segfault.tristatelogic.com 12.0-RELEASE FreeBSD 12.0-RELEASE r341666 GENERIC amd64
>> % sort --version
>> 2.3-FreeBSD
>>
>>
>> What version of sort do you have?
>
> I remember that this sort of thing has been around since at least
> 11.0. The problem occurs when you have UTF-8 encoding set as default,
> but the input data is ISO 8859-1. Some characters of ISO 8859-1
> (äöü...) are not valid in UTF-8.

This is exactly the problem - in fact, by definition (see RFC 3629)
*no* characters with values outside the range 0x00 to 0x7f are valid
as single bytes in UTF-8 - this is the case for all 96 characters that
8859-1 defines above 0x7f (ü is 0xfc).

$ uname -a
FreeBSD pluto.hedeland.org 12.0-RELEASE FreeBSD 12.0-RELEASE GENERIC amd64
$ env LANG=C sort < /tmp/test
zürich.email
$ env LANG=en_US.UTF-8 sort < /tmp/test
sort: Illegal byte sequence

And the "success" case:

$ env LANG=en_US.UTF-8 sort /tmp/test
zürich.email

Not sure if it survives the e-mail encoding, but the output here has
actually been *converted* to the correct UTF-8 representation - if my
terminal was set up for UTF-8, I would actually see "ü" there.

$ od -t x1 /tmp/test
0000000 7a fc 72 69 63 68 2e 65 6d 61 69 6c 0a
0000015
$ env LANG=en_US.UTF-8 sort /tmp/test | od -t x1
0000000 7a c3 bc 72 69 63 68 2e 65 6d 61 69 6c 0a
0000016

I wouldn't consider the "Illegal byte sequence" case a bug, but rather
the "success" case - why is the content converted, and why is it
different from stdin?

--Per Hedeland
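
For anyone who wants to check the byte-level picture without a FreeBSD
box, here is a minimal Python sketch. It says nothing about what
sort(1) does internally; it only assumes the test file holds the 13
bytes shown by od in this message. The helper names (is_valid_utf8,
latin1_to_utf8, hex_dump) are made up for illustration. It checks that
a strict UTF-8 decoder rejects the 8859-1 bytes, and that decoding them
as 8859-1 and re-encoding as UTF-8 gives the c3 bc sequence seen in the
"success" case.

```python
# Bytes of /tmp/test as shown by "od -t x1" in the message (ISO 8859-1 text).
LATIN1_BYTES = bytes.fromhex("7a fc 72 69 63 68 2e 65 6d 61 69 6c 0a")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8 (illustrative helper)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def latin1_to_utf8(raw: bytes) -> bytes:
    """Reinterpret raw as ISO 8859-1 and re-encode it as UTF-8."""
    return raw.decode("iso-8859-1").encode("utf-8")


def hex_dump(raw: bytes) -> str:
    """Format bytes as space-separated hex pairs, like one row of od -t x1."""
    return " ".join(f"{b:02x}" for b in raw)


if __name__ == "__main__":
    converted = latin1_to_utf8(LATIN1_BYTES)
    print("original valid UTF-8: ", is_valid_utf8(LATIN1_BYTES))
    print("converted valid UTF-8:", is_valid_utf8(converted))
    print("converted bytes:      ", hex_dump(converted))
```

The single byte 0xfc becomes the two-byte sequence c3 bc, which is why
the converted output is one byte longer than the input (octal 16 versus
octal 15 in the od offsets).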