Subject: Re: sort is broken
To: freebsd-questions@freebsd.org
From: Per Hedeland
Date: Sun, 3 Nov 2019 02:23:19 +0100
In-Reply-To: <201911022329.10843.dr.klepp@gmx.at>
References: <8221.1572732697@segfault.tristatelogic.com> <201911022329.10843.dr.klepp@gmx.at>

On 2019-11-02 23:29, Dr. Nikolaus Klepp wrote:
> Anno domini 2019 Sat, 02 Nov 15:11:37 -0700
> Ronald F. Guilmette scripsit:
>> In message , you wrote:
>>
>>>
>>> On 11/2/19 5:14 PM, Ronald F. Guilmette wrote:
>>>> Not a question, just an expression of grief and deep dismay.
>>>>
>>>> It is a sad day when even very fundamental tools, used in billions
>>>> of scripts, such as /usr/bin/sort turn up broken.
>>>>
>>>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=241679
>>>>
>>>> Regards,
>>>> rfg
>>>>
>>>
>>> root@q4:/ # sort a
>>> zürich.email
>>> root@q4:/ # sort < a
>>> zürich.email
>>> root@q4:/ # uname -a
>>> FreeBSD q4.queldor.net 12.0-RELEASE-p3 FreeBSD 12.0-RELEASE-p3 GENERIC
>>> amd64
>>> root@q4:/ # cat a
>>> zürich.email
>>> root@q4:/ #
>>>
>>> Seems to be fine on my 12.0
>>
>> Well, I guess it's just me then...
>>
>> % uname -a
>> FreeBSD segfault.tristatelogic.com 12.0-RELEASE FreeBSD 12.0-RELEASE r341666 GENERIC amd64
>> % sort --version
>> 2.3-FreeBSD
>>
>>
>> What version of sort do you have?
>
> I remember that this sort of thing is around since at least 11.0. The problem occurs, when you have UFT-8 encoding set as default, but the input data is iso 8859-1. Some characters of iso 8859-1 (äöü...) are not valid in UTF-8.

This is exactly the problem. In fact, by definition (see RFC 3629), *no*
byte with a value outside the range 0x00 to 0x7f is a valid character on
its own in UTF-8, which rules out every 8859-1 character from 0x80
upward (ü is 0xfc).

$ uname -a
FreeBSD pluto.hedeland.org 12.0-RELEASE FreeBSD 12.0-RELEASE GENERIC amd64
$ env LANG=C sort < /tmp/test
zürich.email
$ env LANG=en_US.UTF-8 sort < /tmp/test
sort: Illegal byte sequence

And the "success" case:

$ env LANG=en_US.UTF-8 sort /tmp/test
zürich.email

Not sure if it survives the e-mail encoding, but the output here has
actually been *converted* to the correct UTF-8 representation - if my
terminal were set up for UTF-8, I would actually see "ü" there.

$ od -t x1 /tmp/test
0000000 7a fc 72 69 63 68 2e 65 6d 61 69 6c 0a
0000015
$ env LANG=en_US.UTF-8 sort /tmp/test | od -t x1
0000000 7a c3 bc 72 69 63 68 2e 65 6d 61 69 6c 0a
0000016

I wouldn't call the "Illegal byte sequence" case the bug, but rather the
"success" case: why is the content converted, and why does reading a
named file behave differently from reading stdin?

--Per Hedeland
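
The byte-level point can be checked independently of sort(1). The following is a minimal sketch in Python, assuming only the 13 bytes shown in the first od dump; the names latin1_bytes, utf8_bytes and is_valid_utf8 are made up for illustration, and nothing here claims to reproduce what sort does internally. It shows that the 8859-1 input is not valid UTF-8 as-is, and that transcoding it yields the same bytes as the second od dump.

```python
# The 13 bytes of /tmp/test as shown by "od -t x1": "zürich.email\n" in ISO 8859-1.
latin1_bytes = bytes.fromhex("7a fc 72 69 63 68 2e 65 6d 61 69 6c 0a")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Interpreting the input as ISO 8859-1 and re-encoding it as UTF-8 turns
# the single byte 0xfc into the two-byte sequence 0xc3 0xbc.
utf8_bytes = latin1_bytes.decode("iso-8859-1").encode("utf-8")

print(is_valid_utf8(latin1_bytes))  # False: 0xfc cannot appear in UTF-8
print(utf8_bytes.hex(" "))          # 7a c3 bc 72 69 63 68 2e 65 6d 61 69 6c 0a
```

The first result corresponds to the "Illegal byte sequence" error on stdin; the second matches the 14 bytes that came out of the "success" case, which is what suggests an 8859-1 to UTF-8 conversion happened somewhere along the way.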