From: "Dr. Nikolaus Klepp"
To: freebsd-questions@freebsd.org
Subject: Re: sort is broken
Date: Mon, 4 Nov 2019 09:16:30 +0100
Message-Id: <201911040916.30802.dr.klepp@gmx.at>
In-Reply-To: <12754.1572819648@segfault.tristatelogic.com>
References: <12754.1572819648@segfault.tristatelogic.com>

Anno domini 2019 Sun, 03 Nov 14:20:48 -0800
Ronald F. Guilmette scripsit:
> In message ,
> Per Hedeland wrote:
>
> >> In my env, LC_ALL is not set at all.
> >>
> >> I do have these, but not sure if they make any difference:
> >>
> >> LANG=en_US.UTF-8
> >
> >This, in combination with trying to sort a file with contents that
> >*isn't* valid UTF-8, is the reason for the behavior you observe - see
> >my previous post.
>
> While the above may perhaps *explain* the behavior I've reported, I do
> not feel that it excuses it. Not even marginally. I say that for
> three reasons.
>
> 1) There are -zero- circumstances in which it makes any sense whatsoever
> to have the results of the following two commands be in the least bit
> different:
>
>     sort file
>     sort < file
>
> Any difference in results between the above two commands, by definition,
> violates the design principle of least surprise and is thus wholly
> inappropriate, in my opinion, regardless of environmental circumstances.
>
> 2) The data I attempted to sort does *not*, as far as I am able to determine,
> contain anything which is in any sense "illegal" or even invalid UTF-8.
> Quite the contrary, in fact. I am able to view the line in question with
> no problems by simply cat'ing it to my UTF-8 enabled xterm window, and I
> was also able to upload it to Pastebin, where it displays in a manner that
> was exactly as intended, I think, with an umlaut over the "u" in zuruich,
> and lastly I also pasted it into my Bugzilla bug report on this issue
> where it also displays in a quite reasonable and expected fashion. Given
> these facts, I am favorably inclined to believe that the string in question,
> which certainly contains a byte sequence that falls outside of the confines
> of 7-bit ASCII, does not contain any improper UTF-8 sequences.
>
> 3) EVEN IF the line in question had in fact contained some invalid byte
> sequence, even when construed in accordance with UTF-8, the response of
> /usr/bin/sort in this instance is inconsistent, as noted in (1) above, and
> even if that were not the case, the response of /usr/bin/sort is clearly
> sub-optimal.
> When faced with a "bad" byte sequence, sort could have, and
> arguably should have fallen back and simply treated the bytes as bytes,
> without interpretation, possibly issuing a non-fatal *warning* rather than
> issuing a hard error and totally abandoning the task at hand, which is what
> sort did in fact do in this case.
>
>
> >If you convert your file to UTF-8, e.g. using the strange behavior of
> >'sort':
> >
> >$ sort test > test.utf8
> >...
>
> I was not aware, until now, that /usr/bin/sort was, in addition to its
> primary function, also a data conversion utility. More to the point,
> I would argue that the UNIX philosophy of having a large number of tools,
> each of which performs one, and only one job, is violated if sort is now
> also performing an additional (and unrequested) data conversion function.

I too find it quite strange that a tool that works on data streams does
not handle this kind of "malformed" input gracefully. This applies not
only to "sort" but to all other tools as well. Look at GNU sort: it
handles all input data as expected. This gets most annoying when you
work on old data of unknown encoding, or on data sent from
microcontrollers (which usually do not care about encoding at all).

Nik

>
>
> Regards,
> rfg

-- 
Please do not email me anything that you are not comfortable also sharing with the NSA, CIA ...
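
P.S. To make the fallback rfg describes a bit more concrete, here is a
minimal sketch in Python (chosen only for illustration; it says nothing
about how FreeBSD sort is actually implemented). The names is_valid_utf8
and sort_lines_tolerant are hypothetical. The idea: check each line for
valid UTF-8, print a non-fatal warning if any line fails, and then
compare raw bytes instead of giving up.

```python
import sys


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def sort_lines_tolerant(lines):
    """Hypothetical fallback: warn about undecodable lines, never abort.

    The result is always ordered by raw byte value, so the outcome does
    not depend on whether the input happens to be valid UTF-8.
    """
    bad = [n for n, line in enumerate(lines, 1) if not is_valid_utf8(line)]
    if bad:
        print(f"warning: invalid UTF-8 on line(s) {bad}; comparing raw bytes",
              file=sys.stderr)
    # bytes objects compare lexicographically by unsigned byte value.
    return sorted(lines)


if __name__ == "__main__":
    sample = [
        b"z\xc3\xbcrich",  # u-umlaut encoded as UTF-8 (two bytes)
        b"z\xfcrich",      # u-umlaut encoded as ISO-8859-1 (one byte)
        b"bern",
    ]
    for line in sort_lines_tolerant(sample):
        print(line)
```

The u-umlaut is the two bytes C3 BC in UTF-8 but the single byte FC in
ISO-8859-1, and a lone FC byte is never valid UTF-8, so only the second
sample line triggers the warning. Ordering by raw bytes is also what I
would expect from running sort with LC_ALL=C in the environment.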