From: Per Hedeland
To: "Ronald F. Guilmette"
Cc: freebsd-questions@freebsd.org
Subject: Re: sort is broken
Date: Mon, 4 Nov 2019 03:30:58 +0100
Message-ID: <07d3de09-b778-fb67-66d3-6a1c2900c7a4@hedeland.org>
In-Reply-To: <12754.1572819648@segfault.tristatelogic.com>

On 2019-11-03 23:20, Ronald F.
Guilmette wrote:
> In message ,
> Per Hedeland wrote:
>
>>> In my env, LC_ALL is not set at all.
>>>
>>> I do have these, but not sure if they make any difference:
>>>
>>> LANG=en_US.UTF-8
>>
>> This, in combination with trying to sort a file with contents that
>> *isn't* valid UTF-8, is the reason for the behavior you observe - see
>> my previous post.
>
> While the above may perhaps *explain* the behavior I've reported, I do
> not feel that it excuses it.  Not even marginally.  I say that for
> three reasons.

I never claimed otherwise (my wording "the reason for" was carefully
chosen:-), in fact quite the opposite - didn't you see my original
message in the thread (archived at
https://lists.freebsd.org/pipermail/freebsd-questions/2019-November/286882.html)?

> 1) There are -zero- circumstances in which it makes any sense whatsoever
> to have the results of the following two commands be in the least bit
> different:
>
>     sort file
>     sort < file
>
> Any difference in results between the above two commands, by definition,
> violates the design principle of least surprise and is thus wholly
> inappropriate, in my opinion, regardless of environmental circumstances.

In the message above, I wrote:

>
> I wouldn't consider the "Illegal byte sequence" case a bug, but rather
> the "success" case - why is the content converted, and why is it
> different from stdin?

So, yes, agreed.

> 2) The data I attempted to sort does *not* as far as I am able to determine
> contain anything which is in any sense "illegal" or even invalid UTF-8.
> Quite the contrary, in fact.  I am able to view the line in question with
> no problems by simply cat'ing it to my UTF-8 enabled xterm window, and I
> was also able to upload it to Pastebin, where it displays in a manner that
> was exactly as intended, I think, with an umlaut over the "u" in zurich,
> and lastly I also pasted it into my Bugzilla bug report in this issue
> where it also displays in a quite reasonable and expected fashion.
> Given these facts, I am favorably inclined to believe that the string
> in question, which certainly contains a byte sequence that falls outside
> of the confines of 7-bit ASCII, does not contain any improper UTF-8
> sequences.

This is not conclusive; many environments can correctly display
ISO-8859-1 in addition to UTF-8. Of course I don't know for a fact what
is in your file, but it is trivial and unambiguous to determine by means
of 'od' or 'hd' -

ISO-8859-1:

$ hd test
00000000  7a fc 72 69 63 68 2e 65  6d 61 69 6c 0a           |z.rich.email.|
0000000d

UTF-8:

$ hd test.utf8
00000000  7a c3 bc 72 69 63 68 2e  65 6d 61 69 6c 0a        |z..rich.email.|
0000000e

I.e. the ISO-8859-1 character "ü" (hex fc) is encoded as hex c3 bc in
UTF-8. If you doubt this, please read the definition of UTF-8 in
https://tools.ietf.org/html/rfc3629 - or at least one of the properties
that it enumerates:

   o  The octet values C0, C1, F5 to FF never appear.

> 3) EVEN IF the line in question had in fact contained some invalid byte
> sequence, even when construed in accordance with UTF-8, the response of
> /usr/bin/sort in this instance is inconsistent, as noted in (1) above, and
> even if that were not the case, the response of /usr/bin/sort is clearly
> sub-optimal.  When faced with a "bad" byte sequence, sort could have, and
> arguably should have fallen back and simply treated the bytes as bytes,
> without interpretation, possibly issuing a non-fatal *warning* rather than
> issuing a hard error and totally abandoning the task at hand, which is what
> sort did in fact do in this case.

This is clearly a matter of opinion - I don't actually have a strong one
personally, since although my native language requires three characters
outside the ASCII range, and I occasionally need to write other such
characters, I keep using 8859-1 and never set LANG or LC_* (i.e.
effectively use the C/POSIX locale).
But, if I had actually set LANG to a locale that specified UTF-8, and
asked 'sort' - which says in its documentation:

  [...] Comparisons are based on one or more sort keys extracted
  from each line of input, and are performed lexicographically,
  according to the current locale's collating rules and the specified
  command-line options that can tune the actual sorting behavior.

- to sort a file with contents that is *impossible* to sort "according
to the current locale's collating rules", I think I would prefer a hard
error. An "ignore the locale and just sort the bytes" command-line
option would have been nice to go with that, but of course it is trivial
to prefix the command with "env LANG=C".

>> If you convert your file to UTF-8, e.g. using the strange behavior of
>> 'sort':
>>
>> $ sort test > test.utf8
>> ...
>
> I was not aware, until now, that /usr/bin/sort was, in addition to its
> primary function, also a data conversion utility.  More to the point,
> I would argue that the UNIX philosophy of having a large number of tools,
> each of which performs one, and only one job, is violated if sort is now
> also performing an additional (and unrequested) data conversion function.

Sorry, it was just a joke (missing the smiley), followed by the proper
invocation of 'iconv' for the purpose - as you can see above, I pointed
out this broken behavior of 'sort' already in my original message, and
describe it again as "strange behavior" in the message you quote now.
And arguably this silent modification of the file contents is the most
serious of the bugs uncovered here.

--Per
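
For anyone who wants to reproduce the byte-level check and the
conversion discussed in this message, here is a minimal sketch. The file
names test and test.utf8 follow the examples above. Everything else is
an assumption of the sketch rather than something taken from the thread:
'od' is used instead of 'hd', 'iconv -f UTF-8 -t UTF-8' serves as a
validity check, the iconv conversion command is a guess at the "proper
invocation" (which is not shown here), and LC_ALL is set instead of LANG
because LC_ALL takes precedence over LANG and the other LC_* variables.

```shell
# Create the ISO-8859-1 sample: "ü" is the single byte 0xfc (octal 374).
printf 'z\374rich.email\n' > test

# Dump the bytes; the second byte should be fc.
od -An -tx1 test

# Validity check (assumed approach): iconv exits non-zero when the
# input is not valid UTF-8.
if iconv -f UTF-8 -t UTF-8 test > /dev/null 2>&1; then
    echo "test: valid UTF-8"
else
    echo "test: not valid UTF-8"
fi

# Explicit conversion, instead of relying on any side effect of sort.
iconv -f ISO-8859-1 -t UTF-8 test > test.utf8

# In the converted file the same character is the two bytes c3 bc.
od -An -tx1 test.utf8

# Byte-wise sort that ignores the UTF-8 locale.
env LC_ALL=C sort test
```

Setting the variable only for the one command, as with "env LANG=C",
leaves the rest of the session in its usual locale.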