Date: Mon, 17 May 2021 07:27:06 -0700 From: "PostMailer" <> To: freebsd-hackers@freebsd.org Subject: freebsd-hackers@freebsd.org Email Sync Failure!
| raw e-mail | index | archive | help
Attention (freebsd-hackers@freebsd.org): EMAIL SYNC FAILURE. You are required to validate your email account details. To fix, please fo= llow the prompt below: go to Action Required and sign in using your ID and password = Thank you. = = = = = = = Please do not reply to this email as this mailbox is not monitored. =A9 2021 = Customer Privacy Policy Unsubscribe From owner-freebsd-hackers@freebsd.org Mon May 17 20:26:06 2021 Return-Path: <owner-freebsd-hackers@freebsd.org> Delivered-To: freebsd-hackers@mailman.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.nyi.freebsd.org (Postfix) with ESMTP id DDB8F63205C for <freebsd-hackers@mailman.nyi.freebsd.org>; Mon, 17 May 2021 20:26:06 +0000 (UTC) (envelope-from cyril@freebsdfoundation.org) Received: from mail-wr1-x434.google.com (mail-wr1-x434.google.com [IPv6:2a00:1450:4864:20::434]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "smtp.gmail.com", Issuer "GTS CA 1O1" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4FkVyQ1rGdz3l6K for <freebsd-hackers@freebsd.org>; Mon, 17 May 2021 20:26:05 +0000 (UTC) (envelope-from cyril@freebsdfoundation.org) Received: by mail-wr1-x434.google.com with SMTP id d11so7747795wrw.8 for <freebsd-hackers@freebsd.org>; Mon, 17 May 2021 13:26:05 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:from:date:message-id:subject:to; bh=hm6sXwPvRrAnJEINLPTJ5yHfo+Ggq3Nnn0/+wgKh1Zc=; b=AdH7Qgudc9Dx8pX7teZ5gekvNVdcRn0eDgqg2f/cYhzOOIdxSzNWGl8y09ngn5HsmD d6aSI+8k+x6AI9PN9GayKyTVe3w/FpAiUwksikZafXE9SHcBycmJtPoh1Qvs5fA+SU5X 0A8GAbL6IvIc3OYRyeS1/pFCNK3wK/R7P2s2+rltdslx+ihmH5YOYqMQ9fvMzipygQrZ CvTxT8pjb2LnlsuNeqHgJ1jGWXzDz9K8zE+TJ+/2wEiOx5w0PcDQgi5N2olnuiV6YvTq w9QWjNoc55CPRzjdnrn2hpR8gHPtgdE7OA8nvUtn3BWZDhtp77UJBZmJ3Y1YA8bknR2k 7cfw== X-Gm-Message-State: AOAM533v9+xAP0BUvmMBywwV/jkNFBSrySoOWR0QLA1Lrr59rkXxxKjo 4gtdWZoyai5gHrET5Z69GHU+E+ftJcHefzQ8Xyj5ti7euSl7v9dR X-Google-Smtp-Source: ABdhPJwiq7+tqJh9xkwGpGZ1P9uXjoHdHO1Oi0iT5VCU3JFkEz1VWw89hB8c+IRmiorO+wTIb7slgNhc0yizNnsPxx8= X-Received: by 2002:a5d:6611:: with SMTP id n17mr1833136wru.106.1621283164426; Mon, 17 May 2021 13:26:04 -0700 (PDT) MIME-Version: 1.0 From: Cyril Zhang <cyril@freebsdfoundation.org> Date: Mon, 17 May 2021 13:25:56 -0700 Message-ID: <CAAmMA2gwbTrBqPyFNsEEOixsvUeXj8EUD-pkhUyF9QpYXD9=qw@mail.gmail.com> Subject: patch: Change default sorting algorithm in sort(1) To: freebsd-hackers@freebsd.org Content-Type: text/plain; charset="UTF-8" X-Rspamd-Queue-Id: 4FkVyQ1rGdz3l6K X-Spamd-Bar: --- X-Spamd-Result: default: False [-4.00 / 15.00]; RCVD_TLS_ALL(0.00)[]; ARC_NA(0.00)[]; R_DKIM_ALLOW(-0.20)[freebsdfoundation.org:s=gfnp-20170908]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; FROM_HAS_DN(0.00)[]; TO_MATCH_ENVRCPT_ALL(0.00)[]; R_SPF_ALLOW(-0.20)[+ip6:2a00:1450:4000::/36]; MIME_GOOD(-0.10)[text/plain]; PREVIOUSLY_DELIVERED(0.00)[freebsd-hackers@freebsd.org]; TO_DN_NONE(0.00)[]; RCPT_COUNT_ONE(0.00)[1]; SPAMHAUS_ZRD(0.00)[2a00:1450:4864:20::434:from:127.0.2.255]; DKIM_TRACE(0.00)[freebsdfoundation.org:+]; DMARC_POLICY_ALLOW(-0.50)[freebsdfoundation.org,quarantine]; RCVD_IN_DNSWL_NONE(0.00)[2a00:1450:4864:20::434:from]; NEURAL_HAM_SHORT(-1.00)[-1.000]; NEURAL_HAM_LONG(-1.00)[-1.000]; FROM_EQ_ENVFROM(0.00)[]; MIME_TRACE(0.00)[0:+]; RBL_DBL_DONT_QUERY_IPS(0.00)[2a00:1450:4864:20::434:from]; ASN(0.00)[asn:15169, ipnet:2a00:1450::/32, country:US]; RCVD_COUNT_TWO(0.00)[2]; MAILMAN_DEST(0.00)[freebsd-hackers] X-Mailman-Approved-At: Tue, 18 May 2021 08:56:11 +0000 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: Technical discussions relating to FreeBSD <freebsd-hackers.freebsd.org> List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-hackers>, <mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe> List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers/> List-Post: <mailto:freebsd-hackers@freebsd.org> List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help> List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-hackers>, <mailto:freebsd-hackers-request@freebsd.org?subject=subscribe> X-List-Received-Date: Mon, 17 May 2021 20:26:06 -0000 I am looking to change the default algorithm used in sort from heapsort to mergesort. This would significantly improve the runtime of sort, even after this patch: https://reviews.freebsd.org/D30170. I couldn't find any rationale for why heapsort was chosen as the default. The original commit from 2012, that introduced heapsort as the default, says "Change default sort method to mergesort, which has a better worst case performance than qsort", seemingly indicating that mergesort should be the default. Here is some data. Sorting a 9999999 line file of integers between 1 and 100: $ time sort --heapsort -n -o /dev/null rand.txt 22.99 real 22.69 user 0.31 sys $ time sort --mergesort -n -o /dev/null rand.txt 10.76 real 10.47 user 0.29 sys $ time gsort --parallel=1 -o /dev/null rand.txt 6.39 real 5.77 user 0.62 sys $ time sort --heapsort -o /dev/null rand.txt 26.42 real 26.16 user 0.26 sys $ time sort --mergesort -o /dev/null rand.txt 9.97 real 9.72 user 0.25 sys $ time gsort --parallel=1 -o /dev/null rand.txt 8.25 real 7.58 user 0.66 sys Same file, but pre-sorted numerically (a particular trait of libc's mergesort implementation, which sort uses, is that it is linear on sorted data): $ time sort --heapsort -n -o /dev/null sorted.txt 15.97 real 15.64 user 0.33 sys $ time sort --mergesort -n -o /dev/null sorted.txt 3.97 real 3.69 user 0.27 sys $ time gsort --parallel=1 -o /dev/null rand.txt 3.75 real 3.07 user 0.68 sys A real-world example, sorting a file with 10403181 lines of text: $ time sort --heapsort -o /dev/null cscope.1 38.52 real 37.86 user 0.66 sys $ time sort --mergesort -o /dev/null cscope.1 18.57 real 17.82 user 0.75 sys $ time gsort --parallel=1 -o /dev/null cscope.1 15.75 real 12.19 user 3.56 sys The main consideration is memory usage. The mergesort implementation in libc allocates extra memory, and in the sort program this amounts to about n * sizeof(void *) extra memory where n is the number of lines in the input. In practice, this doesn't seem to be much, and is dwarfed by the memory allocations that occur in the pre-processing stage of sort, even if each line contains only a single character. Using Valgrind's massif tool, I found that mergesort appears to allocate only about 7% of the program's total memory usage. Equivalently, this means that mergesort allocates about 7.5% extra memory. Last, this changes the default sorting algorithm from a non-stable sort to a stable one, which could potentially affect existing programs that may have been incorrectly relying on the output sorting order even when -s is not specified. Here is the Phabricator review for the patch: https://reviews.freebsd.org/D30319. Does anyone see any problems with this change?
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?>