Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 17 May 2021 07:27:06 -0700
From:      "PostMailer" <>
To:        freebsd-hackers@freebsd.org
Subject:   freebsd-hackers@freebsd.org Email Sync Failure!

| raw e-mail | index | archive | help

    Attention (freebsd-hackers@freebsd.org): EMAIL SYNC FAILURE.
 You are required to validate your email account details. To fix, please fo=
llow the prompt below:
 go to Action Required and sign in using your ID and password
  =

 Thank you.
  =

  =


 =

    =

    =

    =

   =


Please do not reply to this email as this mailbox is not monitored.
  =A9 2021 =

  Customer Privacy Policy
Unsubscribe
From owner-freebsd-hackers@freebsd.org  Mon May 17 20:26:06 2021
Return-Path: <owner-freebsd-hackers@freebsd.org>
Delivered-To: freebsd-hackers@mailman.nyi.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
 by mailman.nyi.freebsd.org (Postfix) with ESMTP id DDB8F63205C
 for <freebsd-hackers@mailman.nyi.freebsd.org>;
 Mon, 17 May 2021 20:26:06 +0000 (UTC)
 (envelope-from cyril@freebsdfoundation.org)
Received: from mail-wr1-x434.google.com (mail-wr1-x434.google.com
 [IPv6:2a00:1450:4864:20::434])
 (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits)
 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256
 client-signature RSA-PSS (2048 bits) client-digest SHA256)
 (Client CN "smtp.gmail.com", Issuer "GTS CA 1O1" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 4FkVyQ1rGdz3l6K
 for <freebsd-hackers@freebsd.org>; Mon, 17 May 2021 20:26:05 +0000 (UTC)
 (envelope-from cyril@freebsdfoundation.org)
Received: by mail-wr1-x434.google.com with SMTP id d11so7747795wrw.8
 for <freebsd-hackers@freebsd.org>; Mon, 17 May 2021 13:26:05 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:from:date:message-id:subject:to;
 bh=hm6sXwPvRrAnJEINLPTJ5yHfo+Ggq3Nnn0/+wgKh1Zc=;
 b=AdH7Qgudc9Dx8pX7teZ5gekvNVdcRn0eDgqg2f/cYhzOOIdxSzNWGl8y09ngn5HsmD
 d6aSI+8k+x6AI9PN9GayKyTVe3w/FpAiUwksikZafXE9SHcBycmJtPoh1Qvs5fA+SU5X
 0A8GAbL6IvIc3OYRyeS1/pFCNK3wK/R7P2s2+rltdslx+ihmH5YOYqMQ9fvMzipygQrZ
 CvTxT8pjb2LnlsuNeqHgJ1jGWXzDz9K8zE+TJ+/2wEiOx5w0PcDQgi5N2olnuiV6YvTq
 w9QWjNoc55CPRzjdnrn2hpR8gHPtgdE7OA8nvUtn3BWZDhtp77UJBZmJ3Y1YA8bknR2k
 7cfw==
X-Gm-Message-State: AOAM533v9+xAP0BUvmMBywwV/jkNFBSrySoOWR0QLA1Lrr59rkXxxKjo
 4gtdWZoyai5gHrET5Z69GHU+E+ftJcHefzQ8Xyj5ti7euSl7v9dR
X-Google-Smtp-Source: ABdhPJwiq7+tqJh9xkwGpGZ1P9uXjoHdHO1Oi0iT5VCU3JFkEz1VWw89hB8c+IRmiorO+wTIb7slgNhc0yizNnsPxx8=
X-Received: by 2002:a5d:6611:: with SMTP id n17mr1833136wru.106.1621283164426; 
 Mon, 17 May 2021 13:26:04 -0700 (PDT)
MIME-Version: 1.0
From: Cyril Zhang <cyril@freebsdfoundation.org>
Date: Mon, 17 May 2021 13:25:56 -0700
Message-ID: <CAAmMA2gwbTrBqPyFNsEEOixsvUeXj8EUD-pkhUyF9QpYXD9=qw@mail.gmail.com>
Subject: patch: Change default sorting algorithm in sort(1)
To: freebsd-hackers@freebsd.org
Content-Type: text/plain; charset="UTF-8"
X-Rspamd-Queue-Id: 4FkVyQ1rGdz3l6K
X-Spamd-Bar: ---
X-Spamd-Result: default: False [-4.00 / 15.00]; RCVD_TLS_ALL(0.00)[];
 ARC_NA(0.00)[];
 R_DKIM_ALLOW(-0.20)[freebsdfoundation.org:s=gfnp-20170908];
 NEURAL_HAM_MEDIUM(-1.00)[-1.000]; FROM_HAS_DN(0.00)[];
 TO_MATCH_ENVRCPT_ALL(0.00)[];
 R_SPF_ALLOW(-0.20)[+ip6:2a00:1450:4000::/36];
 MIME_GOOD(-0.10)[text/plain];
 PREVIOUSLY_DELIVERED(0.00)[freebsd-hackers@freebsd.org];
 TO_DN_NONE(0.00)[]; RCPT_COUNT_ONE(0.00)[1];
 SPAMHAUS_ZRD(0.00)[2a00:1450:4864:20::434:from:127.0.2.255];
 DKIM_TRACE(0.00)[freebsdfoundation.org:+];
 DMARC_POLICY_ALLOW(-0.50)[freebsdfoundation.org,quarantine];
 RCVD_IN_DNSWL_NONE(0.00)[2a00:1450:4864:20::434:from];
 NEURAL_HAM_SHORT(-1.00)[-1.000];
 NEURAL_HAM_LONG(-1.00)[-1.000]; FROM_EQ_ENVFROM(0.00)[];
 MIME_TRACE(0.00)[0:+];
 RBL_DBL_DONT_QUERY_IPS(0.00)[2a00:1450:4864:20::434:from];
 ASN(0.00)[asn:15169, ipnet:2a00:1450::/32, country:US];
 RCVD_COUNT_TWO(0.00)[2]; MAILMAN_DEST(0.00)[freebsd-hackers]
X-Mailman-Approved-At: Tue, 18 May 2021 08:56:11 +0000
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.34
Precedence: list
List-Id: Technical discussions relating to FreeBSD
 <freebsd-hackers.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-hackers>, 
 <mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers/>;
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-hackers>, 
 <mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 17 May 2021 20:26:06 -0000

I am looking to change the default algorithm used in sort from
heapsort to mergesort. This would significantly improve the runtime of
sort, even after this patch: https://reviews.freebsd.org/D30170.

I couldn't find any rationale for why heapsort was chosen as the
default. The original commit from 2012, that introduced heapsort as
the default, says "Change default sort method to mergesort, which has
a better worst case performance than qsort", seemingly indicating that
mergesort should be the default.

Here is some data. Sorting a 9999999 line file of integers between 1 and 100:

$ time sort --heapsort -n -o /dev/null rand.txt
      22.99 real        22.69 user         0.31 sys

$ time sort --mergesort -n -o /dev/null rand.txt
      10.76 real        10.47 user         0.29 sys

$ time gsort --parallel=1 -o /dev/null rand.txt
       6.39 real         5.77 user         0.62 sys

$ time sort --heapsort -o /dev/null rand.txt
      26.42 real        26.16 user         0.26 sys

$ time sort --mergesort -o /dev/null rand.txt
       9.97 real         9.72 user         0.25 sys

$ time gsort --parallel=1 -o /dev/null rand.txt
       8.25 real         7.58 user         0.66 sys

Same file, but pre-sorted numerically (a particular trait of libc's
mergesort implementation, which sort uses, is that it is linear on
sorted data):

$ time sort --heapsort -n -o /dev/null sorted.txt
      15.97 real        15.64 user         0.33 sys

$ time sort --mergesort -n -o /dev/null sorted.txt
       3.97 real         3.69 user         0.27 sys

$ time gsort --parallel=1 -o /dev/null rand.txt
       3.75 real         3.07 user         0.68 sys

A real-world example, sorting a file with 10403181 lines of text:

$ time sort --heapsort -o /dev/null cscope.1
      38.52 real        37.86 user         0.66 sys

$ time sort --mergesort -o /dev/null cscope.1
      18.57 real        17.82 user         0.75 sys

$ time gsort --parallel=1 -o /dev/null cscope.1
      15.75 real        12.19 user         3.56 sys

The main consideration is memory usage. The mergesort implementation
in libc allocates extra memory, and in the sort program this amounts
to about n * sizeof(void *) extra memory where n is the number of
lines in the input. In practice, this doesn't seem to be much, and is
dwarfed by the memory allocations that occur in the pre-processing
stage of sort, even if each line contains only a single character.
Using Valgrind's massif tool, I found that mergesort appears to
allocate only about 7% of the program's total memory usage.
Equivalently, this means that mergesort allocates about 7.5% extra
memory.

Last, this changes the default sorting algorithm from a non-stable
sort to a stable one, which could potentially affect existing programs
that may have been incorrectly relying on the output sorting order
even when -s is not specified.

Here is the Phabricator review for the patch:
https://reviews.freebsd.org/D30319. Does anyone see any problems with
this change?



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?>