Date: Thu, 18 Jun 2009 20:29:31 GMT From: Jay Patrick Howard <jhoward@alumni.utexas.net> To: freebsd-gnats-submit@FreeBSD.org Subject: kern/135718: [PATCH] enhance qsort to properly handle 32-bit aligned data on 64-bit systems Message-ID: <200906182029.n5IKTVef007120@www.freebsd.org> Resent-Message-ID: <200906182030.n5IKU4HE065187@freefall.freebsd.org>
next in thread | raw e-mail | index | archive | help
>Number: 135718
>Category: kern
>Synopsis: [PATCH] enhance qsort to properly handle 32-bit aligned data on 64-bit systems
>Confidential: no
>Severity: serious
>Priority: low
>Responsible: freebsd-bugs
>State: open
>Quarter:
>Keywords:
>Date-Required:
>Class: sw-bug
>Submitter-Id: current-users
>Arrival-Date: Thu Jun 18 20:30:04 UTC 2009
>Closed-Date:
>Last-Modified:
>Originator: Jay Patrick Howard
>Release: n/a
>Organization:
>Environment:
>Description:
The stdlib qsort() code in FreeBSD is largely taken from the paper "Engineering a Sort Function" by Bentley & McIlroy (1993).
In that paper, the authors employ a clever a scheme for selecting a method of swapping elements. Basically it goes like this:
1. If the base pointer or element size is not aligned to sizeof(long) then swap byte-by-byte, else
2. If the element size is exactly sizeof(long) then perform a single inline swap, else
3. perform a long-by-long swap.
The implicit assumption here is that if the base pointer or element size isn't aligned to sizeof(long) then one can't do any better than a char-by-char swap.
While this seems to be true on most 32-bit systems, it is not the case on at least some 64-bit systems. In particular, x86-64.
Consequently, sorting data that is 32-bit aligned (but not 64-bit aligned) is much slower on 64-bit systems compared to 32-bit systems. This is because in 32-bit mode the qsort() logic uses a long-by-long swap (since the data is aligned) while in 64-bit mode qsort() drops down to a char-by-char swap.
It is true that most workloads on 64-bit systems will be 64-bit aligned. However, it is fairly common for 64-bit code to need to process binary data that wasn't generated on a 64-bit system and hence may not be 64-bit aligned. Because this is fairly common it could be a decent "win" for qsort() to support fast swapping for such 32-bit aligned workloads.
In my testing, sorting 64 MB worth of 100-byte records (each with a 12-byte key) took 1.8x as long on a 64-bit system as it did on a 32-bit system, with identical hardware. With a patched qsort the performance was identical on both 32-bit and 64-bit versions of the code.
The patch is written such that if sizeof(long) == sizeof(int) then it acts exactly like the current version. The additional swap behavior is only enabled when sizeof(long) > sizeof(int).
The extra overhead from the sizeof(int) alignment check was negligible. Given the way SWAPINIT is structured, there is no additional overhead whatsoever when the data is 64-bit aligned. The only time additional overhead is incurred is when data is NOT 64-bit aligned, in which case the extra alignment check quite is likely to provide a significant speedup.
>How-To-Repeat:
Sort records of size 8*n+4 bytes from within 64-bit code. Then sort the same data from within 32-bit code. The 64-bit version should take approximately 1.8x as long to execute.
>Fix:
See attached patch modifying /src/lib/libc/stdlib/qsort.c
Patch attached with submission follows:
--- qsort.c 2009-06-18 13:32:58.000000000 -0500
+++ qsort.c.patched 2009-06-18 14:22:02.000000000 -0500
@@ -34,6 +34,7 @@
__FBSDID("$FreeBSD: src/lib/libc/stdlib/qsort.c,v 1.15 2008/01/14 09:21:34 das Exp $");
#include <stdlib.h>
+#include <limits.h>
#ifdef I_AM_QSORT_R
typedef int cmp_t(void *, const void *, const void *);
@@ -59,8 +60,15 @@
} while (--i > 0); \
}
-#define SWAPINIT(a, es) swaptype = ((char *)a - (char *)0) % sizeof(long) || \
+#if LONG_BIT > WORD_BIT
+#define SWAPINIT(a, es) swaptype = ((char *)a - (char *)0) % sizeof(long) || \
+ es % sizeof(long) ? ((char *)a - (char *)0) % sizeof(int) || es % \
+ sizeof(int) ? 3 : 2 : es == sizeof(long)? 0 : 1;
+#else
+#define SWAPINIT(a, es) swaptype = ((char *)a - (char *)0) % sizeof(long) || \
es % sizeof(long) ? 2 : es == sizeof(long)? 0 : 1;
+#endif
+
static inline void
swapfunc(a, b, n, swaptype)
@@ -69,6 +77,10 @@
{
if(swaptype <= 1)
swapcode(long, a, b, n)
+#if LONG_BIT > WORD_BIT
+ else if(swaptype <= 2)
+ swapcode(int, a, b, n)
+#endif
else
swapcode(char, a, b, n)
}
>Release-Note:
>Audit-Trail:
>Unformatted:
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200906182029.n5IKTVef007120>
