Date: Mon, 11 Jun 2018 15:48:52 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Konstantin Belousov <kib@freebsd.org>
Cc: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r334928 - head/lib/libc/stdlib
Message-ID: <20180611141715.W1182@besplex.bde.org>
In-Reply-To: <201806101754.w5AHsi4r061028@repo.freebsd.org>
References: <201806101754.w5AHsi4r061028@repo.freebsd.org>
On Sun, 10 Jun 2018, Konstantin Belousov wrote:

> Log:
>   libc qsort(3): stop aliasing.
>
>   Qsort swap code aliases the sorted array elements to ints and longs in
>   order to do swap by machine words.  Unfortunately this breaks with the
>   full code optimization, e.g. LTO.
>
>   See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83201 which seems to
>   reference code directly copied from libc/stdlib/qsort.c.

This can be fixed without much churn or any loss of efficiency using
memcpy().  However, optimizing this was silly, since the slow part of
qsort() is doing many more comparisons than swaps, with the relative
slowness of comparisons guaranteed by them being in extern functions.

> Modified: head/lib/libc/stdlib/qsort.c
> ==============================================================================
> --- head/lib/libc/stdlib/qsort.c	Sun Jun 10 16:44:18 2018	(r334927)
> +++ head/lib/libc/stdlib/qsort.c	Sun Jun 10 17:54:44 2018	(r334928)
> @@ -43,53 +43,27 @@ typedef int	 cmp_t(void *, const void *, const void *
>  typedef int		 cmp_t(const void *, const void *);
>  #endif
>  static inline char	*med3(char *, char *, char *, cmp_t *, void *);
> -static inline void	 swapfunc(char *, char *, size_t, int, int);
>
>  #define	MIN(a, b)	((a) < (b) ? a : b)
>
>  /*
>   * Qsort routine from Bentley & McIlroy's "Engineering a Sort Function".
>   */
> -#define	swapcode(TYPE, parmi, parmj, n) {		\
> -	size_t i = (n) / sizeof (TYPE);			\
> -	TYPE *pi = (TYPE *) (parmi);			\
> -	TYPE *pj = (TYPE *) (parmj);			\
> -	do {						\
> -		TYPE	t = *pi;			\
> -		*pi++ = *pj;				\
> -		*pj++ = t;				\
> -	} while (--i > 0);				\
> -}

I doubt that Bentley & McIlroy wrote this silly optimization.

Change TYPE to unsigned char to fix this with minimal churn and the same
pessimizations as in this commit.

Change TYPE to unsigned char[sizeof(old TYPE)] and use memcpy() to
assign the variables to fix this with small churn and without the
pessimizations in this commit.  This gives portable code back to
K&R1978 + memcpy(), and is as efficient as the above when the above
works.  On x86, even gcc-3.3 produces a load and store for
memcpy(p, q, sizeof(register_t)) when p and q are void *.

Change TYPE to unsigned char[n] and use a single memcpy() without a loop
to assign the variables to fix this with larger churn and with different
(smaller?) pessimizations than in this commit.  Setup and initialization
for this method is also simpler.  This uses the newfangled VLA feature,
and since n is variable for qsort(), builtin memcpy() with length n
doesn't work so well.

> -#define	SWAPINIT(TYPE, a, es) swaptype_ ## TYPE =	\
> -	((char *)a - (char *)0) % sizeof(TYPE) ||	\
> -	es % sizeof(TYPE) ? 2 : es == sizeof(TYPE) ? 0 : 1;
> -
>  static inline void
> -swapfunc(char *a, char *b, size_t n, int swaptype_long, int swaptype_int)
> +swapfunc(char *a, char *b, size_t es)
>  {
> -	if (swaptype_long <= 1)
> -		swapcode(long, a, b, n)
> -	else if (swaptype_int <= 1)
> -		swapcode(int, a, b, n)
> -	else
> -		swapcode(char, a, b, n)
> +	char t;
> +
> +	do {
> +		t = *a;
> +		*a++ = *b;
> +		*b++ = t;
> +	} while (--es > 0);
>  }

Copying bytewise asks for pessimal code.  Since es is variable, the
compiler can't usefully replace this by inline memcpy()s, and it
probably shouldn't bother combining the bytes either.  clang actually
combines the bytes and inlines the result, but doesn't inline the
memcpy()s.

> ...
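To make the second variant concrete, here is an untested sketch of a
memcpy()-based swapcode() (macro and variable names are only
illustrative; it assumes <string.h> is included):

XX /*
XX  * Untested sketch: same structure as the old swapcode(), but the
XX  * assignments go through a fixed-size buffer and memcpy(), so there
XX  * is no aliasing and compilers can inline each memcpy() of size
XX  * sizeof(TYPE) to a plain load-store.
XX  */
XX #define	swapcode(TYPE, parmi, parmj, n) {		\
XX 	size_t i = (n) / sizeof(TYPE);			\
XX 	char *pi = (parmi);				\
XX 	char *pj = (parmj);				\
XX 	do {						\
XX 		unsigned char t[sizeof(TYPE)];		\
XX 							\
XX 		memcpy(t, pi, sizeof(TYPE));		\
XX 		memcpy(pi, pj, sizeof(TYPE));		\
XX 		memcpy(pj, t, sizeof(TYPE));		\
XX 		pi += sizeof(TYPE);			\
XX 		pj += sizeof(TYPE);			\
XX 	} while (--i > 0);				\
XX }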
Test program:

XX #include <string.h>
XX
XX volatile char qq, rr;
XX
XX static inline void
XX slowswap(void *iap, void *ibp, size_t len)
XX {
XX 	unsigned char *ap, *bp, tmp;
XX
XX 	for (ap = iap, bp = ibp; len != 0; ap++, bp++, len--) {
XX 		tmp = *ap;
XX 		*ap = *bp;
XX 		*bp = tmp;
XX 	}
XX }
XX
XX static inline void
XX swap(void *ap, void *bp, int len)
XX {
XX 	unsigned char tmp[len];
XX
XX 	memcpy(&tmp, ap, len);
XX 	memcpy(ap, bp, len);
XX 	memcpy(bp, &tmp, len);
XX }
XX
XX int a, b;
XX void *p, *q;
XX size_t plen;
XX
XX void
XX foo(void)
XX {
XX 	swap(&a, &b, sizeof(a));
XX 	swap(p, q, plen);
XX }
XX
XX void
XX slowfoo(void)
XX {
XX 	slowswap(&a, &b, sizeof(a));
XX 	slowswap(p, q, plen);
XX }

clang generates very good code for swap() on small fixed-size data.  It
inlines the memcpy()s to load-store (using AVX for larger sizes if
available).

clang generates not so good code for slowswap() on small fixed-size
data.  It inlines the 4-byte swap of 'a' and 'b', but does this weirdly
using 16-bit and 8-bit register loads and stores instead of 32-bit ones.
This takes 5 loads and 5 stores but should take 2 loads and 2 stores.

clang generates not so good code for swap() on unknown-size data.  It
just calls memcpy() from the swap() inlined into foo().

clang generates sophisticated code for slowswap() on unknown-size data.
(All this is on x86, mostly on amd64.)  Without AVX, but with XMM, it
combines the bytes into blocks of size 64, 32, ..., 1 and inlines
everything.  This gives enormous code which is probably slower for small
sizes than simpler methods.

gcc-4.2.1 generates very bad code for swap() on small fixed-size data.
It doesn't inline the memcpy()s or even swap(), and the code in the
extern swap() has too much setup.  Similarly for unknown-size data.  The
first case can be fixed using __always_inline instead of inline (see the
sketch below).  Using macros instead of inlines as in qsort should work
too.

gcc-4.2.1 generates slow code for slowswap() on small fixed-size data.
It just inlines slowswap() as a bytewise loop.  Similarly for
unknown-size data.
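For reference, an untested sketch of that fix, with the always_inline
attribute spelled out directly rather than via a <sys/cdefs.h> macro:

XX /*
XX  * Untested sketch: same swap() as above, but force inlining on
XX  * compilers that support the always_inline attribute (gcc and
XX  * clang), so gcc-4.2.1 doesn't leave swap() as an extern call.
XX  */
XX static inline __attribute__((__always_inline__)) void
XX swap(void *ap, void *bp, int len)
XX {
XX 	unsigned char tmp[len];
XX
XX 	memcpy(&tmp, ap, len);
XX 	memcpy(ap, bp, len);
XX 	memcpy(bp, &tmp, len);
XX }

Bruce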