Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 11 Jun 2018 15:48:52 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Konstantin Belousov <kib@freebsd.org>
Cc:        src-committers@freebsd.org, svn-src-all@freebsd.org,  svn-src-head@freebsd.org
Subject:   Re: svn commit: r334928 - head/lib/libc/stdlib
Message-ID:  <20180611141715.W1182@besplex.bde.org>
In-Reply-To: <201806101754.w5AHsi4r061028@repo.freebsd.org>
References:  <201806101754.w5AHsi4r061028@repo.freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Sun, 10 Jun 2018, Konstantin Belousov wrote:

> Log:
>  libc qsort(3): stop aliasing.
>
>  Qsort swap code aliases the sorted array elements to ints and longs in
>  order to do swap by machine words.  Unfortunately this breaks with the
>  full code optimization, e.g. LTO.
>
>  See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83201 which seems to
>  reference code directly copied from libc/stdlib/qsort.c.

This can be fixed without much churn or any loss of efficiency using
memcpy().

However, optimization of this was silly, since he slow part of qsort() is
doing many more comparisons than swaps, with relative slowness of comparisons
guranteed by them being in extern functions.

> Modified: head/lib/libc/stdlib/qsort.c
> ==============================================================================
> --- head/lib/libc/stdlib/qsort.c	Sun Jun 10 16:44:18 2018	(r334927)
> +++ head/lib/libc/stdlib/qsort.c	Sun Jun 10 17:54:44 2018	(r334928)
> @@ -43,53 +43,27 @@ typedef int		 cmp_t(void *, const void *, const void *
> typedef int		 cmp_t(const void *, const void *);
> #endif
> static inline char	*med3(char *, char *, char *, cmp_t *, void *);
> -static inline void	 swapfunc(char *, char *, size_t, int, int);
>
> #define	MIN(a, b)	((a) < (b) ? a : b)
>
> /*
>  * Qsort routine from Bentley & McIlroy's "Engineering a Sort Function".
>  */
> -#define	swapcode(TYPE, parmi, parmj, n) {		\
> -	size_t i = (n) / sizeof (TYPE);			\
> -	TYPE *pi = (TYPE *) (parmi);		\
> -	TYPE *pj = (TYPE *) (parmj);		\
> -	do { 						\
> -		TYPE	t = *pi;		\
> -		*pi++ = *pj;				\
> -		*pj++ = t;				\
> -	} while (--i > 0);				\
> -}

I doubt that Bentley & McIlroy wrote this silly optimization.

Change TYPE to unsigned char to fix this with minimal churn and the same
pessimizations as in this commit.

Change TYPE to unsigned char[sizeof(old TYPE)] and use memcpy() to assign
the variables to fix this with small churn and without the pessimizations
in this commit.  This gives portable code back to K&R1978 + memcpy(), and
is as efficient as the above when the above works.  On x86, even gcc-3.3
produces a load and store for memcpy(p, q, sizeof(register_t)) when p and
q are void *.

Change TYPE to unsigned char[n] and use a single memcpy() without a loop
to assign the variables to fix this with larger churn and with different
(smaller?) pessimizations than in this commit.  Setup and initialization
for this method is also simpler.  This uses the newfangled VLA feature,
and since n is variable for qsort(), builtin memcpy() with length n
doesn't work so well.

> -#define	SWAPINIT(TYPE, a, es) swaptype_ ## TYPE =	\
> -	((char *)a - (char *)0) % sizeof(TYPE) ||	\
> -	es % sizeof(TYPE) ? 2 : es == sizeof(TYPE) ? 0 : 1;
-
> static inline void
> -swapfunc(char *a, char *b, size_t n, int swaptype_long, int swaptype_int)
> +swapfunc(char *a, char *b, size_t es)
> {
> -	if (swaptype_long <= 1)
> -		swapcode(long, a, b, n)
> -	else if (swaptype_int <= 1)
> -		swapcode(int, a, b, n)
> -	else
> -		swapcode(char, a, b, n)
> +	char t;
> +
> +	do {
> +		t = *a;
> +		*a++ = *b;
> +		*b++ = t;
> +	} while (--es > 0);
> }

Copying bytewise asks for pessimal code.  Since es is variable, the
compiler can't usefully replace this by inline memcpy()'s, and it probably
shouldn't bother combining the bytes either.  clang actually combines the
bytes and inlines the result, but doesn't inline the memcpy()s.

> ...

Test program:

XX #include <string.h>
XX 
XX volatile char qq, rr;;
XX 
XX static inline void
XX slowswap(void *iap, void *ibp, size_t len)
XX {
XX 	unsigned char *ap, *bp, tmp;
XX 
XX 	for (ap = iap, bp = ibp; len != 0; ap++, bp++, len--) {
XX 		tmp = *ap;
XX 		*ap = *bp;
XX 		*bp = tmp;
XX 	}
XX }
XX 
XX static inline void
XX swap(void *ap, void *bp, int len)
XX {
XX 	unsigned char tmp[len];
XX 
XX 	memcpy(&tmp, ap, len);
XX 	memcpy(ap, bp, len);
XX 	memcpy(bp, &tmp, len);
XX }
XX 
XX int a, b;
XX void *p, *q;
XX size_t plen;
XX 
XX void
XX foo(void)
XX {
XX 	swap(&a, &b, sizeof(a));
XX 	swap(p, q, plen);
XX }
XX 
XX 
XX void
XX slowfoo(void)
XX {
XX 	slowswap(&a, &b, sizeof(a));
XX 	slowswap(p, q, plen);
XX }

clang generates very good code for swap() on small fixed-size data.  It
inlines the memcpy()s to load-store (using AVX for larger sizes if available).

clang generates not so good code for slowswap() on small fixed-sized data.
It inlines the 4-byte swap of 'a' and 'b', but does this weirdly using
16-bit and 8-bit register loads and stores instead of 32-bit ones.  This
takes 5 loads and 5 stores but should take 2 loads and 2 stores.

clang generates not so good code for swap() on unknown-size data.  It
just calls memcpy() from inlined swap() into foo().

clang generates sophisticated code for slowswap() on unknown-size data.
(All this is in x86, mostly on amd64.)  Without AVX, but with XMM, it
combines the bytes into blocks of size 64, 32, ..., 1 and inlines
everything.  This gives enormous code which is probably slower for
small sizes than simpler methods.

gcc-4.2.1 generates very bad code for swap() on small fixed-size data.
It doesn't inline the memcpy()s or even swap(), and the code in the
extern swap() has too much setup.  Similarly for unknown-size data.
The first case can be fixed using __always_inline instead of inline.
Using macros instead of inlines as in qsort should work too.

gcc-4.2.1 generates slow code for slowswap() on small fixed-size data.
It just inlines slowswap() into a bytewise loop.  Similarly for unknown-
size data.

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20180611141715.W1182>