Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 15 Sep 2016 05:37:19 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Emmanuel Vadot <manu@freebsd.org>
Cc:        src-committers@freebsd.org, svn-src-all@freebsd.org,  svn-src-head@freebsd.org
Subject:   Re: svn commit: r305814 - head/sys/boot/common
Message-ID:  <20160915042715.N2557@besplex.bde.org>
In-Reply-To: <201609141743.u8EHhX7R038840@repo.freebsd.org>
References:  <201609141743.u8EHhX7R038840@repo.freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Wed, 14 Sep 2016, Emmanuel Vadot wrote:

> Log:
>  ufsread: Do not cast struct direct from void *
>  This cause alignment problem on ARM (and possibly other archs), instead copy it.
>
>  MFC after:	1 week
>
> Modified:
>  head/sys/boot/common/ufsread.c
>
> Modified: head/sys/boot/common/ufsread.c
> ==============================================================================
> --- head/sys/boot/common/ufsread.c	Wed Sep 14 16:47:17 2016	(r305813)
> +++ head/sys/boot/common/ufsread.c	Wed Sep 14 17:43:32 2016	(r305814)
> @@ -97,21 +97,21 @@ static __inline uint8_t
> fsfind(const char *name, ufs_ino_t * ino)
> {
> 	static char buf[DEV_BSIZE];
> -	struct direct *d;
> +	static struct direct d;
> 	char *s;
> 	ssize_t n;

This looks like a good pessimization for space.  boot2 on i386 has to
fit in 8192 bytes and has a negative number to spare (features are
already left out).

With auto buffers, and also builtin memcpy, the compiler can optimize
the memcpy to nothing on arches with no alignment, and should do this
with -Os.  I think -ffreestanding in boot programs kills the builtin,
and this is almost intentional since compilers have poor support for
-Os so the inline memcpy is sometimes larger unless it is null.

>
> 	fs_off = 0;
> 	while ((n = fsread(*ino, buf, DEV_BSIZE)) > 0)
> 		for (s = buf; s < buf + DEV_BSIZE;) {
> -			d = (void *)s;
> +			memcpy(&d, s, sizeof(struct direct));
> 			if (ls)
> -				printf("%s ", d->d_name);
> -			else if (!strcmp(name, d->d_name)) {
> -				*ino = d->d_ino;
> -				return d->d_type;
> +				printf("%s ", d.d_name);
> +			else if (!strcmp(name, d.d_name)) {
> +				*ino = d.d_ino;
> +				return d.d_type;
> 			}
> -			s += d->d_reclen;
> +			s += d.d_reclen;
> 		}
> 	if (n != -1 && ls)
> 		printf("\n");

The static buffer in the old code also looks like a pessimization for
space.  It doesn't seem to be needed to preserver or return results.
Compilers don't seem to be smart enough to see this and optimize it
to auto or vice versa.

Testing shows that the static buffer is space optimization for gcc
and a space pessimization for clang:

X #include <string.h>
X #include <unistd.h>
X 
X int
X foo(void)
X {
X 	static char buf[101];
X 	static int buf1[3];
X 
X 	read(0, buf, sizeof(buf));
X 	memcpy(buf1, &buf[2], sizeof(buf1));
X 	return (buf[0] + buf1[0]);
X }

2 static buffers give the best pessimizations for clang -Os -m32.
Even with -Os, clang prefers to use large instructions to reduce the
number of instructions.  The static buffers give largest instructions,
and also prevent optimizing away the memcpy().  With auto buf1, it
takes -ffreestanding to kill the optimization of memcpy() and with
that optimization the size is reduced by almost a factor of 2.  With
auto buf too, other pessimizations are applied to give a size reduction
of just 1 byte.  gcc-4.2.1 never inlines the memcpy(), but it does
the other pessimizations less well to give an in-between size.

Aproximate sizes on i386: 65 bytes for clang with statics, 49 for gcc,
33 for clang with autos and builtins, 28 for tweaked output for clang
with static buf, auto buf1 and builtins (null memcpy).

Probably the static buf is an intentional optimization for gcc.
Compilers do too good a job of generating large code for large stack
offsets.  In my test code, the offsets are about 101 which is less than
128 so it takes only 2 bytes.  DEV_BSIZE is 512 so it needs 5-byte
offsets from the usual pessimizations for the base pointer (these
are to point %ebp at the top of the variables and put all the scalar
variables below the buffer so that the offsets are largest).  But
clang also does good pessimizations for offsets from the static buffer:

 	pushl	$foo.buf	# 5 bytes
 	movsbl	foo.buf, %eax	# 7 bytes
 	addl	foo.buf+2, %eax	# 6 bytes

Actually optimizing for space:

 	movl	$foo.buf,%ebx	# 5 bytes
 	pushl	%ebx		# 1 byte
 	movsbl	(%ebx), %eax	# 3 bytes
 	addl	2(%ebx), %eax	# 3 bytes

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20160915042715.N2557>