Date:      Mon, 30 Jan 2012 22:39:37 -0700
From:      Warner Losh <imp@bsdimp.com>
To:        Ian Lepore <freebsd@damnhippie.dyndns.org>
Cc:        freebsd-arm@FreeBSD.org
Subject:   Re: Performance of SheevaPlug on 8-stable
Message-ID:  <F48E21E0-129A-418A-B147-7D5FB01160A8@bsdimp.com>
In-Reply-To: <1327980703.1662.240.camel@revolution.hippie.lan>
References:  <1327980703.1662.240.camel@revolution.hippie.lan>

Hi Ian,

Do you have any data on what 9.0 does?

Warner


On Jan 30, 2012, at 8:31 PM, Ian Lepore wrote:

> I would like to revive and add to this old topic.  I'm using the
> original subject line of the threads from long ago just to help out
> anyone searching for info in the future; we ran into the problem on
> Atmel at91rm9200 hardware rather than SheevaPlug.  The original threads
> are archived here:
>
>  http://lists.freebsd.org/pipermail/freebsd-arm/2010-March/002243.html
>  http://lists.freebsd.org/pipermail/freebsd-arm/2010-November/002635.html
>
> To summarize them... Ever since 8.0, performance of userland code on arm
> systems using VIVT cache ranges from bad to unusable, with symptoms that
> tend to be hard to nail down definitively.  Much of the evidence pointed
> to the instruction cache being disabled on some pages of apps and shared
> libraries, sometimes.  Mark Tinguely explained about how and why it's
> necessary to disable caching on a page when there are multiple mappings
> and at least one is a writable mapping.  There were some patches of
> pmap-layer code that had visible effects but never really fixed the
> problem.  I don't think anybody ever definitively nailed down why some
> executable pages seem to permanently lose their icache enable bit.
>
> I tracked down the cause and developed a workaround (I'll post patches),
> but to really fix the problem I would need a lot of help from VM/VFS
> gurus.
>
> I apologize in advance for a bit of hand-waving in what follows here.
> It was months ago that I was immersed in this problem; now I'm working
> from a few notes and a fading memory.  I figured I'd better just post
> before it fades completely, and hopefully some ensuing discussion will
> help me remember more details.  I also still have a couple sandboxes
> built with instrumented code that I could dust off and run with, to help
> answer any questions that arise.
>
> One of the most confusing symptoms of the problem is that performance
> can change from run to run, and most especially it can change after
> rebooting.  It turns out the run-to-run differences are based on what
> type of IO brought each executable page into memory.
>
> When portions of an executable file (including shared libs) are read or
> written using "normal IO" such as read(2), write(2), etc -- calls that
> end up in ffs_read() and ffs_write() -- a kernel-writable mapping for
> the pages is made before the physical IO is initiated, and that mapping
> stays in place and icache is disabled on those pages as long as the
> buffer remains in the cache (which for something like libc means
> forever).
>
> When pages are mapped as executable with mmap(2) and then the IO is done
> via demand paging when the pages are accessed, a temporary kernel
> writable mapping is made for the duration of the IO operation and then
> is removed again when the physical IO is completed (leaving just a
> read/execute mapping).  When the last writable mapping is removed the
> icache bit is restored on the page.
>
> (Semi-germane aside: the aio routines appear to work like the pager IO,
> making a temporary writable kva mapping only for the duration of the
> physical IO.)
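
The caching rule and the two IO paths described above can be modelled in
a few lines of C.  This is a toy sketch only: struct page_model and the
helper functions are hypothetical names invented for illustration, not
the FreeBSD pmap code.  It assumes the rule exactly as summarized above,
namely that caching is off while a page has more than one mapping and at
least one of them is writable.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one physical page on a VIVT-cache CPU (hypothetical). */
struct page_model {
    int ro_mappings;    /* read-only or read/execute mappings */
    int rw_mappings;    /* writable mappings, kernel ones included */
};

/* Caching stays on unless there are multiple mappings and one is writable. */
static bool
page_cacheable(const struct page_model *pg)
{
    int total = pg->ro_mappings + pg->rw_mappings;

    return !(total > 1 && pg->rw_mappings > 0);
}

/* A process maps the page read/execute, as mmap(2) of a library would. */
static void
map_executable(struct page_model *pg)
{
    pg->ro_mappings++;
}

/* "Normal IO" path: the kernel-writable buffer mapping stays in place. */
static void
buffer_cache_read(struct page_model *pg)
{
    pg->rw_mappings++;
}

/* Pager-style path: the writable mapping exists only during the IO. */
static void
pager_read(struct page_model *pg)
{
    pg->rw_mappings++;          /* temporary kernel-writable mapping */
    /* ... the physical IO would happen here ... */
    assert(pg->rw_mappings > 0);
    pg->rw_mappings--;          /* removed again once the IO completes */
}
```

In this model a page that is mapped executable and filled by pager_read()
ends up cacheable, while a page filled by buffer_cache_read() and then
mapped executable stays uncacheable for as long as the buffer mapping
persists.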
>
> The cause of the variability in symptoms is a race between the two types
> of IO that happens when shared libs are loaded.  The race is kicked off
> by libexec/rtld-elf/map_object.c; it uses pread(2) to load the first 4K
> of a file to read the headers so that it can mmap() the file as needed.
> The pread() eventually lands in ffs_read() which decides to do a cluster
> read or normal read-ahead.  Usually the read-ahead IO gets the blocks
> into the buffer cache (and thus disables icache on all those pages)
> before map_object() gets much work done, so the first part of a shared
> library usually ends up icache-disabled.  If it's a small shared lib the
> whole library may end up icache-disabled due to read-ahead.
>
> Other times it appears that map_object() gets the pages mapped and
> triggers demand-paging IO which completes before the readahead IO
> initiated by the first 4K read, and in those cases the icache bit on the
> pages gets turned back on when the temporary kernel mappings are
> unmapped.
>
> So when cluster or read-ahead IO wins the race, app performance is bad
> until the next reboot or filesystem unmount or something else pushes
> those blocks out of the buffer cache (which never happens on our
> embedded systems).  How badly the app performs depends on what shared
> libs it uses and the results of the races as each lib was loaded.  When
> some demand-paging IO completes before the corresponding read-ahead IO
> for the blocks at the start of a library, it seems to cause any further
> read-ahead to stop as I remember it, so the app doesn't take such a big
> performance hit, sometimes hardly any hit at all.
>
> In addition to the races on loading shared libs, doing "normal IO
> things" to executable files and libs, such as compiling a new copy or
> using cp or even 'cat app >/dev/null' which I think came up in the
> original thread, will cause that app to execute without icache on its
> pages until its blocks are pushed out of the buffer cache.
>
> Here's where I have to be extra-hand-wavy... I think the right way to
> fix this is to make ffs_read/write (or maybe I guess all vop_read and
> vop_write implementations) work more like aio and pager io in the sense
> that they should make a temporary kva mapping that lasts only as long as
> it takes to do the physical IO and associated uio operations.  I vaguely
> remember thinking that the place to make that happen was along the lines
> of doing the mapping in getblk() (or maybe breada()?) and unmapping it
> in bdone(), but I was quite frankly lost in that twisty maze of code and
> never felt like I understood it well enough to even make an experimental
> stab at such changes.
>
> I have two patches related to this stuff.  They were generated from 8.2
> sources but I've confirmed that they apply properly to -current.
>
> One patch modifies map_object.c to use mmap()+memcpy() instead of
> pread().  I think it's a useful enhancement even without its effect on
> this icache problem, because it seems to me that doing a readahead on a
> shared library will bring in pages that may never be referenced and
> wouldn't have required any physical memory or IO resources if the
> readahead hadn't happened.
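
For readers who want to try the mmap()+memcpy() idea outside rtld, here
is a minimal userland sketch.  read_header_mmap() is a hypothetical name,
not a function from rtld-elf.  Unlike the patch quoted further down, the
sketch passes MAP_PRIVATE, compares against MAP_FAILED, and clamps the
mapping length to the file size so that a short file is never touched
past its end; those are choices made for this sketch, not part of the
patch.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy up to buflen bytes from the start of fd into buf through a
 * short-lived read-only mapping instead of read(2)/pread(2).
 * Returns the number of bytes copied, or -1 on error.
 */
static ssize_t
read_header_mmap(int fd, void *buf, size_t buflen)
{
    struct stat st;
    size_t len;
    void *mapped;

    assert(buf != NULL);
    if (fstat(fd, &st) == -1)
        return -1;
    if (st.st_size <= 0)
        return 0;               /* nothing to map in an empty file */

    /* Never map (or copy) more than the file actually contains. */
    len = (size_t)st.st_size < buflen ? (size_t)st.st_size : buflen;

    mapped = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mapped == MAP_FAILED)
        return -1;
    memcpy(buf, mapped, len);   /* faulting the page in does the IO */
    munmap(mapped, len);
    return (ssize_t)len;
}
```

A caller would open the file, pass a PAGE_SIZE buffer, and then check
that the returned length is at least sizeof(Elf_Ehdr) before looking at
the header, much as get_elf_header() does.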
>
> The other is a pure hack-workaround that's most helpful when you're
> developing code for an arm platform.  It forces on the O_DIRECT flag in
> ffs_write() (and optionally ffs_read() but that's disabled by default)
> for executable files, to keep the blocks out of the buffer cache when
> doing normal IO stuff.  It's ugly brute force, but it's good enough to
> let us develop and deploy embedded systems code using FreeBSD 8.2.  This
> is not to be committed, this is just a workaround that let us start
> using 8.2 before finding a real fix to the root problem.  Anyone else
> trying to work with 8.0 or later on VIVT-cache arm chips might find it
> useful until a proper fix is developed.
>
> -- Ian
>
> --- sys/ufs/ffs/ffs_vnops.c	Thu Jun 16 14:43:20 2011 -0600
> +++ sys/ufs/ffs/ffs_vnops.c	Mon Jan 30 17:54:44 2012 -0700
> @@ -467,6 +467,18 @@ ffs_read(ap)
> 	seqcount = ap->a_ioflag >> IO_SEQSHIFT;
> 	ip = VTOI(vp);
>
> +	// This hack ensures that executable code never ends up in the buffer cache.
> +	// It is currently disabled.
> +	// It helps work around disabled-icache due to kernel-writable mappings.
> +	// However, shell script files are executable and caching them is useful, so
> +	// this is disabled for now.  With the rtld-elf mmap() patch in place,
> +	// nothing normally ever calls read on an executable file so this code
> +	// doesn't buy us much.
> +#if 0 && defined(__arm__)
> +	if (vp->v_type == VREG && ip->i_mode & IEXEC)
> +		ioflag |= IO_DIRECT;
> +#endif
> +
> #ifdef INVARIANTS
> 	if (uio->uio_rw != UIO_READ)
> 		panic("ffs_read: mode");
> @@ -670,6 +682,17 @@ ffs_write(ap)
> 	seqcount = ap->a_ioflag >> IO_SEQSHIFT;
> 	ip = VTOI(vp);
>
> +	// This hack ensures that executable code never ends up in the buffer cache.
> +	// It helps work around disabled-icache due to kernel-writable mappings.
> +	// On a deployed production system, nothing normally ever calls write() on
> +	// an executable file.  This hack exists to allow development on the system
> +	// (so that you can do things like copy a new executable onto the system
> +	// without having that destroy performance on subsequent runs).
> +#if defined(__arm__)
> +	if (vp->v_type == VREG && ip->i_mode & IEXEC)
> +		ioflag |= IO_DIRECT;
> +#endif
> +
> #ifdef INVARIANTS
> 	if (uio->uio_rw != UIO_WRITE)
> 		panic("ffs_write: mode");
> diff -r 0cb0be36b70f libexec/rtld-elf/map_object.c
> --- libexec/rtld-elf/map_object.c	Thu Jun 16 14:43:20 2011 -0600
> +++ libexec/rtld-elf/map_object.c	Mon Jan 30 20:03:45 2012 -0700
> @@ -272,11 +272,16 @@ get_elf_header (int fd, const char *path
> 	char buf[PAGE_SIZE];
>     } u;
>     ssize_t nbytes;
> +    void *mapped;
>
> -    if ((nbytes = pread(fd, u.buf, PAGE_SIZE, 0)) == -1) {
> -	_rtld_error("%s: read error: %s", path, strerror(errno));
> +    /* Use mmap() + memcpy() rather than [p]read() to avoid readahead. */
> +    nbytes = sizeof(u.buf);
> +    if ((mapped = mmap(NULL, nbytes, PROT_READ, 0, fd, 0)) == (caddr_t) -1) {
> +	_rtld_error("%s: mmap of header failed: %s", path, strerror(errno));
> 	return NULL;
>     }
> +    memcpy(u.buf, mapped, nbytes);
> +    munmap(mapped, nbytes);
>
>     /* Make sure the file is valid */
>     if (nbytes < (ssize_t)sizeof(Elf_Ehdr) || !IS_ELF(u.hdr)) {
> _______________________________________________
> freebsd-arm@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-arm
> To unsubscribe, send any mail to "freebsd-arm-unsubscribe@freebsd.org"



