Date: Mon, 30 Jan 2012 20:31:43 -0700
From: Ian Lepore <freebsd@damnhippie.dyndns.org>
To: freebsd-arm@freebsd.org
Subject: Performance of SheevaPlug on 8-stable
Message-ID: <1327980703.1662.240.camel@revolution.hippie.lan>
--=-qVk9FMQEACnfJqjUCHNo
Content-Type: text/plain
Content-Transfer-Encoding: 7bit

I would like to revive and add to this old topic.  I'm using the original
subject line of the threads from long ago just to help out anyone searching
for info in the future; we ran into the problem on Atmel at91rm9200 hardware
rather than a SheevaPlug.  The original threads are archived here:

http://lists.freebsd.org/pipermail/freebsd-arm/2010-March/002243.html
http://lists.freebsd.org/pipermail/freebsd-arm/2010-November/002635.html

To summarize them...  Ever since 8.0, performance of userland code on arm
systems using a VIVT cache has ranged from bad to unusable, with symptoms
that tend to be hard to nail down definitively.  Much of the evidence
pointed to the instruction cache being disabled on some pages of apps and
shared libraries, some of the time.  Mark Tinguely explained how and why
it's necessary to disable caching on a page when there are multiple
mappings and at least one of them is writable.  There were some patches to
pmap-layer code that had visible effects but never really fixed the
problem.  I don't think anybody ever definitively nailed down why some
executable pages seem to permanently lose their icache enable bit.

I tracked down the cause and developed a workaround (patches attached), but
to really fix the problem I would need a lot of help from VM/VFS gurus.

I apologize in advance for a bit of hand-waving in what follows.  It was
months ago that I was immersed in this problem; now I'm working from a few
notes and a fading memory.  I figured I'd better post before it fades
completely, and hopefully some ensuing discussion will help me remember
more details.  I also still have a couple of sandboxes built with
instrumented code that I could dust off and run, to help answer any
questions that arise.

One of the most confusing symptoms of the problem is that performance can
change from run to run, and most especially it can change after rebooting.
It turns out the run-to-run differences are based on what type of IO
brought each executable page into memory.

When portions of an executable file (including shared libs) are read or
written using "normal IO" such as read(2), write(2), etc -- calls that end
up in ffs_read() and ffs_write() -- a kernel-writable mapping for the pages
is made before the physical IO is initiated, and that mapping stays in
place and icache is disabled on those pages as long as the buffer remains
in the cache (which for something like libc means forever).  When pages are
mapped as executable with mmap(2) and then the IO is done via demand paging
when the pages are accessed, a temporary kernel-writable mapping is made
for the duration of the IO operation and then is removed again when the
physical IO is completed (leaving just a read/execute mapping).  When the
last writable mapping is removed, the icache bit is restored on the page.

(Semi-germane aside: the aio routines appear to work like the pager IO,
making a temporary writable kva mapping only for the duration of the
physical IO.)

The cause of the variability in symptoms is a race between the two types of
IO that happens when shared libs are loaded.  The race is kicked off by
libexec/rtld-elf/map_object.c; it uses pread(2) to load the first 4K of a
file to read the headers so that it can mmap() the file as needed.  The
pread() eventually lands in ffs_read(), which decides to do a cluster read
or normal read-ahead.  Usually the read-ahead IO gets the blocks into the
buffer cache (and thus disables icache on all those pages) before
map_object() gets much work done, so the first part of a shared library
usually ends up icache-disabled.  If it's a small shared lib, the whole
library may end up icache-disabled due to read-ahead.
Other times it appears that map_object() gets the pages mapped and triggers
demand-paging IO which completes before the read-ahead IO initiated by the
first 4K read, and in those cases the icache bit on the pages gets turned
back on when the temporary kernel mappings are unmapped.

So when cluster or read-ahead IO wins the race, app performance is bad
until the next reboot, a filesystem unmount, or something else pushes those
blocks out of the buffer cache (which never happens on our embedded
systems).  How badly the app performs depends on what shared libs it uses
and the results of the races as each lib was loaded.  As I remember it,
when some demand-paging IO completes before the corresponding read-ahead IO
for the blocks at the start of a library, it seems to cause any further
read-ahead to stop, so the app doesn't take such a big performance hit --
sometimes hardly any hit at all.

In addition to the races on loading shared libs, doing "normal IO things"
to executable files and libs -- such as compiling a new copy, using cp, or
even 'cat app >/dev/null' (which I think came up in the original thread) --
will cause that app to execute without icache on its pages until its blocks
are pushed out of the buffer cache.

Here's where I have to be extra hand-wavy...  I think the right way to fix
this is to make ffs_read()/ffs_write() (or maybe all vop_read and vop_write
implementations) work more like aio and pager IO, in the sense that they
should make a temporary kva mapping that lasts only as long as it takes to
do the physical IO and the associated uio operations.  I vaguely remember
thinking that the place to make that happen was along the lines of doing
the mapping in getblk() (or maybe breada()?) and unmapping it in bdone(),
but I was quite frankly lost in that twisty maze of code and never felt I
understood it well enough to even make an experimental stab at such
changes.

I have two patches related to this stuff.
They were generated from 8.2 sources, but I've confirmed that they apply
cleanly to -current.

One patch modifies map_object.c to use mmap()+memcpy() instead of pread().
I think it's a useful enhancement even without its effect on this icache
problem, because it seems to me that doing a read-ahead on a shared library
brings in pages that may never be referenced, and that wouldn't have
required any physical memory or IO resources if the read-ahead hadn't
happened.

The other is a pure hack/workaround that's most helpful when you're
developing code for an arm platform.  It forces on the O_DIRECT flag in
ffs_write() (and optionally ffs_read(), but that part is disabled by
default) for executable files, to keep their blocks out of the buffer cache
when doing normal IO.  It's ugly brute force, but it's good enough to let
us develop and deploy embedded systems code using FreeBSD 8.2.  It is not
meant to be committed; it's just a workaround that let us start using 8.2
before finding a real fix for the root problem.  Anyone else trying to work
with 8.0 or later on VIVT-cache arm chips might find it useful until a
proper fix is developed.

-- Ian

--=-qVk9FMQEACnfJqjUCHNo
Content-Disposition: inline; filename="ffs_vnops_icache_hack.patch"
Content-Type: text/x-patch; name="ffs_vnops_icache_hack.patch"; charset="us-ascii"
Content-Transfer-Encoding: 7bit

--- sys/ufs/ffs/ffs_vnops.c	Thu Jun 16 14:43:20 2011 -0600
+++ sys/ufs/ffs/ffs_vnops.c	Mon Jan 30 17:54:44 2012 -0700
@@ -467,6 +467,18 @@ ffs_read(ap)
 	seqcount = ap->a_ioflag >> IO_SEQSHIFT;
 	ip = VTOI(vp);
 
+	// This hack ensures that executable code never ends up in the buffer cache.
+	// It is currently disabled.
+	// It helps work around disabled-icache due to kernel-writable mappings.
+	// However, shell script files are executable and caching them is useful, so
+	// this is disabled for now.  With the rtld-elf mmap() patch in place,
+	// nothing normally ever calls read on an executable file so this code
+	// doesn't buy us much.
+#if 0 && defined(__arm__)
+	if (vp->v_type == VREG && ip->i_mode & IEXEC)
+		ioflag |= IO_DIRECT;
+#endif
+
 #ifdef INVARIANTS
 	if (uio->uio_rw != UIO_READ)
 		panic("ffs_read: mode");
@@ -670,6 +682,17 @@ ffs_write(ap)
 	seqcount = ap->a_ioflag >> IO_SEQSHIFT;
 	ip = VTOI(vp);
 
+	// This hack ensures that executable code never ends up in the buffer cache.
+	// It helps work around disabled-icache due to kernel-writable mappings.
+	// On a deployed production system, nothing normally ever calls write() on
+	// an executable file.  This hack exists to allow development on the system
+	// (so that you can do things like copy a new executable onto the system
+	// without having that destroy performance on subsequent runs).
+#if defined(__arm__)
+	if (vp->v_type == VREG && ip->i_mode & IEXEC)
+		ioflag |= IO_DIRECT;
+#endif
+
 #ifdef INVARIANTS
 	if (uio->uio_rw != UIO_WRITE)
 		panic("ffs_write: mode");

--=-qVk9FMQEACnfJqjUCHNo
Content-Disposition: inline; filename="rtld-elf-map_object.patch"
Content-Type: text/x-patch; name="rtld-elf-map_object.patch"; charset="us-ascii"
Content-Transfer-Encoding: 7bit

diff -r 0cb0be36b70f libexec/rtld-elf/map_object.c
--- libexec/rtld-elf/map_object.c	Thu Jun 16 14:43:20 2011 -0600
+++ libexec/rtld-elf/map_object.c	Mon Jan 30 20:03:45 2012 -0700
@@ -272,11 +272,16 @@ get_elf_header (int fd, const char *path
 	char buf[PAGE_SIZE];
     } u;
     ssize_t nbytes;
+    void *mapped;
 
-    if ((nbytes = pread(fd, u.buf, PAGE_SIZE, 0)) == -1) {
-	_rtld_error("%s: read error: %s", path, strerror(errno));
+    /* Use mmap() + memcpy() rather than [p]read() to avoid readahead. */
+    nbytes = sizeof(u.buf);
+    if ((mapped = mmap(NULL, nbytes, PROT_READ, MAP_PRIVATE, fd, 0)) == (caddr_t) -1) {
+	_rtld_error("%s: mmap of header failed: %s", path, strerror(errno));
 	return NULL;
     }
+    memcpy(u.buf, mapped, nbytes);
+    munmap(mapped, nbytes);
 
     /* Make sure the file is valid */
     if (nbytes < (ssize_t)sizeof(Elf_Ehdr) || !IS_ELF(u.hdr)) {

--=-qVk9FMQEACnfJqjUCHNo--