From: Warner Losh
Date: Mon, 30 Jan 2012 22:39:37 -0700
To: Ian Lepore
Cc: freebsd-arm@FreeBSD.org
Subject: Re: Performance of SheevaPlug on 8-stable
In-Reply-To: <1327980703.1662.240.camel@revolution.hippie.lan>

Hi Ian,

Do you have any data on what 9.0 does?

Warner

On Jan 30, 2012, at 8:31 PM, Ian Lepore wrote:

> I would like to revive and add to this old topic. I'm using the
> original subject line of the threads from long ago just to help out
> anyone searching for info in the future; we ran into the problem on
> Atmel at91rm9200 hardware rather than SheevaPlug.
> The original threads are archived here:
>
> http://lists.freebsd.org/pipermail/freebsd-arm/2010-March/002243.html
> http://lists.freebsd.org/pipermail/freebsd-arm/2010-November/002635.html
>
> To summarize them... Ever since 8.0, performance of userland code on arm
> systems using VIVT cache ranges from bad to unusable, with symptoms that
> tend to be hard to nail down definitively. Much of the evidence pointed
> to the instruction cache being disabled on some pages of apps and shared
> libraries, sometimes. Mark Tinguely explained about how and why it's
> necessary to disable caching on a page when there are multiple mappings
> and at least one is a writable mapping. There were some patches of
> pmap-layer code that had visible effects but never really fixed the
> problem. I don't think anybody ever definitively nailed down why some
> executable pages seem to permanently lose their icache enable bit.
>
> I tracked down the cause and developed a workaround (I'll post patches),
> but to really fix the problem I would need a lot of help from VM/VFS
> gurus.
>
> I apologize in advance for a bit of hand-waving in what follows here.
> It was months ago that I was immersed in this problem; now I'm working
> from a few notes and a fading memory. I figured I'd better just post
> before it fades completely, and hopefully some ensuing discussion will
> help me remember more details. I also still have a couple sandboxes
> built with instrumented code that I could dust off and run with, to help
> answer any questions that arise.
>
> One of the most confusing symptoms of the problem is that performance
> can change from run to run, and most especially it can change after
> rebooting. It turns out the run-to-run differences are based on what
> type of IO brought each executable page into memory.
>
> When portions of an executable file (including shared libs) are read or
> written using "normal IO" such as read(2), write(2), etc -- calls that
> end up in ffs_read() and ffs_write() -- a kernel-writable mapping for
> the pages is made before the physical IO is initiated, and that mapping
> stays in place and icache is disabled on those pages as long as the
> buffer remains in the cache (which for something like libc means
> forever).
>
> When pages are mapped as executable with mmap(2) and then the IO is done
> via demand paging when the pages are accessed, a temporary kernel
> writable mapping is made for the duration of the IO operation and then
> is removed again when the physical IO is completed (leaving just a
> read/execute mapping). When the last writable mapping is removed the
> icache bit is restored on the page.
>
> (Semi-germane aside: the aio routines appear to work like the pager IO,
> making a temporary writable kva mapping only for the duration of the
> physical IO.)
>
> The cause of the variability in symptoms is a race between the two types
> of IO that happens when shared libs are loaded. The race is kicked off
> by libexec/rtld-elf/map_object.c; it uses pread(2) to load the first 4K
> of a file to read the headers so that it can mmap() the file as needed.
> The pread() eventually lands in ffs_read() which decides to do a cluster
> read or normal read-ahead. Usually the read-ahead IO gets the blocks
> into the buffer cache (and thus disables icache on all those pages)
> before map_object() gets much work done, so the first part of a shared
> library usually ends up icache-disabled. If it's a small shared lib the
> whole library may end up icache-disabled due to read-ahead.
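For anyone who wants to poke at the two paths described above from userland, here is a rough sketch, not taken from rtld or from any patch. The helper names head_via_pread() and head_via_mmap() are made up for illustration, and the use of MAP_PRIVATE, MAP_FAILED, and the fstat() length clamp are assumptions of this sketch. Both helpers return the same bytes; the difference is only in which kernel path brings the pages in.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy up to buflen bytes from the start of fd
 * using read-style IO.  On FFS this is the path that lands in
 * ffs_read() and may trigger cluster read or read-ahead.
 */
static ssize_t
head_via_pread(int fd, void *buf, size_t buflen)
{
	return (pread(fd, buf, buflen, 0));
}

/*
 * Hypothetical helper: fetch the same bytes with mmap() + memcpy(),
 * so the data arrives via demand paging rather than read(2).
 * Assumption: the length is clamped to the file size so that
 * memcpy() never touches a page beyond end-of-file.
 */
static ssize_t
head_via_mmap(int fd, void *buf, size_t buflen)
{
	struct stat sb;
	size_t len;
	void *p;

	if (fstat(fd, &sb) == -1)
		return (-1);
	len = sb.st_size < (off_t)buflen ? (size_t)sb.st_size : buflen;
	if (len == 0)
		return (0);

	/* Read-only private mapping of the first len bytes. */
	p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		return (-1);
	memcpy(buf, p, len);
	munmap(p, len);
	return ((ssize_t)len);
}
```

Calling either helper with a 4096-byte buffer on an open shared library should yield identical header bytes. Whether the pages then end up icache-disabled depends on the kernel behavior Ian describes, which this sketch does not measure.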
>
> Other times it appears that map_object() gets the pages mapped and
> triggers demand-paging IO which completes before the readahead IO
> initiated by the first 4K read, and in those cases the icache bit on the
> pages gets turned back on when the temporary kernel mappings are
> unmapped.
>
> So when cluster or read-ahead IO wins the race, app performance is bad
> until the next reboot or filesystem unmount or something else pushes
> those blocks out of the buffer cache (which never happens on our
> embedded systems). How badly the app performs depends on what shared
> libs it uses and the results of the races as each lib was loaded. When
> some demand-paging IO completes before the corresponding read-ahead IO
> for the blocks at the start of a library, it seems to cause any further
> read-ahead to stop as I remember it, so the app doesn't take such a big
> performance hit, sometimes hardly any hit at all.
>
> In addition to the races on loading shared libs, doing "normal IO
> things" to executable files and libs, such as compiling a new copy or
> using cp or even 'cat app >/dev/null' which I think came up in the
> original thread, will cause that app to execute without icache on its
> pages until its blocks are pushed out of the buffer cache.
>
> Here's where I have to be extra-hand-wavy... I think the right way to
> fix this is to make ffs_read/write (or maybe I guess all vop_read and
> vop_write implementations) work more like aio and pager io in the sense
> that they should make a temporary kva mapping that lasts only as long as
> it takes to do the physical IO and associated uio operations. I vaguely
> remember thinking that the place to make that happen was along the lines
> of doing the mapping in getblk() (or maybe breada()?) and unmapping it
> in bdone(), but I was quite frankly lost in that twisty maze of code and
> never felt like I understood it well enough to even make an experimental
> stab at such changes.
>
> I have two patches related to this stuff. They were generated from 8.2
> sources but I've confirmed that they apply properly to -current.
>
> One patch modifies map_object.c to use mmap()+memcpy() instead of
> pread(). I think it's a useful enhancement even without its effect on
> this icache problem, because it seems to me that doing a readahead on a
> shared library will bring in pages that may never be referenced and
> wouldn't have required any physical memory or IO resources if the
> readahead hadn't happened.
>
> The other is a pure hack-workaround that's most helpful when you're
> developing code for an arm platform. It forces on the O_DIRECT flag in
> ffs_write() (and optionally ffs_read() but that's disabled by default)
> for executable files, to keep the blocks out of the buffer cache when
> doing normal IO stuff. It's ugly brute force, but it's good enough to
> let us develop and deploy embedded systems code using FreeBSD 8.2. This
> is not to be committed, this is just a workaround that let us start
> using 8.2 before finding a real fix to the root problem. Anyone else
> trying to work with 8.0 or later on VIVT-cache arm chips might find it
> useful until a proper fix is developed.
>
> -- Ian
>
> --- sys/ufs/ffs/ffs_vnops.c	Thu Jun 16 14:43:20 2011 -0600
> +++ sys/ufs/ffs/ffs_vnops.c	Mon Jan 30 17:54:44 2012 -0700
> @@ -467,6 +467,18 @@ ffs_read(ap)
>  	seqcount = ap->a_ioflag >> IO_SEQSHIFT;
>  	ip = VTOI(vp);
>
> +	// This hack ensures that executable code never ends up in the buffer cache.
> +	// It is currently disabled.
> +	// It helps work around disabled-icache due to kernel-writable mappings.
> +	// However, shell script files are executable and caching them is useful, so
> +	// this is disabled for now. With the rtld-elf mmap() patch in place,
> +	// nothing normally ever calls read on an executable file so this code
> +	// doesn't buy us much.
> +#if 0 && defined(__arm__)
> +	if (vp->v_type == VREG && ip->i_mode & IEXEC)
> +		ioflag |= IO_DIRECT;
> +#endif
> +
>  #ifdef INVARIANTS
>  	if (uio->uio_rw != UIO_READ)
>  		panic("ffs_read: mode");
> @@ -670,6 +682,17 @@ ffs_write(ap)
>  	seqcount = ap->a_ioflag >> IO_SEQSHIFT;
>  	ip = VTOI(vp);
>
> +	// This hack ensures that executable code never ends up in the buffer cache.
> +	// It helps work around disabled-icache due to kernel-writable mappings.
> +	// On a deployed production system, nothing normally ever calls write() on
> +	// an executable file. This hack exists to allow development on the system
> +	// (so that you can do things like copy a new executable onto the system
> +	// without having that destroy performance on subsequent runs).
> +#if defined(__arm__)
> +	if (vp->v_type == VREG && ip->i_mode & IEXEC)
> +		ioflag |= IO_DIRECT;
> +#endif
> +
>  #ifdef INVARIANTS
>  	if (uio->uio_rw != UIO_WRITE)
>  		panic("ffs_write: mode");
> diff -r 0cb0be36b70f libexec/rtld-elf/map_object.c
> --- libexec/rtld-elf/map_object.c	Thu Jun 16 14:43:20 2011 -0600
> +++ libexec/rtld-elf/map_object.c	Mon Jan 30 20:03:45 2012 -0700
> @@ -272,11 +272,16 @@ get_elf_header (int fd, const char *path
>  	char buf[PAGE_SIZE];
>      } u;
>      ssize_t nbytes;
> +    void *mapped;
>
> -    if ((nbytes = pread(fd, u.buf, PAGE_SIZE, 0)) == -1) {
> -	_rtld_error("%s: read error: %s", path, strerror(errno));
> +    /* Use mmap() + memcpy() rather than [p]read() to avoid readahead. */
> +    nbytes = sizeof(u.buf);
> +    if ((mapped = mmap(NULL, nbytes, PROT_READ, 0, fd, 0)) == (caddr_t) -1) {
> +	_rtld_error("%s: mmap of header failed: %s", path, strerror(errno));
>  	return NULL;
>      }
> +    memcpy(u.buf, mapped, nbytes);
> +    munmap(mapped, nbytes);
>
>      /* Make sure the file is valid */
>      if (nbytes < (ssize_t)sizeof(Elf_Ehdr) || !IS_ELF(u.hdr)) {