Date: Sun, 19 Feb 2006 13:07:10 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Andrew Gallatin <gallatin@cs.duke.edu>
Cc: freebsd-amd64@FreeBSD.org
Subject: Re: non-temporal copyin/copyout?
Message-ID: <20060219115807.F99158@epsplex.bde.org>
In-Reply-To: <17399.39290.13815.777894@grasshopper.cs.duke.edu>
References: <17397.58669.457047.277510@grasshopper.cs.duke.edu> <20060218232213.F59482@delplex.bde.org> <17399.39290.13815.777894@grasshopper.cs.duke.edu>
On Sat, 18 Feb 2006, Andrew Gallatin wrote:

> Bruce Evans writes:
> > > A quick test in userspace shows that for large copies, an adapted
> > > pagecopy (from amd64/amd64/support.S) more than doubles bcopy
> > > bandwidth from 1.2GB/s to 2.5GB/s on my Athlon64 X2 3800+.
> >
> > Is this with 5+GHz memory or with slower memory with the source
> > cached?  I've seen 1.7GB/s in non-quick tests in user space with
> > PC3200 memory overclocked slightly.  This is almost twice as fast as
> > using the best nontemporal copy method (which gives 0.9GB/s on the
> > same machine).
>
> This is a "DFI Lanparty UTnF4 Ultra-D" with an Nforce 4 chipset, and 2
> 256 MB sticks of PC3200 ram.  The timings I mention above closely
> match the lmbench "bcopy" benchmark for large buffers (> L2 cache)
> when run on FreeBSD vs when run on Solaris (which uses a non-temporal
> bcopy even in userspace).

The 2.5GB/s is for the source already cached then.

I'm familiar with lmbench.  At least in lmbench2, the default pipe
benchmark goes at nearly main memory bandwidth on FreeBSD (because the
source is cached and half of the copying is virtual), and the default
bcopy benchmark goes at about 1/4 of the main memory bandwidth (wasting
half the bandwidth) because it is too big for the (L2) cache but the
cache is used.

> <....>
>
> > With the Athlon64 behaviour, I think nontemporal copies should only
> > be used in cases where it is known that the copies really are
> > nontemporal.  We use them for page copying now because this is
> > (almost) known.  For copyout(), it would be certainly known only for
> > copies that are so large that they can't fit in the L2 cache.
> > copyin() might be different, since it might often be known that the
> > data will be DMA'ed out by a driver and need never be cached.
>
> I think you could make arguments for doing a non-temporal copy for
> both copyin and copyout when the size exceeds some tunable threshold.
> Solaris even uses a fixed threshold, and I believe the threshold is
> quite small (128 bytes).  See
> http://cvs.opensolaris.org/source/xref/on/usr/src/uts/intel/ia32/ml/copy.s

Hmm, that seems far too small.  You could make it a sysctl tunable.

> Maybe I'm being naive, but I would assume that most bulk data, both
> copied in and copied out, should never be accessed by the kernel in a
> high performance system.  Most Gigabit or better, and many 100Mb
> network drivers do checksum offloading on both send and receive, so
> there is no need for the kernel to touch any data which is copied in
> or out for network sends or receives.  Further, I can imagine a
> network server (like a userspace nfs server or samba) turning around
> and writing data to disk which it received via a socket read without
> ever looking at the buffer.
>
> I don't know the storage system as well as the networking system, but
> unless a disk driver is using PIO, I don't think the data is ever
> touched by the kernel.

read()/write() to disk files still always goes through the buffer cache
and uses uiomove() and thus copyin/out() to get there.
Thus the best method of reading from a socket and write(2)ing to a disk
is almost certainly to use a buffer small enough to fit several times in
the L2 cache and stay there, with a temporal copyout() but a nontemporal
copyin(), so that the copyout() from the socket buffer prepares for the
copyin() soon rereading the data, but the copyin() doesn't prepare for
rereading (since the disk driver should use DMA and not do the write for
~30 seconds anyway, and it is not expected that the data be otherwise
read from the buffer cache).

If the application writes the data using m[un]map() and doesn't access
it directly, then nontemporal copyout()s seem to be better than temporal
ones.  Even if the vm system copies the data later (I think it doesn't),
the data is likely to have gone out of the L2 cache (if the copyout()
put it there) by the time vm gets around to writing it.

Aren't you supposed to use ZERO_COPY_SOCKETS to avoid all copying for
socket buffers?

Bruce
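
For reference, a rough userland sketch of the kind of size-thresholded
non-temporal copy discussed above.  It uses SSE2 streaming stores via
compiler intrinsics rather than the movnti assembly in
amd64/amd64/support.S, and the function name, the 128kB cutoff and the
alignment fallback are illustrative only -- the cutoff stands in for the
tunable (e.g. sysctl) threshold under discussion and is not what any of
the kernels mentioned actually use.

#include <emmintrin.h>	/* SSE2 intrinsics: _mm_stream_si128 et al. */
#include <stdint.h>
#include <string.h>

#define NT_THRESHOLD	(128 * 1024)	/* illustrative; should be tunable */

void
nt_bcopy(const void *src, void *dst, size_t len)
{
	const __m128i *s;
	__m128i *d;
	size_t i, n;

	/* Small or misaligned copies: take the ordinary cached path. */
	if (len < NT_THRESHOLD || ((uintptr_t)dst & 15) != 0) {
		memcpy(dst, src, len);
		return;
	}

	s = (const __m128i *)src;
	d = (__m128i *)dst;
	n = len / 16;

	/*
	 * Streaming (non-temporal) stores bypass the cache, so a huge
	 * copy does not evict the working set of whatever runs next.
	 */
	for (i = 0; i < n; i++)
		_mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));
	_mm_sfence();		/* streaming stores are weakly ordered */

	/* Copy any tail bytes normally. */
	memcpy((char *)dst + n * 16, (const char *)src + n * 16, len % 16);
}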
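
And a minimal sketch of the socket-to-disk loop described near the end
of the message.  The drain_to_disk() name and the 64kB buffer are
hypothetical; the only point is that the buffer is several times smaller
than a typical L2, so the copyout() from the socket buffer leaves the
data cached for the write() that copies it straight back in.

#include <sys/types.h>
#include <unistd.h>

#define BUFSZ	(64 * 1024)	/* several times smaller than a typical L2 */

ssize_t
drain_to_disk(int sock, int fd)
{
	static char buf[BUFSZ];
	ssize_t n, total;

	total = 0;
	while ((n = read(sock, buf, sizeof(buf))) > 0) {
		/*
		 * The copyout() that filled buf should have left it in
		 * L2, so the copyin() done by write() rereads it cheaply.
		 */
		if (write(fd, buf, (size_t)n) != n)
			return (-1);
		total += n;
	}
	return (n < 0 ? -1 : total);
}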
