Date: Mon, 30 May 2011 08:49:15 -0700
From: mdf@FreeBSD.org
To: Bruce Evans <brde@optusnet.com.au>
Cc: svn-src-head@freebsd.org, Pieter de Goeje <pieter@degoeje.nl>, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject: Re: svn commit: r221853 - in head/sys: dev/md dev/null sys vm
Message-ID: <BANLkTiknnuoC1hU6YD5H%2BSmCU1zP3zrv1A@mail.gmail.com>
In-Reply-To: <20110531004247.C4034@besplex.bde.org>
References: <201105131848.p4DIm1j7079495@svn.freebsd.org> <201105282103.43370.pieter@degoeje.nl> <BANLkTimJY65boMPhnnT344cmwRUJ0Z=dSQ@mail.gmail.com> <20110531004247.C4034@besplex.bde.org>
On Mon, May 30, 2011 at 8:25 AM, Bruce Evans <brde@optusnet.com.au> wrote:
> On Sat, 28 May 2011 mdf@FreeBSD.org wrote:
>
>> On Sat, May 28, 2011 at 12:03 PM, Pieter de Goeje <pieter@degoeje.nl>
>> wrote:
>>>
>>> On Friday 13 May 2011 20:48:01 Matthew D Fleming wrote:
>>>>
>>>> Author: mdf
>>>> Date: Fri May 13 18:48:00 2011
>>>> New Revision: 221853
>>>> URL: http://svn.freebsd.org/changeset/base/221853
>>>>
>>>> Log:
>>>>   Use a globally visible region of zeros for both /dev/zero and the md
>>>>   device.  There are likely other kernel uses of "blob of zeros" that
>>>>   can be converted.
>>>>
>>>>   Reviewed by:    alc
>>>>   MFC after:      1 week
>>>
>>> This change seems to reduce /dev/zero performance by 68% as measured
>>> by this command: dd if=/dev/zero of=/dev/null bs=64k count=100000.
>>>
>>> x dd-8-stable
>>> + dd-9-current
>>>
>>> [ministat plot omitted]
>
> Argh, hard \xa0.
>
> [...binary garbage deleted]
>
>>> This particular measurement was against 8-stable but the results are
>>> the same for -current just before this commit.  Basically throughput
>>> drops from ~13GB/sec to 4GB/sec.
>>>
>>> Hardware is a Phenom II X4 945 with 8GB of 800MHz DDR2 memory.
>>> FreeBSD/amd64 is installed.  This processor has 6MB of L3 cache.
>>>
>>> To me it looks like it's not able to cache the zeroes anymore.  Is
>>> this intentional?  I tried to change ZERO_REGION_SIZE back to 64K but
>>> that didn't help.
>>
>> Hmm.  I don't have access to my FreeBSD box over the weekend, but I'll
>> run this on my box when I get back to work.
>>
>> Meanwhile you could try setting ZERO_REGION_SIZE to PAGE_SIZE and I
>> think that will restore things to the original performance.
>
> Using /dev/zero always thrashes caches by the amount <source buffer
> size> + <target buffer size> (unless the arch uses nontemporal memory
> accesses for uiomove, which none do AFAIK).  So a large source buffer
> is always just a pessimization.  A large target buffer size is also a
> pessimization, but for the target buffer a fairly large size is needed
> to amortize the large syscall costs.  In this PR, the target buffer
> size is 64K.  ZERO_REGION_SIZE is 64K on i386 and 2M on amd64.  64K+64K
> on i386 is good for thrashing the L1 cache.

That depends -- is the cache virtually or physically addressed?  The
zero_region only has 4k (PAGE_SIZE) of unique physical addresses, so
if the cache is physically addressed, most of the cache thrashing is
due to the user-space buffer.  (A sketch of how the region can be
backed by a single physical page is in the P.S. below.)

> It will only have a
> noticeable impact on a current L2 cache in competition with other
> threads.  It is hard to fit everything in the L1 cache even with
> non-bloated buffer sizes and 1 thread (16 for the source (I)cache, 0
> for the source (D)cache and 4K for the target cache might work).  On
> amd64, 2M+2M is good for thrashing most L2 caches.  In this PR, the
> thrashing is limited by the target buffer size to about 64K+64K, up
> from 4K+64K, and it is marginal whether the extra thrashing from the
> larger source buffer makes much difference.
>
> The old zbuf source buffer size of PAGE_SIZE was already too large.

Wouldn't this depend on how far down from the use of the buffer the
actual copy happens?

Another advantage to a large virtual buffer is that it reduces the
number of times the copy loop in uiomove has to return up to the
device layer that initiated the copy.  This is all pretty fast, but
again, assuming a physically addressed cache, fewer trips is better.
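For concreteness, this is roughly the shape the read-side loop takes
with a shared zero region (a sketch using the zero_region /
ZERO_REGION_SIZE names from the commit, not necessarily the committed
code verbatim):

    /*
     * Hand out zeros by bouncing uiomove() off the shared read-only
     * zero region.  Each uiomove() call copies at most
     * ZERO_REGION_SIZE bytes, so a larger region means fewer trips
     * around this loop, and fewer returns to the caller, per read(2).
     */
    static int
    zero_read(struct cdev *dev __unused, struct uio *uio,
        int flags __unused)
    {
            void *zbuf;
            ssize_t len;
            int error = 0;

            zbuf = __DECONST(void *, zero_region);
            while (uio->uio_resid > 0 && error == 0) {
                    len = uio->uio_resid;
                    if (len > ZERO_REGION_SIZE)
                            len = ZERO_REGION_SIZE;
                    error = uiomove(zbuf, len, uio);
            }
            return (error);
    }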
Thanks,
matthew

> The source buffer size only needs to be large enough to amortize
> loop overhead.  1 cache line is enough in most cases.  uiomove()
> and copyout() unfortunately don't support copying from register
> space, so there must be a source buffer.  This may limit the bandwidth
> by a factor of 2 in some cases, since most modern CPUs can execute
> either 2 64-bit stores or 1 64-bit store and 1 64-bit load per cycle
> if everything is already in the L1 cache.  However, target buffers
> for /dev/zero (or any user i/o) probably need to be larger than the
> L1 cache to amortize the syscall overhead, so there are usually plenty
> of cycles to spare for the unnecessary loads while the stores wait for
> caches.
>
> This behaviour is easy to see for regular files too (regular files get
> copied out from the buffer cache).  You have limited control on the
> amount of thrashing by changing the target buffer size, and can
> determine cache sizes by looking at throughputs.
>
> Bruce
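P.S. On the point above about unique physical addresses: a simplified
sketch of how a zero region can be set up by wiring one zeroed
physical page and mapping it repeatedly across the virtual range
(abridged and illustrative; the exact VM calls in the tree may
differ):

    /*
     * Back ZERO_REGION_SIZE bytes of kernel VA with a single zeroed,
     * wired physical page, then make the range read-only.  The region
     * spans ZERO_REGION_SIZE virtual addresses but only PAGE_SIZE of
     * unique physical addresses, so a physically addressed cache sees
     * only one page's worth of lines.
     */
    static void
    kmem_init_zero_region(void)
    {
            vm_offset_t addr, i;
            vm_page_t m;

            addr = kmem_alloc_nofault(kernel_map, ZERO_REGION_SIZE);
            m = vm_page_alloc(NULL, 0,
                VM_ALLOC_NOOBJ | VM_ALLOC_WIRED | VM_ALLOC_ZERO);
            if ((m->flags & PG_ZERO) == 0)
                    pmap_zero_page(m);
            for (i = 0; i < ZERO_REGION_SIZE; i += PAGE_SIZE)
                    pmap_qenter(addr + i, &m, 1);
            vm_map_protect(kernel_map, addr, addr + ZERO_REGION_SIZE,
                VM_PROT_READ, TRUE);

            zero_region = (const void *)addr;
    }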