Date:      Tue, 31 May 2011 01:25:03 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        mdf@FreeBSD.org
Cc:        svn-src-head@FreeBSD.org, Pieter de Goeje <pieter@degoeje.nl>, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org
Subject:   Re: svn commit: r221853 - in head/sys: dev/md dev/null sys vm
Message-ID:  <20110531004247.C4034@besplex.bde.org>
In-Reply-To: <BANLkTimJY65boMPhnnT344cmwRUJ0Z=dSQ@mail.gmail.com>
References:  <201105131848.p4DIm1j7079495@svn.freebsd.org> <201105282103.43370.pieter@degoeje.nl> <BANLkTimJY65boMPhnnT344cmwRUJ0Z=dSQ@mail.gmail.com>

On Sat, 28 May 2011 mdf@FreeBSD.org wrote:

> On Sat, May 28, 2011 at 12:03 PM, Pieter de Goeje <pieter@degoeje.nl> wrote:
>> On Friday 13 May 2011 20:48:01 Matthew D Fleming wrote:
>>> Author: mdf
>>> Date: Fri May 13 18:48:00 2011
>>> New Revision: 221853
>>> URL: http://svn.freebsd.org/changeset/base/221853
>>>
>>> Log:
>>>   Use a globally visible region of zeros for both /dev/zero and the md
>>>   device.  There are likely other kernel uses of "blob of zeros" that can
>>>   be converted.
>>>
>>>   Reviewed by:        alc
>>>   MFC after:  1 week
>>
>> This change seems to reduce /dev/zero performance by 68% as measured by this
>> command: dd if=/dev/zero of=/dev/null bs=64k count=100000.
>>
>> x dd-8-stable
>> + dd-9-current
>> +-------------------------------------------------------------------------+
>> |+                                                                        |

Argh, hard \xa0.

[...binary garbage deleted]

>> This particular measurement was against 8-stable but the results are the same
>> for -current just before this commit. Basically throughput drops from
>> ~13GB/sec to 4GB/sec.
>>
>> Hardware is a Phenom II X4 945 with 8GB of 800MHz DDR2 memory. FreeBSD/amd64
>> is installed. This processor has 6MB of L3 cache.
>>
>> To me it looks like it's not able to cache the zeroes anymore. Is this
>> intentional? I tried to change ZERO_REGION_SIZE back to 64K but that didn't
>> help.
>
> Hmm.  I don't have access to my FreeBSD box over the weekend, but I'll
> run this on my box when I get back to work.
>
> Meanwhile you could try setting ZERO_REGION_SIZE to PAGE_SIZE and I
> think that will restore things to the original performance.

Using /dev/zero always thrashes caches by the amount <source buffer
size> + <target buffer size> (unless the arch uses nontemporal memory
accesses for uiomove, which none do AFAIK).  So a large source buffer
is always just a pessimization.  A large target buffer size is also a
pessimization, but for the target buffer a fairly large size is needed
to amortize the large syscall costs.  In this PR, the target buffer
size is 64K.  ZERO_REGION_SIZE is 64K on i386 and 2M on amd64.  64K+64K
on i386 is good for thrashing the L1 cache.  It will only have a
noticeable impact on a current L2 cache in competition with other
threads.  It is hard to fit everything in the L1 cache even with
non-bloated buffer sizes and 1 thread (16 for the source (I)cache, 0
for the source (D)cache and 4K for the target cache might work).  On
amd64, 2M+2M is good for thrashing most L2 caches.  In this PR, the
thrashing is limited by the target buffer size to about 64K+64K, up
from 4K+64K, and it is marginal whether the extra thrashing from the
larger source buffer makes much difference.
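
The footprint arithmetic above can be sketched roughly as follows.  This
is only an illustration: zero_read_footprint() and min_size() are
hypothetical names, and the sketch assumes that one read() touches at
most min(source size, target size) bytes of the zero region plus the
whole target buffer.

```c
#include <stddef.h>

/* Hypothetical helper, not a kernel function. */
static size_t
min_size(size_t a, size_t b)
{
	return (a < b ? a : b);
}

/*
 * Approximate number of bytes of cache touched by one read() of
 * /dev/zero.  Assumption: the copy uses at most min(source, target)
 * bytes of the source region, plus the entire target buffer.
 */
static size_t
zero_read_footprint(size_t source_size, size_t target_size)
{
	return (min_size(source_size, target_size) + target_size);
}
```

With a 64K target this gives 4K+64K for a PAGE_SIZE (4K) source and
64K+64K for a 2M source, matching the figures above.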

The old zbuf source buffer size of PAGE_SIZE was already too large.
The source buffer size only needs to be large enough to amortize
loop overhead.  1 cache line is enough in most cases.  uiomove()
and copyout() unfortunately don't support copying from register
space, so there must be a source buffer.  This may limit the bandwidth
by a factor of 2 in some cases, since most modern CPUs can execute
either 2 64-bit stores or 1 64-bit store and 1 64-bit load per cycle
if everything is already in the L1 cache.  However, target buffers
for /dev/zero (or any user i/o) probably need to be larger than the
L1 cache to amortize the syscall overhead, so there are usually plenty
of cycles to spare for the unnecessary loads while the stores wait for
caches.
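
As a userland analogy only (the kernel path goes through uiomove(), not
memcpy()), a zero fill from a source buffer of one cache line might look
like the sketch below.  SRC_SIZE, zero_src and fill_from_small_source()
are hypothetical names, and 64 bytes is an assumed cache line size.

```c
#include <stddef.h>
#include <string.h>

#define	SRC_SIZE	64	/* assumed cache line size (hypothetical) */

/* Static storage is zero-initialized, so this is a tiny "zero region". */
static const char zero_src[SRC_SIZE];

/*
 * Fill dst with len zero bytes by copying the small source repeatedly.
 * The source stays in one cache line; only the target grows.
 */
static void
fill_from_small_source(char *dst, size_t len)
{
	while (len > 0) {
		size_t n = len < SRC_SIZE ? len : SRC_SIZE;

		memcpy(dst, zero_src, n);
		dst += n;
		len -= n;
	}
}
```

The loop overhead is paid once per SRC_SIZE bytes, which is the cost the
source buffer size has to amortize.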

This behaviour is easy to see for regular files too (regular files get
copied out from the buffer cache).  You have limited control over the
amount of thrashing by changing the target buffer size, and can determine
cache sizes by looking at throughputs.
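
A minimal sketch of such a measurement, assuming a POSIX system with
/dev/zero and clock_gettime().  zero_read_throughput() is a hypothetical
helper, not an existing tool; it does roughly what the dd command quoted
above does for one target buffer size.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/*
 * Read up to `count` blocks of `bufsize` bytes from /dev/zero and return
 * the observed throughput in bytes per second, or -1.0 on error.
 */
static double
zero_read_throughput(size_t bufsize, int count)
{
	struct timespec t0, t1;
	char *buf;
	double secs;
	size_t total = 0;
	int fd, i;

	assert(bufsize > 0 && count > 0);
	buf = malloc(bufsize);
	if (buf == NULL)
		return (-1.0);
	fd = open("/dev/zero", O_RDONLY);
	if (fd < 0) {
		free(buf);
		return (-1.0);
	}
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < count; i++) {
		ssize_t n = read(fd, buf, bufsize);

		if (n <= 0)
			break;		/* error or unexpected EOF */
		total += (size_t)n;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	close(fd);
	free(buf);
	secs = (double)(t1.tv_sec - t0.tv_sec) +
	    (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	if (total == 0 || secs <= 0.0)
		return (-1.0);
	return ((double)total / secs);
}
```

Calling this for target sizes from, say, 4K up to several megabytes and
comparing the results should show throughput steps near the cache sizes,
although the exact numbers depend on the machine and on what else is
running.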

Bruce


