From: Alexander Motin <mav@FreeBSD.org>
Date: Sun, 05 Jul 2009 21:51:05 +0300
To: Bruce Evans
Cc: freebsd-arch@freebsd.org
Subject: Re: DFLTPHYS vs MAXPHYS

Bruce Evans wrote:
> On Sun, 5 Jul 2009, Alexander Motin wrote:
>
>> Bruce Evans wrote:
>>> I was thinking more of transfers to userland.  Increasing user buffer
>>> sizes above about half the L2 cache size guarantees busting the L2
>>> cache, if the application actually looks at all of its data.  If the
>>> data is read using read(), then the L2 cache will be busted twice (or
>>> a bit less with nontemporal copying), first by copying out the data
>>> and then by looking at it.  If the data is read using mmap(), then the
>>> L2 cache will only be busted once.  This effect has always been very
>>> noticeable using dd.  Larger buffer sizes are also bad for latency.
>> ...
>> How can I reproduce that dd experiment?  I have my system running with
>> a MAXPHYS of 512K and here is what I have:
>
> I used a regular file with the same size as main memory (1G), and for
> today's test not quite dd, but a program that throws away the data (so
> as to avoid the overhead of write syscalls) and prints status info in a
> more suitable form than even dd's ^T.
>
> Your results show that physio() behaves quite differently from reading
> a regular file.  I see similar behaviour with input from a disk file.
>
>> # dd if=/dev/ada0 of=/dev/null bs=512k count=1000
>> 1000+0 records in
>> 1000+0 records out
>> 524288000 bytes transferred in 2.471564 secs (212128024 bytes/sec)
>
> 512MB would be too small with buffering for a regular file, but should
> be OK with a disk file.
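Bruce's throwaway-reader is not shown in the thread, but a minimal sketch
of the two userland access patterns he describes above might look like the
following.  This is only an illustration, not his actual program; the
default file name, the 64K block size and the argc-based mode switch are
placeholders.

#include <sys/mman.h>
#include <sys/stat.h>
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static unsigned long
sum_bytes(const unsigned char *p, size_t n)
{
	unsigned long s = 0;

	for (size_t i = 0; i < n; i++)
		s += p[i];		/* actually look at every byte */
	return (s);
}

int
main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "testfile"; /* placeholder */
	size_t bs = 64 * 1024;			/* vary like dd's bs= */
	unsigned long sum = 0;
	struct stat st;
	int fd;

	if ((fd = open(path, O_RDONLY)) == -1)
		err(1, "open %s", path);
	if (fstat(fd, &st) == -1)
		err(1, "fstat");

	if (argc > 2) {
		/* mmap(2): the data is touched only once, by sum_bytes(). */
		void *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
		    MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			err(1, "mmap");
		sum = sum_bytes(p, (size_t)st.st_size);
		munmap(p, (size_t)st.st_size);
	} else {
		/*
		 * read(2): the kernel copies each block into buf (first
		 * pass over the cache), then sum_bytes() reads it back
		 * (second pass).
		 */
		unsigned char *buf = malloc(bs);
		ssize_t n;

		if (buf == NULL)
			err(1, "malloc");
		while ((n = read(fd, buf, bs)) > 0)
			sum += sum_bytes(buf, (size_t)n);
		free(buf);
	}
	close(fd);
	printf("checksum %lu, %jd bytes\n", sum, (intmax_t)st.st_size);
	return (0);
}

Run against a file roughly the size of RAM, the read() variant walks every
byte through the cache twice (the copyout plus the checksum loop), while
the mmap() variant touches it only once, which is the effect being
measured.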
>
>> # dd if=/dev/ada0 of=/dev/null bs=256k count=2000
>> 2000+0 records in
>> 2000+0 records out
>> 524288000 bytes transferred in 2.666643 secs (196609752 bytes/sec)
>> # dd if=/dev/ada0 of=/dev/null bs=128k count=4000
>> 4000+0 records in
>> 4000+0 records out
>> 524288000 bytes transferred in 2.759498 secs (189993969 bytes/sec)
>> # dd if=/dev/ada0 of=/dev/null bs=64k count=8000
>> 8000+0 records in
>> 8000+0 records out
>> 524288000 bytes transferred in 2.718900 secs (192830927 bytes/sec)
>>
>> CPU load instead grows from 10% at 512K to 15% at 64K.  Maybe the
>> thrashing effect will only be noticeable with blocks comparable to the
>> cache size, but modern CPUs have megabytes of cache.
>
> I used systat -v to estimate the load.  Its average jumps around more
> than I like, but I don't have anything better.  Sys time from dd and
> others is even more useless than it used to be, since lots of the i/o
> runs in threads and the system doesn't know how to charge the
> application for thread time.
>
> My results (MAXPHYS is 64K, transfer rate 50 MB/s, under FreeBSD-~5.2
> de-geomed):
>
> regular file:
>
> block size  %idle
> ----------  -----
> 1M          87
> 16K         91
> 4K          88 (?)
> 512         72 (?)
>
> disk file:
>
> block size  %idle
> ----------  -----
> 1M          96
> 64K         96
> 32K         93
> 16K         87
> 8K          82 (firmware can't keep up and the rate drops to 37 MB/s)
>
> In the case of the regular file, almost all i/o is clustered, so the
> driver mainly sees the cluster size (driver max size of 64K before
> geom).  The upper layers then do a good job of adding only a few
> percent CPU when declustering to 16K fs-blocks.

In these tests you got almost only the negative side of the effect, as
you said, due to cache misses.  Do you really have a CPU with such a
small L2 cache?  Some kind of P3 or an old Celeron?  But with a 64K
MAXPHYS you just didn't get any benefit from using a bigger block size.

-- 
Alexander Motin
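For completeness, the dd sweep quoted above could also be reproduced with
a small C harness along the following lines.  This is only a sketch under
the thread's assumptions (the /dev/ada0 device, 512MB per pass, 64K-512K
block sizes); it is not the tool either poster used, and, as Bruce notes,
per-process CPU accounting is not a reliable load measure here, so it
reports only wall-clock throughput; %idle still has to come from systat -v.

#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define TOTAL	(512UL * 1024 * 1024)	/* bytes per block size, as in the dd runs */

int
main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/ada0";  /* from the thread */
	size_t sizes[] = { 64 * 1024, 128 * 1024, 256 * 1024, 512 * 1024 };

	for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		size_t bs = sizes[i];
		struct timespec t0, t1;
		unsigned char *buf;
		uint64_t done = 0;
		ssize_t n;
		int fd;

		/* sector-aligned buffer for reads from the raw device */
		if (posix_memalign((void **)&buf, 4096, bs) != 0)
			errx(1, "posix_memalign");
		if ((fd = open(dev, O_RDONLY)) == -1)
			err(1, "open %s", dev);

		clock_gettime(CLOCK_MONOTONIC, &t0);
		while (done < TOTAL && (n = read(fd, buf, bs)) > 0)
			done += (uint64_t)n;
		clock_gettime(CLOCK_MONOTONIC, &t1);

		close(fd);
		free(buf);

		double secs = (t1.tv_sec - t0.tv_sec) +
		    (t1.tv_nsec - t0.tv_nsec) / 1e9;
		printf("bs=%zuK: %ju bytes in %.3f s (%.1f MB/s)\n",
		    bs / 1024, (uintmax_t)done, secs, done / secs / 1e6);
	}
	return (0);
}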