From owner-freebsd-arch@FreeBSD.ORG Sun Jul 5 18:32:15 2009
Date: Mon, 6 Jul 2009 04:32:11 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Alexander Motin
Cc: freebsd-arch@freebsd.org
Subject: Re: DFLTPHYS vs MAXPHYS
Message-ID: <20090706034250.C2240@besplex.bde.org>
In-Reply-To: <4A50DEE8.6080406@FreeBSD.org>
References: <4A4FAA2D.3020409@FreeBSD.org> <20090705100044.4053e2f9@ernst.jennejohn.org>
 <4A50667F.7080608@FreeBSD.org> <20090705223126.I42918@delplex.bde.org>
 <4A50BA9A.9080005@FreeBSD.org> <20090706005851.L1439@besplex.bde.org>
 <4A50DEE8.6080406@FreeBSD.org>

On Sun, 5 Jul 2009, Alexander Motin wrote:

> Bruce Evans wrote:
>> I was thinking more of transfers to userland.  Increasing user buffer
>> sizes above about half the L2 cache size guarantees busting the L2
>> cache, if the application actually looks at all of its data.  If the
>> data is read using read(), then the L2 cache will be busted twice (or
>> a bit less with nontemporal copying), first by copying out the data
>> and then by looking at it.  If the data is read using mmap(), then the
>> L2 cache will only be busted once.  This effect has always been very
>> noticeable using dd.  Larger buffer sizes are also bad for latency.
> ...
> How to reproduce that dd experiment? I have my system running with
> MAXPHYS of 512K and here is what I have:

I used a regular file with the same size as main memory (1G), and for
today's test, not quite dd, but a program that throws away the data (so
as to avoid overhead for write syscalls) and prints status info in a
more suitable form than even dd's ^T.

Your results show that physio() behaves quite differently from reading
a regular file (which involves copying).  I see similar behaviour for
input from a disk file.

> # dd if=/dev/ada0 of=/dev/null bs=512k count=1000
> 1000+0 records in
> 1000+0 records out
> 524288000 bytes transferred in 2.471564 secs (212128024 bytes/sec)

512MB would be too small with buffering for a regular file, but should
be OK with a disk file.
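A minimal sketch of that sort of reader (not the exact program behind
my numbers; the file and block-size arguments here are just for
illustration).  It reads and discards the data and prints the rate, so
there is no write syscall overhead:

/*
 * Minimal sketch of a read-and-discard throughput tester.  Not the
 * exact program used for the numbers in this thread; the file and
 * block size arguments are just for illustration.
 */
#include <sys/time.h>

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    struct timeval start, end;
    ssize_t n;
    size_t bsize;
    double secs, total;
    char *buf;
    int fd;

    if (argc != 3)
        errx(1, "usage: %s file bsize", argv[0]);
    bsize = strtoul(argv[2], NULL, 0);
    if (bsize == 0)
        errx(1, "bad block size");
    if ((buf = malloc(bsize)) == NULL)
        err(1, "malloc");
    if ((fd = open(argv[1], O_RDONLY)) == -1)
        err(1, "open");
    gettimeofday(&start, NULL);
    total = 0;
    /* Read and throw the data away; no write syscalls at all. */
    while ((n = read(fd, buf, bsize)) > 0)
        total += n;
    if (n == -1)
        err(1, "read");
    gettimeofday(&end, NULL);
    secs = (end.tv_sec - start.tv_sec) +
        (end.tv_usec - start.tv_usec) / 1e6;
    printf("%.0f bytes transferred in %.6f secs (%.0f bytes/sec)\n",
        total, secs, total / secs);
    return (0);
}

Run it on a regular file or on the disk device with various block sizes
to compare with dd.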
> # dd if=/dev/ada0 of=/dev/null bs=256k count=2000
> 2000+0 records in
> 2000+0 records out
> 524288000 bytes transferred in 2.666643 secs (196609752 bytes/sec)
> # dd if=/dev/ada0 of=/dev/null bs=128k count=4000
> 4000+0 records in
> 4000+0 records out
> 524288000 bytes transferred in 2.759498 secs (189993969 bytes/sec)
> # dd if=/dev/ada0 of=/dev/null bs=64k count=8000
> 8000+0 records in
> 8000+0 records out
> 524288000 bytes transferred in 2.718900 secs (192830927 bytes/sec)
>
> CPU load instead grows from 10% at 512K to 15% at 64K.  Maybe the
> thrashing effect will only be noticeable at block sizes comparable to
> the cache size, but modern CPUs have megabytes of cache.

I used systat -v to estimate the load.  Its average jumps around more
than I like, but I don't have anything better.  Sys time from dd and
others is even more useless than it used to be, since lots of the i/o
runs in threads and the system doesn't know how to charge the
application for thread time.

My results (MAXPHYS is 64K, transfer rate 50MB/S, under FreeBSD-~5.2
de-geomed):

regular file:

block size    %idle
----------    -----
1M            87
16K           91
4K            88 (?)
512           72 (?)

disk file:

block size    %idle
----------    -----
1M            96
64K           96
32K           93
16K           87
8K            82 (firmware can't keep up and rate drops to 37MB/S)

In the case of the regular file, almost all i/o is clustered, so the
driver sees mainly the cluster size (driver max size of 64K before
geom).  Upper layers then do a good job of only adding a few percent
CPU when declustering to 16K fs-blocks.

In the case of the disk file, I can't explain why the overhead is so
low (~0.5% intr, 3.5% sys) for large block sizes.  Uncached copies on
the test machine go at 850MB/S, so 50MB/S should take about 1/17 of the
CPU, or ~5.9%.  Another difference with the disk file test is that
physio() uses a single pbuf, so the test doesn't thrash the buffer
cache's memory.  dd of a large regular file will thrash the L2 cache
even if the user buffer size is small, but it still goes faster with a
smaller user buffer since the user buffer stays cached.

Faster disks will of course want larger block sizes.  I'm still
surprised that this makes more difference to CPU load than to
throughput.  Maybe it doesn't really, but the measurement becomes
differently accurate when the CPU becomes more loaded.  At 100% load
there would be nowhere to hide things like speculative cache fetches.

Bruce
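PS: for completeness, a minimal sketch of the mmap() variant mentioned
above, for a regular file: nothing is copied out, so the L2 cache is
only busted once when the data is looked at.  It only touches one byte
per page to force each page in (a real consumer would look at all of
the data), and again it is not the exact program behind any of the
numbers above:

/*
 * Minimal sketch of the mmap() variant: no copyout, so the L2 cache is
 * only busted once when the data is looked at.  Touching one byte per
 * page is enough to force each page in; a real consumer would look at
 * all of the data.
 */
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/time.h>

#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    struct stat sb;
    struct timeval start, end;
    volatile char sink;
    double secs;
    char *p;
    off_t off;
    long pagesize;
    int fd;

    if (argc != 2)
        errx(1, "usage: %s file", argv[0]);
    if ((fd = open(argv[1], O_RDONLY)) == -1)
        err(1, "open");
    if (fstat(fd, &sb) == -1)
        err(1, "fstat");
    if (sb.st_size == 0)
        errx(1, "empty file");
    p = mmap(NULL, (size_t)sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        err(1, "mmap");
    pagesize = sysconf(_SC_PAGESIZE);
    gettimeofday(&start, NULL);
    /* Touch one byte per page so every page is actually faulted in. */
    for (off = 0; off < sb.st_size; off += pagesize)
        sink = p[off];
    gettimeofday(&end, NULL);
    (void)sink;
    secs = (end.tv_sec - start.tv_sec) +
        (end.tv_usec - start.tv_usec) / 1e6;
    printf("%jd bytes touched in %.6f secs (%.0f bytes/sec)\n",
        (intmax_t)sb.st_size, secs, (double)sb.st_size / secs);
    return (0);
}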