From: Yar Tikhiy <yar.tikhiy@gmail.com>
To: freebsd-performance@freebsd.org
Date: Thu, 25 Nov 2010 20:20:35 +1100
Subject: Poor RAID performance demystified

Hi all,

This issue has been raised periodically on various lists and forums,
and I recently ran into it myself, so I feel I should just post my
findings here.

Every now and then somebody complains about extremely poor RAID
performance. What those reports have in common is that they usually
mention FreeBSD and HP RAID controllers, and all of them concern load
patterns produced by PostgreSQL. We are about to see why.

People get surprisingly low disk I/O rates (e.g., 1-2 MB/s), in spite
of the numerous spindles striped in the array, when the benchmark
involves a lot of tiny DB transactions. On the same array, sequential
read and write rates can be more than satisfactory.

That happens because PostgreSQL in its default configuration is
*remarkably* stringent about flushing every transaction out to the
disk before proceeding to the next one. The PG folks know that well.
But, as practice shows, the application flushing its data would not
by itself be enough to make the effect so pronounced. What _might_ be
happening here is that HP RAID adapters, as driven by FreeBSD, fully
honor flush requests all the way down the disk stack, whereas other
popular RAID / OS combos can effectively ignore them to a certain
extent due to latent write-back caching, e.g., in the drives
themselves.

Why does striping fail to speed things up? Because the transactions
are tiny and every disk write ends up blocked waiting for a single
spindle to handle it. No striping can speed up 8K or 16K synchronous
writes: they are seek limited, not bandwidth limited.
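To make the mechanism concrete, here is a minimal sketch of mine (not
part of any PG benchmark; the file names are arbitrary) of what such a
load boils down to: an 8K write followed by fsync(2), over and over.
On an array that honors flushes, its tps should land near RPM / 60 no
matter how many spindles you stripe:

/*
 * synctps.c -- measure how many small synchronous writes per second
 * a disk or array can take.  Each iteration rewrites the same 8K
 * block and forces it to stable storage with fsync(2), so every pass
 * waits for the platter, not for the bus.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int
main(void)
{
	char buf[8192];		/* one PG-sized block */
	time_t t0, t1;
	int fd, i;

	memset(buf, 'x', sizeof(buf));
	fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd == -1) {
		perror("open");
		return (1);
	}
	time(&t0);
	for (i = 0; i < 1000; i++) {
		if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
			perror("write");
			return (1);
		}
		if (fsync(fd) == -1) {	/* force it out to the disk */
			perror("fsync");
			return (1);
		}
		/* rewrite the same block so seeks stay constant */
		if (lseek(fd, 0, SEEK_SET) == -1) {
			perror("lseek");
			return (1);
		}
	}
	time(&t1);
	printf("%d synchronous writes in %ld s => ~%ld tps\n",
	    i, (long)(t1 - t0), (t1 > t0) ? i / (long)(t1 - t0) : 0);
	close(fd);
	unlink("testfile");
	return (0);
}

Compile it with "cc -o synctps synctps.c", run it on the filesystem in
question, and watch gstat in another terminal while it runs.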
(Likewise, no RAID or cache can speed up highly random reads of just a
few blocks each: reads are synchronous by nature, simply because you
cannot know the data before it has been read in.)

It is easy to check whether you are hitting this kind of bottleneck.
While running your benchmark, watch the output from iostat, systat
-vm, or gstat. The average I/O size will closely match the FS block
size (the default is 16K now on FFS), and the tps (transfers per
second) value will be quite close to your disks' RPM rate expressed in
revolutions per second. E.g., with 10K RPM disks you are going to get
10000 / 60 = ~170 tps, and with 15K RPM disks it'll be around 250 tps.
You are up against very basic laws of nature and logic here.

The final question is, of course, what to do about the issue. First of
all, make up your mind whether 150 or 200 write transactions per
second are going to be enough for your task; your actual load pattern
can be quite different from that in the benchmark.

If you still need greater write performance on tiny transactions,
consider getting a battery backup unit (BBU) for your RAID adapter.
Quite remarkably, HP refer to them as "Write-back Cache Enablers"
because installing one is the only way to get an HP RAID adapter to do
write-back caching. A write-back cache with a BBU will let the adapter
delay and coalesce tiny writes without jeopardizing DB integrity.
However, you'll have to trust your BBU, as your DB integrity will be
staked on it (the PG folks are somewhat skeptical about BBUs).

On the other hand, simply fiddling with the PG settings to disable
transaction flushing is a certain recipe for disaster. Fortunately,
there is a trade-off mode in PG where it does transaction coalescing
by itself -- search for synchronous_commit (see the P.S. below for an
example). The downside is that, should the system crash, a few of the
most recent transactions can be lost after they were reported as
successful to the SQL client. That can be acceptable or not depending
on the task, and synchronous_commit can be toggled on a per-session or
per-transaction basis to fine-tune the trade-off.

That's it, folks.

Thanks,
Yar
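P.S. In case it saves somebody a trip to the docs: the
synchronous_commit knob mentioned above can be flipped per session or
per transaction roughly like this (a sketch; double-check against the
docs for your PG version):

    -- relax durability for the current session only
    SET synchronous_commit = off;

    -- or for a single transaction
    BEGIN;
    SET LOCAL synchronous_commit = off;
    -- ... statements whose commits may be coalesced ...
    COMMIT;

The server-wide default in postgresql.conf stays "on", so only the
sessions that explicitly opt in risk losing their last few commits
after a crash.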