From owner-freebsd-current Tue Dec 11 21:52:26 2001
Date: Wed, 12 Dec 2001 16:22:05 +1030
From: Greg Lehey
To: Bernd Walter
Cc: Matthew Dillon, Wilko Bulte, Mike Smith, Terry Lambert, Joerg Wunsch,
    freebsd-current@FreeBSD.ORG
Subject: Vinum write performance (was: RAID performance (was: cvs commit:
    src/sys/kern subr_diskmbr.c))
Message-ID: <20011212162205.I82733@monorchid.lemis.com>
In-Reply-To: <20011211031120.G11774@cicely8.cicely.de>
Organization: The FreeBSD Project
Phone: +61-8-8388-8286
Fax: +61-8-8388-8725
Mobile: +61-418-838-708
WWW-Home-Page: http://www.FreeBSD.org/

On Tuesday, 11 December 2001 at 3:11:21 +0100, Bernd Walter wrote:
> On Tue, Dec 11, 2001 at 11:06:33AM +1030, Greg Lehey wrote:
>> On Monday, 10 December 2001 at 10:30:04 -0800, Matthew Dillon wrote:
>>>
>>>>> performance without it - for reading OR writing.  It doesn't matter
>>>>> so much for RAID{1,10}, but it matters a whole lot for something like
>>>>> RAID-5, where the difference between a spindle-synced read or write
>>>>> and a non-spindle-synced read or write can be upwards of 35%.
>>>>
>>>> If you have RAID-5 with I/O sizes that result in full-stripe operations.
>>>
>>> Well, 'more than one disk' operations anyway, for random I/O.  Caching
>>> takes care of sequential I/O reasonably well, but random I/O goes down
>>> the drain for writes if you aren't spindle synced, no matter what
>>> the stripe size,
>>
>> Can you explain this?  I don't see it.  In FreeBSD, just about all I/O
>> goes to buffer cache.
>
> After waiting for the drives and not for vinum parity blocks.
>
>>> and will go down the drain for reads if you cross a stripe -
>>> something that is quite common, I think.
>>
>> I think this is what Mike was referring to when talking about parity
>> calculation.  In any case, going across a stripe boundary is not a
>> good idea, though of course it can't be avoided.  That's one of the
>> arguments for large stripes.
>
> Striped:
> If you have 512 byte stripes and 2 disks, an access of 64 kB is put
> into two 32 kB transactions, one onto each disk.

Only if your software optimizes the transfers.  There are reasons why
it should not.  Without optimization, you get 128 individual
transfers.

> The wait time for the complete transaction is the worst of both,
> which is more than the average of a single disk.

Agreed.

> With spindle synchronisation the access times for both disks are
> believed to be identical, and you get the same as with a single disk.

Correct.
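To make the arithmetic in that exchange concrete, here is a minimal
sketch, in C, of how a 64 kB request maps onto a two-disk stripe set
with 512-byte stripes.  This is not Vinum source; the layout (chunk i
goes to disk i % 2 at offset (i / 2) * 512) and all names are
illustrative assumptions.  Issued naively, the request becomes 128
separate transfers; because each disk's chunks are contiguous on that
disk, they can be coalesced into one 32 kB transfer per disk.

/*
 * Sketch only: map a striped request onto per-disk transfers and
 * count how many are needed with and without coalescing.
 */
#include <stdio.h>

#define STRIPE_SIZE     512             /* bytes per stripe unit (assumed) */
#define NDISKS          2
#define REQUEST_SIZE    (64 * 1024)

int
main(void)
{
        int nchunks = REQUEST_SIZE / STRIPE_SIZE;       /* 128 */
        int transfers[NDISKS] = { 0 };
        long next_offset[NDISKS];
        int disk, i;

        for (disk = 0; disk < NDISKS; disk++)
                next_offset[disk] = -1;                 /* no transfer open yet */

        for (i = 0; i < nchunks; i++) {
                long offset;

                disk = i % NDISKS;                      /* which spindle */
                offset = (long)(i / NDISKS) * STRIPE_SIZE;  /* where on it */
                if (offset != next_offset[disk])
                        transfers[disk]++;              /* not contiguous: new transfer */
                next_offset[disk] = offset + STRIPE_SIZE;
        }

        printf("unoptimized: %d transfers of %d bytes\n", nchunks, STRIPE_SIZE);
        for (disk = 0; disk < NDISKS; disk++)
                printf("coalesced:   disk %d: %d transfer(s), %d bytes total\n",
                    disk, transfers[disk], nchunks / NDISKS * STRIPE_SIZE);
        return (0);
}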
> Linear speed could be about twice the speed of a single drive.  But
> this is more theoretical today than real.  The average transaction
> size per disk decreases with a growing number of spindles, and you
> get more transaction overhead.  Also, the voice coil technology used
> in drives for many years now adds a random amount of time to the
> access time, which invalidates some of the spindle sync potential.
> Plus it may break some benefits of precaching mechanisms in drives.
> I'm almost sure there is no real performance gain with modern drives.

The real problem with this scenario is that you're missing a couple
of points:

1.  Typically it's not the latency that matters.  If you have to wait
    a few ms longer, that's not important.  What's interesting is the
    case of a heavily loaded system, where the throughput is much more
    important than the latency.

2.  Throughput is the data transferred per unit time.  There's active
    transfer time, nowadays in the order of 500 µs, and positioning
    time, in the order of 6 ms.  Clearly, the fewer positioning
    operations, the better.  This means that you should want to put
    most transfers on a single spindle, not a single stripe.  To do
    this, you need big stripes.

> raid5:
> For a write you have two read transactions and two writes.

This is the way Vinum does it (a sketch of this read-modify-write
appears at the end of this message).  There are other possibilities:

1.  Always do full-stripe writes.  Then you don't need to read the old
    contents.

2.  Cache the parity blocks.  This is an optimization which I think
    would be very valuable, but which Vinum doesn't currently perform.

> There are easier ways to raise performance.
> Ever wondered why people claim Vinum's RAID-5 writes are slow?
> The answer is astonishingly simple:
> Vinum does stripe-based locking, while UFS tries to lay out data
> mostly in ascending sectors.
> What happens here is that the first write has to wait for two reads
> and two writes.
> If we have an ascending write, it has to wait for the first write to
> finish, because the stripe is still locked.
> The first is unlocked only after both physical writes are on disk.
> Now we start our two reads, which are (thanks to the drive's
> precache) most likely in the drive's cache - then we write.
>
> The problem here is that the physical writes get serialized, and the
> drive has to wait a complete rotation between each.

Not if the data is in the drive cache.

> If we had fine-grained locking which locked only the accessed sectors
> in the parity, we would be able to have more than a single ascending
> write transaction onto a single drive.

Hmm.  This is something I hadn't thought about.  Note that sequential
writes to a RAID-5 volume don't go to sequential addresses on the
spindles; they work up to the end of the stripe on one spindle, then
start on the next spindle at the start of the stripe.

You can do that as long as the address ranges in the parity block
don't overlap, but the larger the stripe, the greater the likelihood
of an overlap.  This might also explain the following observed
behaviour:

1.  RAID-5 writes slow down when the stripe size gets above 256 kB or
    so.  I don't know if this happens on all disks, but I've seen it
    often enough.

2.  rawio write performance is better than ufs write performance.
    rawio does "truly" random transfers, where ufs is a mixture.

Do you feel like changing the locking code?  It shouldn't be that much
work, and I'd be interested to see how much performance difference it
makes.

Note that there's another possible optimization here: delay the writes
by a certain period of time and coalesce them if possible.  I haven't
finished thinking about the implications.
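To ground the "two read transactions and two writes" above, here is a
minimal, self-contained sketch, in C, of the read-modify-write a
RAID-5 small write implies: read the old data and the old parity,
compute new parity = old parity XOR old data XOR new data, then write
the new data and the new parity.  This is not Vinum source; the
"disks" are in-memory arrays, and the block size and disk count are
arbitrary assumptions chosen so the example runs and checks itself.
A full-stripe write avoids the two reads because the parity can be
computed from the new data alone, and a parity cache would avoid the
parity read; the stripe-based locking Bernd describes serializes
concurrent writes into the same stripe around exactly this sequence.

#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NDISKS          3               /* two data disks plus one parity (assumed) */
#define BLOCK_SIZE      512             /* bytes per block (assumed) */

static uint8_t disk[NDISKS][BLOCK_SIZE];        /* simulated spindles */

static void
raid5_small_write(int data_disk, int parity_disk, const uint8_t *new_data)
{
        uint8_t old_data[BLOCK_SIZE], parity[BLOCK_SIZE];
        int i;

        memcpy(old_data, disk[data_disk], BLOCK_SIZE);  /* read 1: old data */
        memcpy(parity, disk[parity_disk], BLOCK_SIZE);  /* read 2: old parity */

        for (i = 0; i < BLOCK_SIZE; i++)                /* new parity */
                parity[i] ^= old_data[i] ^ new_data[i];

        memcpy(disk[data_disk], new_data, BLOCK_SIZE);  /* write 1: new data */
        memcpy(disk[parity_disk], parity, BLOCK_SIZE);  /* write 2: new parity */
}

int
main(void)
{
        uint8_t block[BLOCK_SIZE];
        int i;

        memset(block, 0xaa, BLOCK_SIZE);
        raid5_small_write(0, 2, block);         /* small write to data disk 0 */
        memset(block, 0x55, BLOCK_SIZE);
        raid5_small_write(1, 2, block);         /* small write to data disk 1 */

        /* The parity must equal the XOR of the data blocks. */
        for (i = 0; i < BLOCK_SIZE; i++)
                assert(disk[2][i] == (disk[0][i] ^ disk[1][i]));
        return (0);
}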
Greg
--
See complete headers for address and phone numbers

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message