From owner-freebsd-current Wed Dec 12 16:53:30 2001
Delivered-To: freebsd-current@freebsd.org
Received: from monorchid.lemis.com (monorchid.lemis.com [192.109.197.75])
	by hub.freebsd.org (Postfix) with ESMTP
	id 202E737B417; Wed, 12 Dec 2001 16:53:16 -0800 (PST)
Received: by monorchid.lemis.com (Postfix, from userid 1004)
	id 804B3786E7; Thu, 13 Dec 2001 10:54:13 +1030 (CST)
Date: Thu, 13 Dec 2001 10:54:13 +1030
From: Greg Lehey
To: Bernd Walter
Cc: Matthew Dillon, Wilko Bulte, Mike Smith, Terry Lambert,
	Joerg Wunsch, freebsd-current@FreeBSD.ORG
Subject: Re: Vinum write performance (was: RAID performance (was: cvs commit: src/sys/kern subr_diskmbr.c))
Message-ID: <20011213105413.G76019@monorchid.lemis.com>
References: <200112101754.fBAHsRV01202@mass.dis.org>
	<200112101813.fBAIDKo47460@apollo.backplane.com>
	<20011210192251.A65380@freebie.xs4all.nl>
	<200112101830.fBAIU4w47648@apollo.backplane.com>
	<20011211110633.M63585@monorchid.lemis.com>
	<20011211031120.G11774@cicely8.cicely.de>
	<20011212162205.I82733@monorchid.lemis.com>
	<20011212125337.D15654@cicely8.cicely.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20011212125337.D15654@cicely8.cicely.de>
User-Agent: Mutt/1.3.23i
Organization: The FreeBSD Project
Phone: +61-8-8388-8286
Fax: +61-8-8388-8725
Mobile: +61-418-838-708
WWW-Home-Page: http://www.FreeBSD.org/
X-PGP-Fingerprint: 6B 7B C3 8C 61 CD 54 AF 13 24 52 F8 6D A4 95 EF
Sender: owner-freebsd-current@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.ORG

On Wednesday, 12 December 2001 at 12:53:37 +0100, Bernd Walter wrote:
> On Wed, Dec 12, 2001 at 04:22:05PM +1030, Greg Lehey wrote:
>> On Tuesday, 11 December 2001 at 3:11:21 +0100, Bernd Walter wrote:
>>> striped:
>>> If you have 512byte stripes and have 2 disks.
>>> You access 64k which is put into 2 32k transactions onto the disk.
>>
>> Only if your software optimizes the transfers. There are reasons why
>> it should not. Without optimization, you get 128 individual
>> transfers.
>
> If the software does not we end with 128 transactions anyway, which is
> not very good because of the overhead for each of them.

Correct.

> UFS does a more or less good job in doing this.

Well, it requires a lot of moves. Vinum *could* do this, but for the
reasons specified below, there's no need.

>>> raid5:
>>> For a write you have two read transactions and two writes.
>>
>> This is the way Vinum does it. There are other possibilities:
>>
>> 1. Always do full-stripe writes. Then you don't need to read the old
>>    contents.
>
> Which isn't that good with the big stripes we usually want.

Correct. That's why most RAID controllers limit stripe size to
something sub-optimal, because it simplifies the code to do
full-stripe writes.

>> 2. Cache the parity blocks. This is an optimization which I think
>>    would be very valuable, but which Vinum doesn't currently perform.
>
> I thought of connecting the parity to the wait lock.
> If there's a waiter for the same parity data it's not dropped.
> This way we don't waste memory but still have an effect.

That's a possibility, though it doesn't directly address parity block
caching. The problem is that by the time you find another lock,
you've already performed part of the parity calculation, and probably
part of the I/O transfer. But it's an interesting consideration.

>>> If we had a fine grained locking which only locks the accessed sectors
>>> in the parity we would be able to have more than a single ascending
>>> write transaction onto a single drive.
>>
>> Hmm. This is something I hadn't thought about. Note that sequential
>> writes to a RAID-5 volume don't go to sequential addresses on the
>> spindles; they will work up to the end of the stripe on one spindle,
>> then start on the next spindle at the start of the stripe.
>> You can do that as long as the address ranges in the parity block
>> don't overlap, but the larger the stripe, the greater the likelihood
>> of this would be. This might also explain the following observed
>> behaviour:
>>
>> 1. RAID-5 writes slow down when the stripe size gets > 256 kB or so.
>>    I don't know if this happens on all disks, but I've seen it often
>>    enough.
>
> I would guess it happens when the stripe size is bigger than the
> preread cache the drive uses.
> This would mean we have less chance to get parity data out of the
> drive cache.

Yes, this was one of the possibilities we considered.

>> 2. rawio write performance is better than ufs write performance.
>>    rawio does "truly" random transfers, where ufs is a mixture.
>
> The current problem is to increase linear write performance.
> I don't see a chance that rawio benefits from it, but ufs will.

Well, rawio doesn't need to benefit. It's supposed to be a neutral
observer, but in this case it's not doing too well.

>> Do you feel like changing the locking code? It shouldn't be that much
>> work, and I'd be interested to see how much performance difference it
>> makes.
>
> I put it onto my todo list.

Thanks.

>> Note that there's another possible optimization here: delay the writes
>> by a certain period of time and coalesce them if possible. I haven't
>> finished thinking about the implications.
>
> That's exactly what the ufs clustering and softupdates do.
> If it doesn't fit modern drives anymore it should get tuned there.

This doesn't have too much to do with modern drives; it's just as
applicable to 70s drives.

> Whenever a write hits a driver there is a waiter for it.
> Either a softdep, a memory freeing or an application doing a sync
> transfer.
> I'm almost sure delaying writes will harm performance in upper layers.

I'm not so sure. Full stripe writes, where needed, are *much* faster
than partial stripe writes.
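
To put rough numbers on the transfer counts discussed in this message,
here is a minimal sketch in Python. It is a simplified model and not
Vinum code: the function names are made up for illustration, requests
are assumed to be stripe-aligned, and the RAID-5 case assumes a write
that stays within one stripe and uses read-modify-write for anything
less than a full stripe.

```python
def striped_transfers(request_bytes, stripe_bytes, ndisks, coalesce=False):
    """Count disk transfers for a stripe-aligned, contiguous request."""
    chunks = -(-request_bytes // stripe_bytes)  # ceiling division
    if coalesce:
        # The chunks that land on one disk are contiguous on that disk,
        # so they can be merged into a single transfer per disk.
        return min(chunks, ndisks)
    return chunks


def raid5_write_ios(units_written, data_disks):
    """Return (reads, writes) for a write confined to one RAID-5 stripe.

    units_written is the number of per-disk stripe units being written;
    data_disks is the number of data units per stripe (parity excluded).
    """
    if not 1 <= units_written <= data_disks:
        raise ValueError("units_written must be between 1 and data_disks")
    if units_written == data_disks:
        # Full-stripe write: parity comes from the new data alone,
        # so nothing has to be read first.
        return (0, data_disks + 1)
    # Read-modify-write: read old data and old parity,
    # then write new data and new parity.
    return (units_written + 1, units_written + 1)


def update_parity(old_parity, old_data, new_data):
    """New parity = old parity XOR old data XOR new data, bytewise."""
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))


if __name__ == "__main__":
    # 64k request, 512-byte stripes, 2 disks, as in the example above.
    print(striped_transfers(64 * 1024, 512, 2))
    print(striped_transfers(64 * 1024, 512, 2, coalesce=True))
    # Single-unit write versus full-stripe write on 4 data disks + parity.
    print(raid5_write_ios(1, 4))
    print(raid5_write_ios(4, 4))
```

In this model a single-unit write costs two reads and two writes, which
matches the figure quoted above, while a full-stripe write needs no
reads at all. The model says nothing about seek or rotational latency,
so it only counts transfers and does not predict throughput.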

Greg
--
See complete headers for address and phone numbers

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message