From: Bernd Walter
To: Greg Lehey
Cc: Matthew Dillon, Wilko Bulte, Mike Smith, Terry Lambert, Joerg Wunsch, freebsd-current@FreeBSD.org
Date: Thu, 13 Dec 2001 03:06:14 +0100
Subject: Re: Vinum write performance (was: RAID performance (was: cvs commit: src/sys/kern subr_diskmbr.c))
Message-ID: <20011213030613.A18679@cicely8.cicely.de>
In-Reply-To: <20011213105413.G76019@monorchid.lemis.com>

On Thu, Dec 13, 2001 at 10:54:13AM +1030, Greg Lehey wrote:
> On Wednesday, 12 December 2001 at 12:53:37 +0100, Bernd Walter wrote:
> > On Wed, Dec 12, 2001 at 04:22:05PM +1030, Greg Lehey wrote:
> >> On Tuesday, 11 December 2001 at  3:11:21 +0100, Bernd Walter wrote:
> >> 2.  Cache the parity blocks.  This is an optimization which I think
> >>     would be very valuable, but which Vinum doesn't currently perform.
> >
> > I thought of connecting the parity to the wait lock.
> > If there's a waiter for the same parity data it's not dropped.
> > This way we don't waste memory but still have an effect.
>
> That's a possibility, though it doesn't directly address parity block
> caching.  The problem is that by the time you find another lock,
> you've already performed part of the parity calculation, and probably
> part of the I/O transfer.  But it's an interesting consideration.

I know that it isn't the best approach, but it's easy to implement.
More complex handling for better results can still be added later.

> >>> If we had fine-grained locking which only locks the accessed sectors
> >>> in the parity we would be able to have more than a single ascending
> >>> write transaction onto a single drive.
> >>
> >> Hmm.  This is something I hadn't thought about.  Note that sequential
> >> writes to a RAID-5 volume don't go to sequential addresses on the
> >> spindles; they will work up to the end of the stripe on one spindle,
> >> then start on the next spindle at the start of the stripe.  You can do
> >> that as long as the address ranges in the parity block don't overlap,
> >> but the larger the stripe, the greater the likelihood of this would
> >> be.  This might also explain the following observed behaviour:
> >>
> >> 1.  RAID-5 writes slow down when the stripe size gets > 256 kB or so.
> >>     I don't know if this happens on all disks, but I've seen it often
> >>     enough.
> >
> > I would guess it happens when the stripe size is bigger than the
> > preread cache the drives use.
> > This would mean we have less chance of getting parity data out of the
> > drive cache.
>
> Yes, this was one of the possibilities we considered.

It should be measured and compared after I have changed the locking.
The results will look different after that and may point to other causes,
because the load characteristics on the drives will be different.
Currently, if we have two writes in each of two stripes, all initiated
before the first has finished, the drive has to seek between the two
stripes, because the second write to the same stripe has to wait.

> >> Note that there's another possible optimization here: delay the writes
> >> by a certain period of time and coalesce them if possible.  I haven't
> >> finished thinking about the implications.
> >
> > That's exactly what the ufs clustering and softupdates do.
> > If it doesn't fit modern drives anymore it should get tuned there.
>
> This doesn't have too much to do with modern drives; it's just as
> applicable to 70s drives.

One of softupdates' jobs is to eliminate redundant writes and to do
async writes without losing consistency of the on-media structure.
This also means that we have a better chance that data is written in
big chunks.
In general, the wire speed of data to the drive increases with every
new bus generation, but usually big parts of the overhead are kept for
compatibility with older drives.
I agree that the parity-based RAID situation depends more on principle
than on the age of the drive.

> > Whenever a write hits a driver there is a waiter for it.
> > Either a softdep, a memory freeing or an application doing a sync
> > transfer.
> > I'm almost sure delaying writes will harm performance in upper layers.
>
> I'm not so sure.  Full stripe writes, where needed, are *much* faster
> than partial stripe writes.
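
To make the full-stripe versus partial-stripe point above concrete, here is a minimal Python sketch of the usual textbook RAID-5 parity model. It is not Vinum code: the function names and the I/O counts are assumptions for illustration (one parity block per stripe, a single-block read-modify-write for the partial case), and no timings are implied.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte (RAID-5 parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def full_stripe_write(new_data):
    """All data blocks of the stripe are rewritten.

    Parity comes from the new data alone, so nothing has to be read:
    0 reads, one write per data block plus one parity write.
    """
    parity = xor_blocks(*new_data)
    return parity, {"reads": 0, "writes": len(new_data) + 1}


def partial_stripe_write(old_block, old_parity, new_block):
    """Only one data block of the stripe is rewritten (read-modify-write).

    new_parity = old_parity XOR old_data XOR new_data, which requires
    reading the old data and the old parity first: 2 reads, 2 writes.
    """
    parity = xor_blocks(old_parity, old_block, new_block)
    return parity, {"reads": 2, "writes": 2}


if __name__ == "__main__":
    stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]  # 3 data blocks + parity
    parity, full_cost = full_stripe_write(stripe)
    _, partial_cost = partial_stripe_write(stripe[1], parity, b"\xaa\x55")
    print("full stripe:", full_cost, "for", len(stripe), "blocks")
    print("partial    :", partial_cost, "for 1 block")
```

In this model a full stripe write on a four-disk plex costs four I/Os for three data blocks, while updating a single block costs four I/Os for one block, two of which are reads that must complete before the writes can start.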

Hardware RAID usually comes with NVRAM and can cache write data without
delaying the acknowledgement to the initiator.
That option is not available to software RAID.

-- 
B.Walter              COSMO-Project         http://www.cosmo-project.de
ticso@cicely.de         Usergroup           info@cosmo-project.de
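
The fine-grained parity locking idea discussed in this message can be sketched roughly as follows. This is a simplified Python model, not Vinum's actual lock code: `locate`, `ParityRangeLocks`, and the layout (fixed stripe unit, one parity unit per stripe row, parity rotation ignored) are all assumptions made for illustration. A write locks only the byte range of the parity unit that it touches, so two writes to the same stripe conflict only when those ranges overlap.

```python
def locate(offset, stripe_unit, n_disks):
    """Map a volume byte offset to (stripe_row, data_column, offset_in_unit).

    Assumes n_disks - 1 data units per stripe row; parity rotation is
    ignored, so data_column is a logical column, not a physical disk.
    """
    data_columns = n_disks - 1
    unit_index, within = divmod(offset, stripe_unit)
    row, column = divmod(unit_index, data_columns)
    return row, column, within


class ParityRangeLocks:
    """Per-stripe-row locks on half-open byte ranges of the parity unit."""

    def __init__(self):
        self._held = {}  # row -> list of (start, end) ranges currently held

    def try_lock(self, row, start, end):
        """Take the range if it overlaps no held range in the same row."""
        ranges = self._held.setdefault(row, [])
        if any(start < held_end and held_start < end
               for held_start, held_end in ranges):
            return False  # caller would have to wait, as with a stripe lock
        ranges.append((start, end))
        return True

    def unlock(self, row, start, end):
        """Release a range previously taken with try_lock."""
        self._held[row].remove((start, end))


if __name__ == "__main__":
    K = 1024
    unit = 256 * K
    locks = ParityRangeLocks()
    # Two sequential 64 kB writes into the same stripe unit of row 0.
    for offset in (0, 64 * K):
        row, _, within = locate(offset, unit, n_disks=4)
        print(offset, locks.try_lock(row, within, within + 64 * K))
```

With a whole-stripe lock the second of the two sequential writes would have to wait for the first. With range locks both can be in flight, because they touch different parts of the parity unit. A write at the same offset within the next data column of the same row would still conflict, which matches the observation that overlapping parity ranges limit concurrency.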