Date: Sat, 13 Nov 1999 21:33:25 -0500
From: Greg Lehey <grog@mojave.sitaranetworks.com>
To: Bernd Walter <ticso@cicely.de>, Mattias Pantzare <pantzer@ludd.luth.se>
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: RAID-5 and failure
Message-ID: <19991113213325.57908@mojave.sitaranetworks.com>
In-Reply-To: <19991106183316.A9420@cicely7.cicely.de>; from Bernd Walter on Sat, Nov 06, 1999 at 06:33:16PM +0100
References: <ticso@cicely.de> <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de>

On Saturday, 6 November 1999 at 18:33:16 +0100, Bernd Walter wrote:
> On Sat, Nov 06, 1999 at 06:16:47PM +0100, Mattias Pantzare wrote:
>>> On Sat, Nov 06, 1999 at 04:58:55PM +0100, Mattias Pantzare wrote:
>>>> What happens if the data part of a write to a RAID-5 plex completes but not the
>>>> parity part (or the other way)?
>>>>
>>> The parity is not in sync - what else?
>>
>> The system could detect it and recalculate the parity.  Or give a warning to
>> the user so the user knows that the data is not safe.
>
> That's not possible because you need to write more than a single
> sector to keep parity in sync which is not atomic.
>
> In case one of the writes fails vinum will do everything needed to
> work with it and to inform the user.

In RAID-5, I first write the data blocks, then the parity blocks.
There are a number of scenarios here:

1.  The drive containing a data or parity block goes down.  In this
    case, the subdisks on that drive will be marked 'crashed'.  The
    subdisk to which the write went will be marked 'stale'.  When the
    drive is brought up again (manually), the data will be recreated.
    I've been thinking about keeping a log somewhere of what needs to
    be updated, but this carries dangers of corruption.  At the moment
    I require that the entire subdisk be rewritten.  This will also
    recreate parity where necessary.

2.  The subdisk containing a data or parity block has an
    unrecoverable I/O error.  This is pretty much the same as the
    previous case, except that the other subdisks don't crash.

3.  The system crashes before writing the first data block for a
    RAID-5 stripe.  The updates are lost (obviously).  When the system
    comes up, the data should be consistent.

4.  The system crashes after writing the first data block for a RAID-5
    stripe and before writing the last data block.  When the system
    comes up, both data and parity are inconsistent.

5.  The system crashes after writing the last data block for a RAID-5
    stripe and before writing the last parity block.
    When the system comes up, data is consistent, and parity is
    inconsistent.

There are a number of ways of dealing with situations 4 and 5.  The
real problem is that they only occur when the system crashes, so
whatever recovery information is required must be stored in
non-volatile storage.  Some systems do include a NOVRAM for this kind
of information, but in general purpose systems the only possibility is
to write the information to disk, which would make the inherently slow
RAID-5 write even slower.

My attitude here is that RAID-5 writes are comparatively infrequent,
and so are crashes.  In the case of (5), you could rebuild parity
after a crash.  In the case of (4), I have no good answer.
Suggestions welcome.

Having said that, I probably need to revise the code which serializes
the data and parity writes.  It currently uses the B_ORDERED flag in
the buffer headers, and I'm not sure that's enough.  I should probably
modify it to confirm that the data blocks have been written before
starting to write the parity blocks.

> Vinum will take the subdisk down because such drives should work with
> write reallocation enabled and such a disk is badly broken if you receive a
> write error.
>
> If the system panics or power fails between such a write there is no way to
> find out if the parity is broken besides verifying the complete plex after
> reboot - the problem should be the same with all usual hard and software
> solutions - greg already begun or finished recalculating and checking the
> parity.
> I assume that's the reason why some systems use 520 byte sectors - maybe they
> write timestamps or generation numbers in a single write within the sector.

In fact, the 520 byte sectors are used to protect against data
corruption between the disk and the controller.  They won't help in
this scenario.
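
To make scenarios 4 and 5 concrete, here is a toy model of a single
RAID-5 stripe with XOR parity.  It is only a sketch and has nothing to
do with the actual vinum code: the block contents, the three-data-block
stripe width and the helper names (xor_blocks, parity_consistent) are
all made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_consistent(data, parity):
    """True if the parity block matches the XOR of the data blocks."""
    return xor_blocks(data) == parity


# Hypothetical stripe: three data blocks plus one parity block.
old_data = [b"\x01" * 4, b"\x02" * 4, b"\x04" * 4]
new_data = [b"\x10" * 4, b"\x20" * 4, b"\x40" * 4]
old_parity = xor_blocks(old_data)

# Scenario 5: every data block was written, then the system crashed
# before the parity block was written.
s5_disk = list(new_data)
s5_before = parity_consistent(s5_disk, old_parity)    # stale parity
s5_parity = xor_blocks(s5_disk)                       # rebuild after reboot
s5_after = parity_consistent(s5_disk, s5_parity)
s5_data_is_new = s5_disk == new_data                  # the data is intact

# Scenario 4: only the first data block was written before the crash.
s4_disk = [new_data[0], old_data[1], old_data[2]]
s4_before = parity_consistent(s4_disk, old_parity)    # stale parity
s4_parity = xor_blocks(s4_disk)                       # rebuild after reboot
s4_after = parity_consistent(s4_disk, s4_parity)
# Rebuilding makes the parity match what is on disk, but the stripe is
# now a mixture of old and new data, and parity cannot tell us which.
s4_data_is_old = s4_disk == old_data
s4_data_is_new = s4_disk == new_data
```

In this model, rebuilding parity fully repairs scenario 5, because the
data blocks are all new.  In scenario 4 the rebuilt parity merely
agrees with a half-written stripe, which is why recalculating parity
alone is not an answer there.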

Greg
--
Finger grog@lemis.com for PGP public key
See complete headers for address and phone numbers


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message