Date: 16 Nov 1999 12:15:17 -0500
From: Randell Jesup <rjesup@wgate.com>
To: Greg Lehey <grog@lemis.com>
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: RAID-5 and failure
Message-ID: <ybuk8nis6hm.fsf@jesup.eng.tvol.net.jesup.eng.tvol.net>
In-Reply-To: Greg Lehey's message of "Sat, 13 Nov 1999 21:33:25 -0500"
References: <ticso@cicely.de> <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> <19991113213325.57908@mojave.sitaranetworks.com>
Greg Lehey <grog@mojave.sitaranetworks.com> writes:
>In RAID-5, I first write the data blocks, then the parity blocks.
>There are a number of scenarios here:
>4.  The system crashes after writing the first data block for a RAID-5
>    stripe and before writing the last data block.
>
>    When the system comes up, both data and parity are inconsistent.
>
>5.  The system crashes after writing the last data block for a RAID-5
>    stripe and before writing the last parity block.
>
>    When the system comes up, data is consistent, and parity is
>    inconsistent.
>
>There are a number of ways of dealing with situations 4 and 5.  The
>real problem is that they only occur when the system crashes, so
>whatever recovery information is required must be stored in
>non-volatile storage.  Some systems do include a NOVRAM for this kind
>of information, but in general purpose systems the only possibility is
>to write the information to disk, which would make the inherently slow
>RAID-5 write even slower.  My attitude here is that RAID-5 writes are
>comparatively infrequent, and so are crashes.  In the case of (5), you
>could rebuild parity after a crash.  In the case of (4), I have no
>good answer.  Suggestions welcome.

Well, assuming that vinum can recognize that there might have been
outstanding writes (via the equivalent of a dirty flag):

When the disks come back up (dirty), check all the parity.  The stripe
that was being written will fail to check.  In case 4, the data and
parity are wrong, and in case 5, just the parity, but you don't know
which.  If you handle case 4, you can handle case 5 the same way.
Obviously you've had a write failure, but usually the FS can deal with
that possibility (with the chance of lost data, true).  Some form of
information passed out about what sector(s) were trashed might be
useful in recovery if you're not using default UFS/fsck.

If it checks, then the data was all written before any crash, and all
is fine.
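The parity check described above can be sketched roughly as follows. This is only an illustration in Python, not vinum code: the in-memory stripe layout (a list of data blocks plus one parity block) and the names xor_blocks, find_inconsistent_stripes and rebuild_parity are all assumptions made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def find_inconsistent_stripes(stripes):
    """Return the indexes of stripes whose stored parity fails to check.

    Each stripe is a hypothetical (data_blocks, parity_block) pair.
    A failing stripe may be case 4 or case 5; the check cannot tell which.
    """
    bad = []
    for index, (data_blocks, parity) in enumerate(stripes):
        if xor_blocks(data_blocks) != parity:
            bad.append(index)
    return bad


def rebuild_parity(data_blocks):
    """Recompute parity from whatever data is on disk.

    This makes the stripe self-consistent again, but it does not recover
    data blocks that were only partly written (case 4).
    """
    return xor_blocks(data_blocks)


# One stripe that checks, and one where a data block changed but the
# parity block was never updated.
good = ([b"\x01\x02", b"\x04\x08"], b"\x05\x0a")
torn = ([b"\xff\x02", b"\x04\x08"], b"\x05\x0a")
stripes = [good, torn]

bad = find_inconsistent_stripes(stripes)
print("stripes needing attention:", bad)
```

The list of failing stripe indexes is the sort of information that could be passed up to whatever does filesystem-level recovery.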
So the biggest trick here is recognizing the fact that the system
crashed.  You could reserve a block (or set of blocks scattered about)
on each drive for dirty flags, and only mark a disk clean if it hasn't
had writes in <some configurable amount of time>.  This keeps the write
overhead down without requiring NVRAM.

There are other evil tricks: with SCSI, you might be able to change
some innocuous mode parameter and use it as a dirty flag, though this
probably has at least as much overhead as reserving a dirty-flag block.

And of course if you have NVRAM, store the dirty bit there.  Hmmmmm.
Maybe in the PC's clock chip - they generally have several bits of
NVRAM.....  (On the Amiga we used those bits for storing things like
SCSI Id, boot spinup delay, etc.)

Alternatively, you could hide the dirty flag at a higher semantic
level, by (at the OS level) recognizing a system that wasn't shut down
properly and invoking the vinum re-synchronizer.  So long as the
sectors with problems aren't needed to boot the kernel and recognize
this, that will work.

>> I assume that's the reason why some systems use 520 byte sectors - maybe they
>> write timestamps or generation numbers in a single write within the sector.
>
>In fact, the 520 byte sectors are used to protect against data
>corruption between the disk and the controller.  They won't help in
>this scenario.

At the cost of performance, you could use some bytes of each sector for
generation numbers, and know in case 5 that the data is correct.
Obviously case 4 will still fail.

-- 
Randell Jesup, Worldgate Communications, ex-Scala, ex-Amiga OS team ('88-94)
rjesup@wgate.com
CDA II has been passed and signed, sigh.  The lawsuit has been filed.  Please
support the organizations fighting it - ACLU, EFF, CDT, etc.