Date: 19 Nov 1999 11:33:58 -0500
From: Randell Jesup <rjesup@wgate.com>
To: freebsd-fs@FreeBSD.ORG
Subject: Re: RAID-5 and failure
Message-ID: <ybuso22qw3t.fsf@jesup.eng.tvol.net.jesup.eng.tvol.net>
In-Reply-To: Greg Lehey's message of "Tue, 16 Nov 1999 20:49:16 -0500"
References: <ticso@cicely.de> <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> <19991113213325.57908@mojave.sitaranetworks.com> <ybuk8nis6hm.fsf@jesup.eng.tvol.net.jesup.eng.tvol.net> <19991116204916.44107@mojave.sitaranetworks.com>
Greg Lehey <grog@mojave.sitaranetworks.com> writes:
>> When the disks come back up (dirty), check all the parity.
>> The stripe that was being written will fail to check.  In case 4, the data
>> and parity are wrong, and in case 5, just the parity, but you don't know
>> which.  If you handle case 4, you can handle case 5 the same way.
>> Obviously you've had a write failure, but usually the FS can deal with
>> that possibility (with the chance of lost data, true).  Some form of
>> information passed out about what sector(s) were trashed might be useful
>> in recovery if you're not using default UFS/fsck.
>
>Well, you're still left with the dilemma.  Worse, this check makes
>fsck look like an instantaneous operation: you have to read the entire
>contents of every disk.  For a 500 GB database spread across 3 LVD
>controllers, you're looking at several hours.

        True.  Not that it may matter, but you could have dirty flags for
each cylinder group (or whatever).  This both adds locality (shorter seeks)
and reduces the amount that needs rechecking.  If an area hasn't been
written to 'recently', the dirty flag for the area gets rewritten to clean.
This keeps the amount of the disk that must be reread after a crash down
to a very manageable level.  Tuning the size of the groups covered by a
flag, and the timeout for rewriting a flag to clean, would take a little
work.

>> If it checks, then the data was all written before any crash,
>> and all is fine.
>
>That's the simple case.

        That's certainly true.

>> So the biggest trick here is recognizing the fact that the system
>> crashed.  You could reserve a block (or set of blocks scattered about) on
>> each drive for dirty flags, and only mark a disk clean if it hasn't had
>> writes in <some configurable amount of time>.  This keeps the write
>> overhead down without requiring NVRAM.
>> There are other evil tricks: with
>> SCSI, you might be able to change some innocuous mode parameter and use
>> it as a dirty flag, though this probably has at least as much overhead
>> as reserving a dirty-flag block.  And of course if you have NVRAM, store
>> the dirty bit there.  Hmmmmm.  Maybe in the PC's clock chip - they
>> generally have several bits of NVRAM.....  (On the Amiga we used those
>> bits for storing things like SCSI ID, boot spinup delay, etc.)
>>
>> Alternatively, you could hide the dirty flag at a higher semantic
>> level, by (at the OS level) recognizing a system that wasn't shut down
>> properly and invoking the vinum re-synchronizer.  So long as the sectors
>> with problems aren't needed to boot the kernel and recognize this, that
>> will work.
>
>Basically, the way I see it, we have three options:
>
>1. Disks never crash, and anyway, we don't write to them.  Ignore the
>   problem and deal with it if it comes to bite us.
>
>2. Get an NVRAM board and use it for this purpose.

        How much is commonly stored in NVRAM boards for RAID?  If it's
merely the location of the write, _maybe_ clock-chip memory might work (if
writing to it that often doesn't slow down the system - I don't remember
how fast the interface is).  If it's the entire sector, then we're screwed
without it or #3 - or rather, we could have a corrupted stripe after a
crash.  Oh well.

>3. Bite the bullet and write intention logs before each write.
>   VERITAS has this as an option.

        Probably worthwhile.

>These options don't have to be mutually exclusive.  It's quite
>possible to implement both ((1) doesn't need implementation :-) and
>leave it to the user to decide which to use.

        Quite so.  BTW, I assume I'm correct in assuming that vinum
normally works on drives with write-behind disabled...

>> At the cost of performance, you could use some bytes of each sector
>> for generation numbers, and know in case 5 that the data is correct.
>> Obviously case 4 will still fail.
>
>No, the way things work, this would be very expensive.  We'd have to
>move the data to a larger buffer and set the flags, and it would also
>require at least reformatting the drive, assuming it's possible to set
>a different sector size.  There are better ways to do this.

        Well, I was assuming you'd use some bytes of the existing sector
size (such as 511 bytes of user data per sector, plus 1 byte of
generation).  We're talking lots of extra CPU overhead on read or write,
however, to transfer the data into alternative buffers before a write and
to invert that on a read - not to mention that higher-level code tends to
be inflexible about sector sizes that aren't powers of two (or multiples
of 512, for that matter).

        Does vinum do any transfers of user data into alternative buffers
before posting its writes, or does it just use scatter/gather lists?

-- 
Randell Jesup, Worldgate Communications, ex-Scala, ex-Amiga OS team ('88-94)
rjesup@wgate.com
CDA II has been passed and signed, sigh.  The lawsuit has been filed.
Please support the organizations fighting it - ACLU, EFF, CDT, etc.