Date:      16 Nov 1999 12:15:17 -0500
From:      Randell Jesup <rjesup@wgate.com>
To:        Greg Lehey <grog@lemis.com>
Cc:        freebsd-fs@FreeBSD.ORG
Subject:   Re: RAID-5 and failure
Message-ID:  <ybuk8nis6hm.fsf@jesup.eng.tvol.net.jesup.eng.tvol.net>
In-Reply-To: Greg Lehey's message of "Sat, 13 Nov 1999 21:33:25 -0500"
References:  <ticso@cicely.de> <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> <19991113213325.57908@mojave.sitaranetworks.com>

Greg Lehey <grog@mojave.sitaranetworks.com> writes:
>In RAID-5, I first write the data blocks, then the parity blocks.
>There are a number of scenarios here:

>4.  The system crashes after writing the first data block for a RAID-5
>    stripe and before writing the last data block.
>
>    When the system comes up, both data and parity are inconsistent.
>
>5.  The system crashes after writing the last data block for a RAID-5
>    stripe and before writing the last parity block.
>
>    When the system comes up, data is consistent, and parity is
>    inconsistent.
>
>There are a number of ways of dealing with situations 4 and 5.  The
>real problem is that they only occur when the system crashes, so
>whatever recovery information is required must be stored in
>non-volatile storage.  Some systems do include a NOVRAM for this kind
>of information, but in general purpose systems the only possibility is
>to write the information to disk, which would make the inherently slow
>RAID-5 write even slower.  My attitude here is that RAID-5 writes are
>comparatively infrequent, and so are crashes.  In the case of (5), you
>could rebuild parity after a crash.  In the case of (4), I have no
>good answer.  Suggestions welcome.

	Well, assuming that vinum can recognize that there might have been
outstanding writes (via the equivalent of a dirty flag):

	When the disks come back up (dirty), check all the parity.
The stripe that was being written will fail to check.  In case 4, the data
and parity are wrong, and in case 5, just the parity, but you don't know
which.  If you handle case 4, you can handle case 5 the same way.
Obviously you've had a write failure, but usually the FS can deal with
that possibility (with the chance of lost data, true).  Some form of
information passed out about what sector(s) were trashed might be useful
in recovery if you're not using default UFS/fsck.

	If it checks, then the data was all written before any crash,
and all is fine.
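
	To make the check-all-parity pass concrete, here is a minimal sketch in
Python.  It is not vinum code: the in-memory stripe layout and the names
xor_blocks, find_bad_stripes and rebuild_parity are assumptions made up for
illustration.  Note that rebuilding parity only makes a stripe
self-consistent again; in case 4 the data itself may still be partly old.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def find_bad_stripes(stripes):
    """Return the indices of stripes whose stored parity does not match
    the XOR of their data blocks.  Each stripe is (data_blocks, parity)."""
    return [i for i, (data, parity) in enumerate(stripes)
            if xor_blocks(data) != parity]

def rebuild_parity(stripes, bad):
    """Recompute parity from the data for the listed stripes.  This cannot
    tell whether the data itself is complete (case 4) or not."""
    for i in bad:
        data, _ = stripes[i]
        stripes[i] = (data, xor_blocks(data))
```

	The list returned by find_bad_stripes is also the sort of "which
sectors were trashed" information that could be passed up to the filesystem.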

	So the biggest trick here is recognizing the fact that the system
crashed.  You could reserve a block (or set of blocks scattered about) on
each drive for dirty flags, and only mark a disk clean if it hasn't had
writes in <some configurable amount of time>.  This keeps the write
overhead down without requiring NVRAM.  There are other evil tricks: with
SCSI, you might be able to change some innocuous mode parameter and use
it as a dirty flag, though this probably has at least as much overhead
as reserving a dirty-flag block.  And of course if you have NVRAM, store
the dirty bit there.  Hmmmmm.  Maybe in the PC's clock chip - they
generally have several bits of NVRAM.....  (On the Amiga we used those
bits for storing things like SCSI ID, boot spinup delay, etc.)
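
	The lazy clean-marking idea might look roughly like the Python sketch
below.  Everything here is hypothetical: the DirtyFlag class, the write_flag
callback (standing in for a synchronous write to the reserved dirty-flag
block) and the explicit "now" argument are assumptions for illustration, not
an existing vinum interface.

```python
class DirtyFlag:
    """Track a per-disk dirty flag, clearing it only after an idle period."""

    def __init__(self, idle_seconds, write_flag):
        self.idle_seconds = idle_seconds  # the configurable quiet time
        self.write_flag = write_flag      # hypothetical: persists True/False
        self.dirty = False
        self.last_write = None

    def before_write(self, now):
        """Call before issuing a data write.  Only the first write after a
        clean period pays for an extra flag write."""
        if not self.dirty:
            self.write_flag(True)  # must be on disk before the data goes out
            self.dirty = True
        self.last_write = now

    def tick(self, now):
        """Call periodically.  Marks the disk clean once it has been idle
        for at least idle_seconds."""
        if self.dirty and now - self.last_write >= self.idle_seconds:
            self.write_flag(False)
            self.dirty = False
```

	With an idle time of, say, 5 seconds, a burst of writes costs one flag
write at the start and one when the disk goes quiet, rather than two per
write.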

	Alternatively, you could hide the dirty flag at a higher semantic
level, by (at the OS level) recognizing a system that wasn't shut down
properly and invoking the vinum re-synchronizer.  That will work so long as
the sectors with problems aren't needed to boot the kernel and recognize the
unclean shutdown.

>> I assume that's the reason why some systems use 520 byte sectors - maybe they
>> write timestamps or generation numbers in a single write within the sector.
>
>In fact, the 520 byte sectors are used to protect against data
>corruption between the disk and the controller.  They won't help in
>this scenario.

	At the cost of performance, you could use some bytes of each sector
for generation numbers, and know in case 5 that the data is correct.
Obviously case 4 will still fail.
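
	As a rough sketch of how that check could go (in Python, with made-up
names): assume every sector written as part of one full-stripe update is
stamped with the same generation number, and that the parity sector carries
one too.  Both the stamping scheme and the classify_stripe function are
assumptions for illustration only.

```python
def classify_stripe(data_gens, parity_gen):
    """Classify a stripe after a crash.

    data_gens:  generation number read from each data sector of the stripe
    parity_gen: generation number read from the parity sector
    """
    newest = max(data_gens)
    if any(g != newest for g in data_gens):
        # Some data sectors are from an older write: case 4, data is suspect.
        return "data-incomplete"
    if parity_gen != newest:
        # All data sectors made it out but parity did not: case 5.
        return "parity-stale"
    return "consistent"
```

	For example, classify_stripe([7, 7, 7], 6) reports "parity-stale", so
the parity can be rebuilt from data known to be complete, while
classify_stripe([7, 6, 6], 6) reports "data-incomplete".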

-- 
Randell Jesup, Worldgate Communications, ex-Scala, ex-Amiga OS team ('88-94)
rjesup@wgate.com
CDA II has been passed and signed, sigh.  The lawsuit has been filed.  Please
support the organizations fighting it - ACLU, EFF, CDT, etc.


