FreeBSD Mail Archives

Date:      Tue, 16 Nov 1999 20:49:16 -0500
From:      Greg Lehey <grog@mojave.sitaranetworks.com>
To:        Randell Jesup <rjesup@wgate.com>
Cc:        freebsd-fs@FreeBSD.ORG
Subject:   Re: RAID-5 and failure
Message-ID:  <19991116204916.44107@mojave.sitaranetworks.com>
In-Reply-To: <ybuk8nis6hm.fsf@jesup.eng.tvol.net.jesup.eng.tvol.net>; from Randell Jesup on Tue, Nov 16, 1999 at 12:15:17PM -0500
References:  <ticso@cicely.de> <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> <19991113213325.57908@mojave.sitaranetworks.com> <ybuk8nis6hm.fsf@jesup.eng.tvol.net.jesup.eng.tvol.net>

On Tuesday, 16 November 1999 at 12:15:17 -0500, Randell Jesup wrote:
> Greg Lehey <grog@mojave.sitaranetworks.com> writes:
>> In RAID-5, I first write the data blocks, then the parity blocks.
>> There are a number of scenarios here:
>
>> 4.  The system crashes after writing the first data block for a RAID-5
>>    stripe and before writing the last data block.
>>
>>    When the system comes up, both data and parity are inconsistent.
>>
>> 5.  The system crashes after writing the last data block for a RAID-5
>>    stripe and before writing the last parity block.
>>
>>    When the system comes up, data is consistent, and parity is
>>    inconsistent.
>>
>> There are a number of ways of dealing with situations 4 and 5.  The
>> real problem is that they only occur when the system crashes, so
>> whatever recovery information is required must be stored in
>> non-volatile storage.  Some systems do include a NOVRAM for this kind
>> of information, but in general purpose systems the only possibility is
>> to write the information to disk, which would make the inherently slow
>> RAID-5 write even slower.  My attitude here is that RAID-5 writes are
>> comparatively infrequent, and so are crashes.  In the case of (5), you
>> could rebuild parity after a crash.  In the case of (4), I have no
>> good answer.  Suggestions welcome.
>
> 	Well, assuming that vinum can recognize that there might have been
> outstanding writes (via the equivalent of a dirty flag):
>
> 	When the disks come back up (dirty), check all the parity.
> The stripe that was being written will fail to check.  In case 4, the data
> and parity are wrong, and in case 5, just the parity, but you don't know
> which.  If you handle case 4, you can handle case 5 the same way.
> Obviously you've had a write failure, but usually the FS can deal with
> that possibility (with the chance of lost data, true).  Some form of
> information passed out about what sector(s) were trashed might be useful
> in recovery if you're not using default UFS/fsck.

Well, you're still left with the dilemma.  Worse, this check makes
fsck look like an instantaneous operation: you have to read the entire
contents of every disk.  For a 500 GB database spread across 3 LVD
controllers, you're looking at several hours.
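
As a rough illustration of the check being discussed, here is a minimal
Python sketch. It assumes parity is the bytewise XOR of the data blocks
in a stripe; the (data_blocks, parity_block) layout and the function
names are hypothetical and do not reflect vinum's on-disk format.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def find_inconsistent_stripes(stripes):
    """Return indices of stripes whose stored parity does not match the data.

    Each stripe is a (data_blocks, parity_block) pair. This layout is
    hypothetical and only models the check.
    """
    return [i for i, (data, parity) in enumerate(stripes)
            if xor_blocks(data) != parity]

# Three data blocks per stripe, parity computed before the simulated crash.
old = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f"]
parity = xor_blocks(old)

clean = (old, parity)
# Case 5: the data was rewritten, but the parity block is still the old one.
case5 = ([b"\xaa\xbb", b"\x10\x20", b"\x0f\x0f"], parity)

bad = find_inconsistent_stripes([clean, case5])
```

The scan has to visit every stripe, which is where the cost comes from,
and a mismatch only says that the stripe is suspect. It cannot tell
case 4 from case 5.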

> 	If it checks, then the data was all written before any crash,
> and all is fine.

That's the simple case.

> 	So the biggest trick here is recognizing the fact that the system
> crashed.  You could reserve a block (or set of blocks scattered about) on
> each drive for dirty flags, and only mark a disk clean if it hasn't had
> writes in <some configurable amount of time>.  This keeps the write
> overhead down without requiring NVRAM.  There are other evil tricks: with
> SCSI, you might be able to change some innocuous mode parameter and use
> it as a dirty flag, though this probably has at least as much overhead
> as reserving a dirty-flag block.  And of course if you have NVRAM, store
> the dirty bit there.  Hmmmmm.  Maybe in the PC's clock chip - they
> generally have several bits of NVRAM.....  (On the Amiga we used those
> bits for storing things like SCSI Id, boot spinup delay, etc.)
>
> 	Alternatively, you could hide the dirty flag at a higher semantic
> level, by (at the OS level) recognizing a system that wasn't shut down
> properly and invoking the vinum re-synchronizer.  So long as the sectors
> with problems aren't needed to boot the kernel and recognize this, that will
> work.
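
The reserved-block idea above can be sketched roughly as follows. This
is only a model in Python: the class, its field names, and the 30-second
idle interval are all assumptions made for illustration, and the
"on-disk" flag is just a variable standing in for the reserved block.

```python
class DirtyFlag:
    """Model of an on-disk dirty flag that is cleared only after an idle period."""

    def __init__(self, idle_seconds):
        self.idle_seconds = idle_seconds
        self.on_disk_dirty = False   # stands in for the reserved block
        self.last_write = None
        self.flag_writes = 0         # extra writes spent on the flag block

    def _set(self, value):
        # Only touch the flag block when its value actually changes.
        if self.on_disk_dirty != value:
            self.on_disk_dirty = value
            self.flag_writes += 1

    def before_write(self, now):
        # Mark the disk dirty before any data write is issued.
        self._set(True)
        self.last_write = now

    def tick(self, now):
        # Mark the disk clean only after it has been idle long enough.
        if self.last_write is not None and now - self.last_write >= self.idle_seconds:
            self._set(False)

flag = DirtyFlag(idle_seconds=30)
for t in (0, 1, 2, 3):
    flag.before_write(now=t)
flag.tick(now=10)            # only 7 seconds idle, stays dirty
dirty_mid = flag.on_disk_dirty
flag.tick(now=40)            # 37 seconds idle, marked clean
```

In this model a burst of four data writes costs two flag writes in
total, which is the point of delaying the clean marking.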

Basically, the way I see it, we have three options:

1.  Disks never crash, and anyway, we don't write to them.  Ignore the
    problem and deal with it if it comes to bite us.

2.  Get an NVRAM board and use it for this purpose.

3.  Bite the bullet and write intention logs before each write.
    VERITAS has this as an option.

These options don't have to be mutually exclusive.  It's quite
possible to implement both (2) and (3) ((1) doesn't need
implementation :-) and leave it to the user to decide which to use.
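
To make option (3) concrete, here is a minimal sketch of an intention
log in Python. It is a model under stated assumptions, not a description
of how VERITAS or vinum does it: the IntentLog class and its method
names are hypothetical, and a set in memory stands in for the log that
would have to live on disk or in NVRAM.

```python
class IntentLog:
    """In-memory stand-in for a log kept in non-volatile storage."""

    def __init__(self):
        self.pending = set()

    def begin(self, stripe):
        # Must reach stable storage before the stripe itself is touched.
        self.pending.add(stripe)

    def commit(self, stripe):
        # Data and parity are both on disk, so the intent can be retired.
        self.pending.discard(stripe)

    def stripes_to_check(self):
        # After a crash, only these stripes can have inconsistent parity.
        return sorted(self.pending)

log = IntentLog()
log.begin(7)
log.commit(7)      # this write completed normally
log.begin(42)      # simulated crash before the matching commit
suspects = log.stripes_to_check()
```

The benefit is that recovery only has to look at the stripes left in
the log instead of reading every disk. The price is the extra log write
in front of each RAID-5 write.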

>>> I assume that's the reason why some systems use 520 byte sectors - maybe they
>>> write timestamps or generation numbers in a single write within the sector.
>>
>> In fact, the 520 byte sectors are used to protect against data
>> corruption between the disk and the controller.  They won't help in
>> this scenario.
>
> 	At the cost of performance, you could use some bytes of each sector
> for generation numbers, and know in case 5 that the data is correct.
> Obviously case 4 will still fail.

No, the way things work, this would be very expensive.  We'd have to
move the data to a larger buffer and set the flags, and it would also
require at least reformatting the drive, assuming it's possible to set
a different sector size.  There are better ways to do this.

Greg
--
Finger grog@lemis.com for PGP public key
See complete headers for address and phone numbers


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?19991116204916.44107>
