Date:      Sat, 13 Nov 1999 21:33:25 -0500
From:      Greg Lehey <grog@mojave.sitaranetworks.com>
To:        Bernd Walter <ticso@cicely.de>, Mattias Pantzare <pantzer@ludd.luth.se>
Cc:        freebsd-fs@FreeBSD.ORG
Subject:   Re: RAID-5 and failure
Message-ID:  <19991113213325.57908@mojave.sitaranetworks.com>
In-Reply-To: <19991106183316.A9420@cicely7.cicely.de>; from Bernd Walter on Sat, Nov 06, 1999 at 06:33:16PM +0100
References:  <ticso@cicely.de> <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de>

On Saturday,  6 November 1999 at 18:33:16 +0100, Bernd Walter wrote:
> On Sat, Nov 06, 1999 at 06:16:47PM +0100, Mattias Pantzare wrote:
>>> On Sat, Nov 06, 1999 at 04:58:55PM +0100, Mattias Pantzare wrote:
>>>> What happens if the data part of a write to a RAID-5 plex completes but not the
>>>> parity part (or the other way)?
>>>>
>>> The parity is not in sync - what else?
>>
>> The system could detect it and recalculate the parity. Or give a warning to
>> the user so the user knows that the data is not safe.
>
> That's not possible because you need to write more than a single
> sector to keep parity in sync, which is not atomic.
>
> In case one of the writes fails, vinum will do everything needed to
> work with it and to inform the user.

In RAID-5, I first write the data blocks, then the parity blocks.
There are a number of scenarios here:

1.  The drive containing a data or parity block goes down.

    In this case, the subdisks on that drive will be marked
    'crashed'.  The subdisk to which the write went will be marked
    'stale'.  When the drive is brought up again (manually), the data
    will be recreated.

    I've been thinking about keeping a log somewhere of what needs to
    be updated, but this carries dangers of corruption.  At the moment
    I require that the entire subdisk be rewritten.  This will also
    recreate parity where necessary.

2.  The subdisk containing a data or parity block has an unrecoverable
    I/O error.

    This is pretty much the same as the previous case, except that the
    other subdisks don't crash.

3.  The system crashes before writing the first data block for a
    RAID-5 stripe.

    The updates are lost (obviously).  When the system comes up, the
    data should be consistent.

4.  The system crashes after writing the first data block for a RAID-5
    stripe and before writing the last data block.

    When the system comes up, both data and parity are inconsistent.

5.  The system crashes after writing the last data block for a RAID-5
    stripe and before writing the last parity block.

    When the system comes up, data is consistent, and parity is
    inconsistent.
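
The scenarios above can be illustrated with a small sketch.  It is not
Vinum code: it assumes the usual RAID-5 definition of parity as the
bytewise XOR of the data blocks in a stripe, and the names xor_blocks
and parity_ok are hypothetical helpers made up for this example.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def parity_ok(data, parity):
    """True if the parity block matches the data blocks of the stripe."""
    return xor_blocks(data) == parity

# One stripe: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
assert parity_ok(data, parity)

# Scenarios 1 and 2: a lost block is the XOR of the surviving blocks
# and the parity block, provided the stripe was consistent.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]

new_data = [b"aaaa", b"bbbb", b"cccc"]

# Scenario 5: every data block was written, the parity block was not.
# The data is the new data, but the parity is stale.
scenario5_ok = parity_ok(new_data, parity)

# Scenario 4: only the first data block was written before the crash.
partial = [new_data[0], data[1], data[2]]
scenario4_ok = parity_ok(partial, parity)

# With stale parity, reconstructing a block no longer yields its contents.
bad_rebuild = xor_blocks([partial[0], partial[2], parity])
assert not scenario5_ok and not scenario4_ok
assert bad_rebuild != data[1]
```

The last assertion shows why stale parity matters: a later drive failure
in such a stripe would reconstruct wrong data without any error.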

There are a number of ways of dealing with situations 4 and 5.  The
real problem is that they only occur when the system crashes, so
whatever recovery information is required must be stored in
non-volatile storage.  Some systems do include a NOVRAM for this kind
of information, but in general purpose systems the only possibility is
to write the information to disk, which would make the inherently slow
RAID-5 write even slower.  My attitude here is that RAID-5 writes are
comparatively infrequent, and so are crashes.  In the case of (5), you
could rebuild parity after a crash.  In the case of (4), I have no
good answer.  Suggestions welcome.
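
As a rough sketch of the trade-off described above, the following
records the number of a stripe before writing to it and recomputes
parity for the recorded stripes after a crash.  Everything here is an
assumption made for illustration: the Stripe class, the dirty set (which
stands in for a log in non-volatile storage) and rebuild_parity are
hypothetical and do not describe how Vinum works.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

class Stripe:
    """Hypothetical in-memory stand-in for one RAID-5 stripe."""
    def __init__(self, data):
        self.data = list(data)
        self.parity = xor_blocks(self.data)

def rebuild_parity(stripes, dirty):
    """Recompute parity for every stripe in the log; return the number fixed."""
    fixed = 0
    for n in sorted(dirty):
        good = xor_blocks(stripes[n].data)
        if stripes[n].parity != good:
            stripes[n].parity = good
            fixed += 1
    dirty.clear()
    return fixed

stripes = [Stripe([b"AAAA", b"BBBB"]), Stripe([b"CCCC", b"DDDD"])]
dirty = set()  # assumption: would have to live in non-volatile storage

# Start a write to stripe 1: log the intent first, then write the data.
dirty.add(1)
stripes[1].data = [b"wxyz", b"1234"]
# Simulated crash here: the parity write never happens, the log entry stays.

fixed = rebuild_parity(stripes, dirty)
```

This only covers case (5).  In case (4) the same rebuild makes the
stripe self-consistent again, but it cannot tell which data blocks hold
old data and which hold new data, so the write is still not atomic.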

Having said that, I probably need to revise the code which
serializes the data and parity writes.  It currently uses the
B_ORDERED flag in the buffer headers, and I'm not sure that's enough.
I should probably modify it to confirm that the data blocks are
written before starting to write the parity blocks.
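
The ordering described above can be sketched in user space.  This is an
analogy only, not kernel code: it assumes each subdisk is an ordinary
file descriptor and uses os.fsync as the confirmation that the data
blocks have been written.  write_stripe and the file names are
hypothetical, and os.pwrite and os.pread are only available on Unix-like
systems.

```python
import os
import tempfile

def write_stripe(data_fds, parity_fd, offset, data_blocks, parity_block):
    """Write the data blocks, wait for them, and only then write parity."""
    # Phase 1: issue all data writes, then wait until each has completed.
    for fd, block in zip(data_fds, data_blocks):
        os.pwrite(fd, block, offset)
    for fd in data_fds:
        os.fsync(fd)
    # Phase 2: the parity write starts only after the data is confirmed.
    os.pwrite(parity_fd, parity_block, offset)
    os.fsync(parity_fd)

with tempfile.TemporaryDirectory() as tmp:
    # Three files stand in for two data subdisks and one parity subdisk.
    fds = [os.open(os.path.join(tmp, f"sd{i}"), os.O_RDWR | os.O_CREAT)
           for i in range(3)]
    try:
        d0, d1 = b"AAAA", b"BBBB"
        p = bytes(a ^ b for a, b in zip(d0, d1))
        write_stripe(fds[:2], fds[2], 0, [d0, d1], p)
        readback = [os.pread(fd, 4, 0) for fd in fds]
    finally:
        for fd in fds:
            os.close(fd)
```

Waiting for the data writes costs latency, but it means a crash can
leave stale parity behind and never parity that is newer than the data.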

> Vinum will take the subdisk down because such drives should work with
> write reallocation enabled and such a disk is badly broken if you receive a
> write error.
>
> If the system panics or power fails during such a write, there is no way to
> find out if the parity is broken besides verifying the complete plex after
> reboot.  The problem should be the same with all usual hardware and software
> solutions.  Greg has already begun or finished recalculating and checking the
> parity.
> I assume that's the reason why some systems use 520 byte sectors: maybe they
> write timestamps or generation numbers in a single write within the sector.

In fact, the 520 byte sectors are used to protect against data
corruption between the disk and the controller.  They won't help in
this scenario.

Greg
--
Finger grog@lemis.com for PGP public key
See complete headers for address and phone numbers

