Date:      Fri, 6 Jul 2012 03:08:46 -0400
From:      Zaphod Beeblebrox <zbeeble@gmail.com>
To:        Jason Usher <jusher71@yahoo.com>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: vdev/pool math with combined raidzX vdevs...
Message-ID:  <CACpH0MemwZDCXsh4USzeFHUO8fbW09TSOYyVPa2dWmKc8N%2B=_Q@mail.gmail.com>
In-Reply-To: <1341537402.58301.YahooMailClassic@web122504.mail.ne1.yahoo.com>
References:  <1341537402.58301.YahooMailClassic@web122504.mail.ne1.yahoo.com>

Is there some penalty for not googling some basic stats course?  OK.
This is from memory (hint: you probably should google).

p(f) ... the probability of failure of one drive over some unit time (say
one year).  A two drive RAID-0 array has probability p(2dr0) = p(f) +
p(f).  That is (for the logic guys): the array fails if either
drive fails.  A two drive RAID-1 array has probability p(2dr1) = p(f)
* p(f) ... that is: the array fails only if both drives fail.  These
are simple probabilities.  They don't account for the fact that the
RAID-1 case can be made more complex ... ie: given a certain failure
distribution and a certain replacement distribution, what is the
chance of total failure given a single drive failure (ie: the 2nd drive
failing before you replace the first drive to fail).

... but you get the gist.  If no replacements are allowed and 10% of
drives fail in a year, the R0 array has a 20% chance of failing in 1
year and the R1 array has a 1% chance.
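Under those simplifying assumptions, the two-drive arithmetic is just a
couple of lines (a Python sketch; note that strict multiplication gives
p(f)^2 = 1% for the mirror, and the sum for R0 is the simple
approximation --- exact would be 1 - (1 - p(f))^2):

```python
# Two-drive arithmetic: assumes independent failures, no replacements,
# and a 10% per-drive annual failure probability.
p_f = 0.10

# RAID-0: the array fails if EITHER drive fails -> add the probabilities.
p_raid0 = p_f + p_f

# RAID-1: the array fails only if BOTH drives fail -> multiply.
p_raid1 = p_f * p_f

print(p_raid0)  # 0.2   -> 20%
print(p_raid1)  # ~0.01 -> 1%
```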

... this also says nothing about the fact that the drives are the same
model, have done mostly the same reads and writes, and that their
failures may not be independent.

Geez... it's getting complex.

Now... we start getting into the hard stuff.  For a RAID-Z(1) array,
you want to think about the probability of 2 failures out of 12 drives
(or ... if you're feeling up to it, the probability of the first
failure and then the probability of the second failure given the first
before you can replace it).  p(12drz1) = 12 * p(f) * 11 * p(f)  --- if
no replacements are allowed and drive failures are independent.  To
kick it up a notch, the "11 * p(f)" can be replaced with eleven times
the probability of failure before replacement --- which you can
calculate with your MTBF tables and your service level for replacing
drives in the array.

Similarly,

p(12drz2) = 12 * p(f) * 11 * p(f) * 10 * p(f)
p(12drz3) = 12 * p(f) * 11 * p(f) * 10 * p(f) * 9 * p(f)

... again with those assumptions, or more complex probabilities given
your replacement strategy.
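Those falling-factorial products are easy to check with a few lines of
Python (same assumptions: independent failures, no replacements; the
helper name p_raidz is mine):

```python
# RAID-Z arithmetic as above: n * p(f) * (n-1) * p(f) * ... with one
# factor per failure needed to lose the array (parity + 1 of them).
# Assumes independent failures and no replacements.
def p_raidz(n_drives, parity, p_f):
    prob = 1.0
    for i in range(parity + 1):       # parity + 1 failures lose the array
        prob *= (n_drives - i) * p_f
    return prob

p_f = 0.01
print(p_raidz(12, 1, p_f))   # 12 * 11 * p(f)^2          ~ 0.0132
print(p_raidz(12, 2, p_f))   # 12 * 11 * 10 * p(f)^3     ~ 0.00132
print(p_raidz(12, 3, p_f))   # 12 * 11 * 10 * 9 * p(f)^4 ~ 0.000119
```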

... so, again with simplistic assumptions,

p(36drz3 --- 12 drives, 3 groups) = p(12drz3) * 3

A "vanilla" RAID-Z2 (if I make an assumption to what you're saying) is:

p(36drz2) = 36 * p(f) * 35 * p(f) * 34 * p(f)

... but I can't directly answer your question without knowing a) the
structure of the RAID-Z2 array and b) p(f).  If we use a 1% figure for
p(f), then p(36drz3,12,3) = 0.035% and p(36drz2) = 4.3%

... that is, the RAID-Z2 case (one group of 36 drives, two redundant
--- which is crazy) is 4.3% likely to fail where the 3-group RAID-Z3
is only 0.035% likely to fail.  As a more sane comparison,
p(36drz2,12,3) = 0.4%
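Plugging p(f) = 1% into those products (a sketch; multiplying the group
probability by the group count is the same add-the-probabilities shortcut
used throughout --- under these assumptions the 3-group RAID-Z2 case works
out to about 0.4%):

```python
# The 36-drive comparison: p_f = 1%, no replacements, independence.
p_f = 0.01

# 3 groups of 12 in RAID-Z3: any one group failing loses the pool -> add.
p_z3_group = 12 * p_f * 11 * p_f * 10 * p_f * 9 * p_f
print(3 * p_z3_group)                    # ~0.00036 -> 0.035%

# One 36-drive RAID-Z2 group: three failures lose it.
print(36 * p_f * 35 * p_f * 34 * p_f)    # ~0.043   -> 4.3%

# 3 groups of 12 in RAID-Z2.
p_z2_group = 12 * p_f * 11 * p_f * 10 * p_f
print(3 * p_z2_group)                    # ~0.004   -> 0.4%
```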

Now it's worth saying that all these calculations assume that you
never come to replace drives in the array and that drive failures are
independent... neither of these is likely true.  If you had (say) a
4-hour 24/7 service contract with someone, the chances of more drives
failing before a failed drive is replaced are much smaller.  As for the
independence of drive failures... that's a discussion over beer.

Put simply, you add the probabilities of things where any can cause
the failure (either drive of R0 failing, any one of the 3 plexes of a
complex array failing) and you multiply things where all must fail to
produce failure.
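That rule, in a minimal sketch (the function names are mine; the
1 - product-of-survivals form is the exact version of "add the
probabilities"):

```python
# "Any one failing kills it" (striping, multiple groups) -> add, or
# exactly: 1 minus the product of the survival probabilities.
def p_any_fails(probs):
    survive = 1.0
    for p in probs:
        survive *= (1.0 - p)
    return 1.0 - survive

# "All must fail to kill it" (mirroring, redundancy) -> multiply.
def p_all_fail(probs):
    prob = 1.0
    for p in probs:
        prob *= p
    return prob

print(p_any_fails([0.10, 0.10]))  # ~0.19 (vs. 0.20 from the simple sum)
print(p_all_fail([0.10, 0.10]))   # ~0.01
```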


