From: "Valeri Galtsev" <galtsev@kicp.uchicago.edu>
To: "Daniel Feenberg"
Cc: "Valeri Galtsev", "Kevin P. Neal", "Shamim Shahriar", freebsd-questions@freebsd.org
Date: Tue, 19 Apr 2016 09:31:37 -0500 (CDT)
Subject: Re: [Phishing]Re: Raid 1+0
Message-ID: <30732.128.135.52.6.1461076297.squirrel@cosmo.uchicago.edu>
List-Id: User questions <freebsd-questions@freebsd.org>

Daniel, nice writeup. We definitely think along the same lines. (Sorry
about the top post.)

On Tue, April 19, 2016 7:10 am, Daniel Feenberg wrote:
>
> Why do drive failures come in pairs?
With a second failure of the same drive, it is usually a physical spot on
the platter covering more than one block, more than one track...

> [The following is based on Linux experience when the largest drives were
> 300GB - I think ZFS will do much better.]
>
> Most of the drives we have claim an MTBF of 500,000 hours. That's about
> 2% per year. With three drives, the chance of at least one failing is a
> little less than 6% (1-0.98^3). Our experience is that such numbers are
> at least a reasonable approximation of reality (but see Schroeder and
> Gibson, 2007).
>
> Suppose you have three drives in a RAID 5. If it takes 24 hours to
> replace and reconstruct a failed drive, one is tempted to calculate that
> the chance of a second drive failing before full redundancy is
> established is about .02/365 per drive, or about one in ten thousand for
> the two drives that remain. The total probability of a double failure
> seems like it should be about 6 in a million per year.

Yes, unlike me, you did not forget to multiply the probability of a single
drive failure by the number of drives in the array! My rusty brain needs
cleaning...

> Our double failure rate is worse than that - the many single drive
> failures are followed by a second drive failure before redundancy is
> established.

And here comes the question: were all drives "surface scanned" within a
week before the failure event? By that I mean anything from:

1. A RAID verify operation, which reads all stripes and compares them
against the redundant stripe (or stripes, in the case of RAID-6). This is
usually just a read operation - a read of all blocks of each drive - and
thanks to CRC, even a read usually discovers newly developed bad blocks
(although I agree, write-read-compare is better).

2. A drive surface scan run on all drives? Some hardware RAID units (like
Infortrend) have that.
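For what it's worth, on a FreeBSD box the kind of routine pass I mean is
easy to schedule with stock tools. A sketch only - the pool name "tank"
and device "ada0" are placeholders, and ZFS users can instead set the
daily_scrub_zfs_enable knob in /etc/periodic.conf:

```shell
# Example crontab entries.
#
# A ZFS scrub reads and checksums every allocated block -- effectively
# the "read of all blocks" verify operation described above:
0 3 * * 0   /sbin/zpool scrub tank

# A long SMART self-test makes the drive itself read its whole surface,
# the closest stock equivalent of a vendor "surface scan"
# (smartctl comes from the sysutils/smartmontools port):
0 4 * * 0   /usr/local/sbin/smartctl -t long /dev/ada0
```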
It is not explicitly disclosed what that is, but most probably it reads
(and remembers) a drive sector, writes to the sector, reads it back and
compares with what was written, then restores the original content of the
sector.

If neither of the above is scheduled to run routinely, then I would
consider the RAID not well configured. And the second failure, of another
drive, probably has nothing to do with the first failure that triggered
the RAID rebuild; it was just something that happened over a long period
of time and sat undiscovered. This resolves in my mind the mystery that
"drive failures come in pairs", which probability theory tells us is
unlikely. Of course, a rebuild is a stress, which may shift the
probability somewhat, but you had the same stress a week ago, and two
weeks ago, and so on - when you were doing RAID verification or a drive
surface scan.

Incidentally, if you allow a rebuild with a drive that failed non-fatally,
as I do, you will notice the same drive may fail during the first rebuild
attempt (and maybe during the second, and maybe the third), but finally
the rebuild may succeed. This repeated failure of the same drive is most
likely due to some bad spot on the physical surface covering several
sectors/tracks. The RAID rebuild forces discovery of them one at a time,
but once all of them are discovered and relocated, all goes well (that
comes from my 3ware-based RAIDs).

> This prevents rebuilding the array with a new drive replacing
> the original failed drive; however, you can probably recover most files
> if you stay in degraded mode and copy the files to a different location.
> It isn't that failures are correlated because drives are from the same
> batch, or the controller is at fault, or the environment is bad (common
> electrical spike or heat problem). The fault lies with the Linux md
> driver, which stops rebuilding parity after a drive failure at the first
> point it encounters an uncorrectable read error on the remaining "good"
> drives.
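The parity logic behind that behavior can be sketched in a few lines of
Python - a toy model of one RAID 5 stripe, not the md implementation:
with one drive already failed, XOR reconstruction needs every remaining
block, so a single unreadable sector on a surviving drive leaves the
stripe unreconstructable.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together (RAID 5 parity over one stripe)."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# A toy 3-drive stripe: two data blocks plus their parity block.
d0, d1 = b"\x11\x22", b"\x33\x44"
parity = xor_blocks([d0, d1])

# Drive 0 fails outright: its block is recoverable from the survivors.
rebuilt_d0 = xor_blocks([d1, parity])
assert rebuilt_d0 == d0

# But if drive 1 now returns an unreadable sector (modeled as None),
# the stripe cannot be reconstructed at all.
surviving = [None, parity]      # uncorrectable read error on the "good" drive
recoverable = all(b is not None for b in surviving)
print("stripe recoverable:", recoverable)   # False -- this is where md gives up
```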
> Of course, with two drives unavailable there isn't an unambiguous
> reconstruction of the bad sector, so it might be best to go to the
> backups instead of continuing. At least that is apparently the reason
> for the decision.
>
> Alternatively, if the first failed drive was readable on that sector
> (even if not on some other sectors), it should be possible to fully
> recover all the data with a high degree of confidence even if a second
> drive fails later. Since that is far from an unusual situation (a drive
> will be failed for a single uncorrectable error even if further reads
> are possible on other sectors), it isn't clear to us why that isn't
> done. [Lack of a slot for the bad drive?] Even if that sector isn't
> readable, logging the bad block, writing something recognizable to the
> targets, and going on might be better than simply giving up.

Agreed: a bad stripe on one drive shouldn't necessarily kick that whole
drive out of the array, because when a bad stripe turns up on another
drive elsewhere, this first drive may still supply a good copy of its
stripe. Not 100% safe to assume, but still decently safe.

> A single unreadable sector isn't unusual among the hundreds of millions
> of sectors on a modern drive. If the sector has never been written to,
> there is no occasion for the drive electronics or the OS to even know it
> is bad. If the OS tried to write to it, the drive would automatically
> remap the sector and no damage would be done - not even a log entry. But
> that one bad sector will render the entire array unrecoverable, no
> matter where on the disk it is, if one other drive has already been
> failed.
>
> Let's repeat the reliability calculation with our new knowledge of the
> situation. In our experience, perhaps half of drives have at least one
> unreadable sector in the first year. Again assume a 6 percent chance of
> a single failure. The chance of at least one of the remaining two drives
> having a bad sector is 75% (1-(1-.5)^2).
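Both the naive estimate from earlier in the message and this revised one
can be reproduced in a few lines of Python - a sketch using only the
figures quoted above (2%/year per-drive failure rate, a 24-hour rebuild
window, and a 50% first-year bad-sector rate):

```python
# Reproduce the back-of-the-envelope RAID failure arithmetic from the text.
p_drive = 0.02                      # annual failure probability per drive (~500,000 h MTBF)
n = 3                               # drives in the RAID 5

p_single = 1 - (1 - p_drive) ** n   # at least one of three fails: a little under 6%

# Naive double-failure estimate: one of the two remaining drives fails
# during the 24-hour rebuild window.
p_second = 1 - (1 - p_drive / 365) ** 2        # roughly one in ten thousand
p_double_naive = p_single * p_second           # ~6 in a million per year

# Revised estimate: a latent unreadable sector on either remaining drive
# also kills the rebuild.
p_bad_sector = 0.5                             # per drive, first year
p_latent = 1 - (1 - p_bad_sector) ** 2         # 75%
p_raid5 = p_single * p_latent                  # ~4.5%/year

p_raid0 = 1 - (1 - p_drive) ** 2               # two-drive RAID 0 of same capacity: ~4%/year

print(f"single failure:        {p_single:.4f}")
print(f"naive double failure:  {p_double_naive:.2e}")
print(f"revised RAID 5 loss:   {p_raid5:.4f} vs RAID 0 {p_raid0:.4f}")
```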
> So the RAID 5 failure rate is about 4.5%/year, which is 0.5% MORE than
> the 4% failure rate one would expect from a two-drive RAID 0 with the
> same capacity. Alternatively, if you just had two drives with a
> partition on each and no RAID of any kind, the chance of a failure would
> still be 4%/year but with only half the data loss per incident, which is
> considerably better than the RAID 5 can even hope for under the current
> reconstruction policy, even with the most expensive hardware.
>
> The 3ware controller has a "continue on error" rebuild policy available
> as an option in the array setup. But we would really like to know more
> about just what that means. What do the apparently similar RAID
> controllers from Mylex, LSI Logic and Adaptec do about this? A look at
> their web sites reveals no information. For some time now we have stuck
> with software RAID, because it renders the drives pretty much hardware
> independent and there doesn't appear to be much of a performance loss.

Daniel, I enjoyed the reading! Thanks!

I have noticed that I observe a lower drive failure rate on my boxes than
the numbers you mention. How do you get your drives? Do you stick with
particular manufacturers, drive models, vendors? Or is it all random? I'm
trying to figure out whether the fact that I am very picky about hard
drives (and this is almost the only computer component I am picky about)
really pays off in my case. I do my best to stick to manufacturers with
the best reliability record, and I avoid all "green" drives of any sort.
If a drive of a particular model is manufactured in different sizes, I
always go with the largest size (I noticed some time ago that they run
production of the smaller sizes on the poorer production lines they own,
whereas they don't dare to do the same with the largest drives of a given
model). I do my best to choose drives (or avoid drives) based on the
geographical location of the production line they were made on, whenever
I can (if you know what I mean).
Sometimes the cost may be 5% or so higher (pricegrabber is your enemy
here), but the cost to my Department of my time not being wasted on
failures one can avoid justifies that, IMHO.

I really would love everybody's comments on this last paragraph.

Valeri

>
> daniel feenberg
>

++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++