From: "Valeri Galtsev" <galtsev@kicp.uchicago.edu>
To: "Daniel Feenberg"
Cc: "Valeri Galtsev", "Kevin P. Neal", "Shamim Shahriar", freebsd-questions@freebsd.org
Date: Tue, 19 Apr 2016 09:31:37 -0500 (CDT)
Subject: Re: [Phishing]Re: Raid 1+0
Message-ID: <30732.128.135.52.6.1461076297.squirrel@cosmo.uchicago.edu>
List-Id: User questions <freebsd-questions@freebsd.org>

Daniel, nice writeup. We definitely think along the same lines. (Sorry
about the top post.)

On Tue, April 19, 2016 7:10 am, Daniel Feenberg wrote:
>
> Why do drive failures come in pairs?
With a second failure of the same drive, it is usually a physical spot on
the platter covering more than one block, more than one track...

> [The following is based on Linux experience when the largest drives were
> 300GB - I think ZFS will do much better.]
>
> Most of the drives we have claim an MTBF of 500,000 hours. That's about
> 2% per year. With three drives, the chance of at least one failing is a
> little less than 6% (1-0.98^3). Our experience is that such numbers are
> at least a reasonable approximation of reality (but see Schroeder and
> Gibson, 2007).
>
> Suppose you have three drives in a RAID 5. If it takes 24 hours to
> replace and reconstruct a failed drive, one is tempted to calculate that
> the chance of a second drive failing before full redundancy is
> established is about .02/365 per drive, or about one in ten thousand for
> the two drives that remain. The total probability of a double failure
> seems like it should be about 6 in a million per year.

Yes, unlike me, you did not forget to multiply the probability of a single
drive failure by the number of drives in the array! My rusty brain needs
cleaning...

> Our double failure rate is worse than that - the many single drive
> failures are followed by a second drive failure before redundancy is
> established.

And here comes the question: were all drives "surface scanned" within a
week before the failure event? By that I mean anything from:

1. A RAID verify operation, which reads all stripes and compares them
against the redundant stripe (or stripes, in the case of RAID-6). This is
usually just a read operation - a read of all blocks of each drive - and
thanks to CRC, even a read usually discovers newly developed bad blocks
(although I agree, write-read-compare is better).

2. A drive surface scan run on all drives? Some hardware RAID units (like
Infortrend) have that.
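For what it's worth, on a FreeBSD box the kind of routine pass I mean is
easy to schedule with stock tools. A sketch only - the pool name "tank"
and device "ada0" are placeholders, and ZFS users can instead set the
daily_scrub_zfs_enable knob in /etc/periodic.conf:

```shell
# Example crontab entries.
#
# A ZFS scrub reads and checksums every allocated block -- effectively
# the "read of all blocks" verify operation described above:
0 3 * * 0   /sbin/zpool scrub tank

# A long SMART self-test makes the drive itself read its whole surface,
# the closest stock equivalent of a vendor "surface scan"
# (smartctl comes from the sysutils/smartmontools port):
0 4 * * 0   /usr/local/sbin/smartctl -t long /dev/ada0
```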
It is not explicitly disclosed what that is, but most probably it reads
(and remembers) a drive sector, writes to the sector, reads it back and
compares with what was written, then restores the original content of the
sector.

If neither of the above is scheduled to run routinely, then I would
consider the RAID not well configured. And the second failure, of another
drive, probably has nothing to do with the first failure that triggered
the RAID rebuild; it was just something that happened over a long period
of time and sat undiscovered. This resolves in my mind the mystery that
"drive failures come in pairs", which probability theory tells us is
unlikely. Of course, a rebuild is a stress, which may shift the
probability somewhat, but you had the same stress a week ago, and two
weeks ago, and so on - when you were doing RAID verification or a drive
surface scan.

Incidentally, if you allow a rebuild with a drive that failed non-fatally,
as I do, you will notice the same drive may fail during the first rebuild
attempt (and maybe during the second, and maybe the third), but finally
the rebuild may succeed. This repeated failure of the same drive is most
likely due to some bad spot on the physical surface covering several
sectors/tracks. The RAID rebuild forces discovery of them one at a time,
but once all of them are discovered and relocated, all goes well (that
comes from my 3ware-based RAIDs).

> This prevents rebuilding the array with a new drive replacing
> the original failed drive; however, you can probably recover most files
> if you stay in degraded mode and copy the files to a different location.
> It isn't that failures are correlated because drives are from the same
> batch, or the controller is at fault, or the environment is bad (common
> electrical spike or heat problem). The fault lies with the Linux md
> driver, which stops rebuilding parity after a drive failure at the first
> point it encounters an uncorrectable read error on the remaining "good"
> drives.
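The parity logic behind that behavior can be sketched in a few lines of
Python - a toy model of one RAID 5 stripe, not the md implementation:
with one drive already failed, XOR reconstruction needs every remaining
block, so a single unreadable sector on a surviving drive leaves the
stripe unreconstructable.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together (RAID 5 parity over one stripe)."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# A toy 3-drive stripe: two data blocks plus their parity block.
d0, d1 = b"\x11\x22", b"\x33\x44"
parity = xor_blocks([d0, d1])

# Drive 0 fails outright: its block is recoverable from the survivors.
rebuilt_d0 = xor_blocks([d1, parity])
assert rebuilt_d0 == d0

# But if drive 1 now returns an unreadable sector (modeled as None),
# the stripe cannot be reconstructed at all.
surviving = [None, parity]      # uncorrectable read error on the "good" drive
recoverable = all(b is not None for b in surviving)
print("stripe recoverable:", recoverable)   # False -- this is where md gives up
```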
> Of course, with two drives unavailable there isn't an unambiguous
> reconstruction of the bad sector, so it might be best to go to the
> backups instead of continuing. At least that is apparently the reason
> for the decision.
>
> Alternatively, if the first failed drive was readable on that sector
> (even if not on some other sectors), it should be possible to fully
> recover all the data with a high degree of confidence even if a second
> drive fails later. Since that is far from an unusual situation (a drive
> will be failed for a single uncorrectable error even if further reads
> are possible on other sectors), it isn't clear to us why that isn't
> done. [Lack of a slot for the bad drive?] Even if that sector isn't
> readable, logging the bad block, writing something recognizable to the
> targets, and going on might be better than simply giving up.

Agreed: a bad stripe on one drive shouldn't necessarily kick that whole
drive out of the array, because when a bad stripe turns up on another
drive elsewhere, this first drive may still supply a good copy of its
stripe. Not 100% safe to assume, but still decently safe.

> A single unreadable sector isn't unusual among the hundreds of millions
> of sectors on a modern drive. If the sector has never been written to,
> there is no occasion for the drive electronics or the OS to even know it
> is bad. If the OS tried to write to it, the drive would automatically
> remap the sector and no damage would be done - not even a log entry. But
> that one bad sector will render the entire array unrecoverable, no
> matter where on the disk it is, if one other drive has already been
> failed.
>
> Let's repeat the reliability calculation with our new knowledge of the
> situation. In our experience, perhaps half of drives have at least one
> unreadable sector in the first year. Again assume a 6 percent chance of
> a single failure. The chance of at least one of the remaining two drives
> having a bad sector is 75% (1-(1-.5)^2).
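Both the naive estimate from earlier in the message and this revised one
can be reproduced in a few lines of Python - a sketch using only the
figures quoted above (2%/year per-drive failure rate, a 24-hour rebuild
window, and a 50% first-year bad-sector rate):

```python
# Reproduce the back-of-the-envelope RAID failure arithmetic from the text.
p_drive = 0.02                      # annual failure probability per drive (~500,000 h MTBF)
n = 3                               # drives in the RAID 5

p_single = 1 - (1 - p_drive) ** n   # at least one of three fails: a little under 6%

# Naive double-failure estimate: one of the two remaining drives fails
# during the 24-hour rebuild window.
p_second = 1 - (1 - p_drive / 365) ** 2        # roughly one in ten thousand
p_double_naive = p_single * p_second           # ~6 in a million per year

# Revised estimate: a latent unreadable sector on either remaining drive
# also kills the rebuild.
p_bad_sector = 0.5                             # per drive, first year
p_latent = 1 - (1 - p_bad_sector) ** 2         # 75%
p_raid5 = p_single * p_latent                  # ~4.5%/year

p_raid0 = 1 - (1 - p_drive) ** 2               # two-drive RAID 0 of same capacity: ~4%/year

print(f"single failure:        {p_single:.4f}")
print(f"naive double failure:  {p_double_naive:.2e}")
print(f"revised RAID 5 loss:   {p_raid5:.4f} vs RAID 0 {p_raid0:.4f}")
```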
> So the RAID 5 failure rate is about 4.5%/year, which is 0.5% MORE than
> the 4% failure rate one would expect from a two-drive RAID 0 with the
> same capacity. Alternatively, if you just had two drives with a
> partition on each and no RAID of any kind, the chance of a failure would
> still be 4%/year but with only half the data loss per incident, which is
> considerably better than the RAID 5 can even hope for under the current
> reconstruction policy, even with the most expensive hardware.
>
> The 3ware controller has a "continue on error" rebuild policy available
> as an option in the array setup. But we would really like to know more
> about just what that means. What do the apparently similar RAID
> controllers from Mylex, LSI Logic and Adaptec do about this? A look at
> their web sites reveals no information. For some time now we have stuck
> with software RAID, because it renders the drives pretty much hardware
> independent and there doesn't appear to be much of a performance loss.

Daniel, I enjoyed the reading! Thanks!

I have noticed that I observe a lower drive failure rate on my boxes than
the numbers you mention. How do you get your drives? Do you stick with
particular manufacturers, drive models, vendors? Or is it all random? I'm
trying to figure out whether the fact that I am very picky about hard
drives (and this is almost the only computer component I am picky about)
really pays off in my case. I do my best to stick to manufacturers with
the best reliability record, and I avoid all "green" drives of any sort.
If a drive of a particular model is manufactured in different sizes, I
always go with the largest size (I noticed some time ago that they run
production of the smaller sizes on the poorer production lines they own,
whereas they don't dare to do the same with the largest drives of a given
model). I do my best to choose drives (or avoid drives) based on the
geographical location of the production line they were made on, whenever
I can (if you know what I mean).
Sometimes the cost may be 5% or so higher (pricegrabber is your enemy
here), but the cost to my Department of my time not being wasted on
failures one can avoid justifies that, IMHO.

I really would love everybody's comments on this last paragraph.

Valeri

>
> daniel feenberg
>

++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++