Date: Sat, 02 Aug 2014 01:21:54 -0500
From: Scott Bennett <bennett@sdf.org>
To: freebsd-questions@freebsd.org
Cc: Paul Kraus <paul@kraus-haus.org>
Subject: Re: gvinum raid5 vs. ZFS raidz
Message-ID: <201408020621.s726LsiA024208@sdf.org>
On Tue, 29 Jul 2014 12:01:36 -0400 Paul Kraus <paul@kraus-haus.org> wrote:

>On Jul 29, 2014, at 4:27, Scott Bennett <bennett@sdf.org> wrote:
>
>> I want to set up a couple of software-based RAID devices across
>> identically sized partitions on several disks.  At first I thought that
>> gvinum's raid5 would be the way to go, but now that I have finally found
>> and read some information about raidz, I am unsure which to choose.  My
>> current, and possibly wrong, understanding about the two methods' most
>> important features (to me, at least) can be summarized as follows.
>
>Disclaimer, I have experience with ZFS but not your other alternative.

     Okay, I appreciate the ZFS info anyway.  Maybe someone with gvinum
experience will weigh in at some point.
>
>https://www.listbox.com/subscribe/?listname=zfs@lists.illumos.org

     Thanks.  I'll check into it.
>
>>              raid5                                   raidz
>>
>> Has parity checking, but any parity          Has parity checking *and*
>> errors identified are assumed to be          frequently spaced checksums
>
>ZFS checksums all data for errors.  If there is redundancy (mirror,
>raid, copies > 1) ZFS will transparently repair damaged data (but
>increment the "checksum" error count so you can know via the zpool
>status command that you *are* hitting errors).
>
><snip>
>
>> Can be expanded by the addition of more      Can only be expanded by
>> spindles via a "gvinum grow" operation.      replacing all components with
>>                                              larger components.  The number
>
>All ZFS devices are derived from what are called top level vdevs
>(virtual devices).  The data is striped across all of the top level
>vdevs.  Each vdev may be composed of a single drive, mirror, or raid
>(z1, z2, or z3).  So you can create a mixed zpool (not recommended for
>a variety of reasons) with a different type of vdev for each vdev.  The
>way to expand any ZFS zpool is to add additional vdevs (beyond the
>replace all drives in a single vdev and then grow to fill the new
>drives).  So you can create a zpool with one raidz1 vdev and then later
>add a second raidz1 vdev.  Or more commonly, start with a mirror vdev
>and then add a second, third, fourth (etc.) mirror vdev.

     [Ouch.  Trying to edit a response into entire paragraphs on single
lines is a drag.]
>
>It is this two tier structure that is one of ZFS's strengths.  It is
>also a feature that is not well understood.
>
     I understood that, but apparently I didn't express it well enough
in my comparison table.  Thanks, though, for the confirmation of what I
wrote.  GEOM devices can be built upon other GEOM devices, too, as can
gvinum devices within some constraints.
><snip>
>
>> Does not support migration to any other      Does not support migration
>> RAID levels or their equivalents.            between raidz levels, even by
>
>Correct.  Once you have created a vdev, that vdev must remain the same
>type.  You can add mirrors to a mirror vdev, but you cannot add drives
>or change raid level to raidz1, raidz2, or raidz3 vdevs.

     Too bad.  Increasing the raidz level ought to be not much more
difficult than growing the raidz device by adding more spindles.  Doing
the latter ought to be no more difficult than doing it with gvinum's
stripe or raid5 devices.  Perhaps the ZFS developers will eventually
implement these capabilities.  (A side thought: gstripe and graid3
devices ought also to be expandable in this manner, although the
resulting number of graid3 components would still need to be 2^n + 1.)
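
     For the archives, here is a rough sketch of the vdev-based
expansion Paul describes, using hypothetical disk names (ada0 through
ada5); check zpool(8) before taking my word for the exact syntax:

        # pool starts life as a single three-disk raidz1 vdev
        zpool create tank raidz1 ada0 ada1 ada2

        # later, capacity is grown by adding a second raidz1 vdev;
        # new writes are then striped across both vdevs
        zpool add tank raidz1 ada3 ada4 ada5

        # the resulting two-vdev layout shows up in
        zpool status tank
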
>
><snip>
>
>> Does not support additional parity           Supports one (raidz2) or two
>> dimensions a la RAID6.                       (raidz3) additional parity
>
>ZFS parity is handled slightly differently than for traditional raid-5
>(as well as the striping of data / parity blocks).  So you cannot just
>count on losing 1, 2, or 3 drives worth of space to parity.  See Matt
>Ahren's Blog entry here
>http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/ for
>(probably) more data on this than you want :-)  And here
>https://docs.google.com/a/delphix.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=2126998674
>is his spreadsheet that relates space lost due to parity to number of
>drives in raidz vdev and data block size (yes, the amount of space lost
>to parity varies with data block size, not configured filesystem block
>size!).  There is a separate tab for each of RAIDz1, RAIDz2, and RAIDz3.
>
     Yes, I had found both of those by following links from the ZFS
material at the freebsd.org web site.  However, lynx(1) is the only web
browser I can use at present because X11 was screwed on my system by an
update that changed the ABI for the server and various loadable
modules, but did not update the keyboard driver module or the pointing
device driver module.  If I start X up, it rejects those two driver
modules due to the incompatible ABIs, so I have no further influence on
the system short of an ACPI shutdown triggered by pushing the power
button briefly.  Until I get the disk situation settled, I have no easy
way to rebuild X11.  Anyway, using lynx(1), it is very hard to make any
sense of the spreadsheet.
><snip>
>
>> Fast performance because each block          Slower performance because each
>> is on a separate spindle from the            block is spread across all
>> previous and next blocks.                    spindles a la RAID3, so many
>>                                              simultaneous I/O operations are
>>                                              required for each block.
>
>ZFS performance is never that simple as I/O is requested from the
>drives in parallel.  Unless you are saturating the controller you
>should be able to keep all the drives busy at once.  Also note that ZFS
>does NOT suffer the RAID-5 read-modify-write penalty on writes as every
>write is a new write to disk (there is no modification of existing disk
>blocks), this is referred to as being Copy On Write (COW).
>
     Again, your use of single-line paragraphs makes it tough to
respond to your several points in-line.
     The information that I read on-line said that each raidz data
block is distributed across all devices in the raidzN device, just like
in RAID3 or RAID4.  That means that, whether reading or writing one
data block, *all* of the drives require a read or a write, not just one
as would be the case in RAID5.  So a raidzN device will require N I/O
operations * m data blocks to be read/written, not just m I/O
operations.  That was the point I was making in the table entry above,
i.e., ZFS raidz, like RAID3 and RAID4, is many times as I/O-intensive
as RAID5.  In essence, reading or writing 100 data blocks from a raidz
is, at best, no faster than reading 100 blocks from a single drive.  At
worst, there will be bus conflicts leading to overruns and full
rotation delays in the process of gathering all the fragments in a
block, thus performing even slower than a single drive.  I.e., raidzN
offers no speed advantage to using multiple spindles, just like
RAID3/RAID4.  In other words, the data are not really striped but
rather distributed in parallel.  So I guess the question is, was what I
read about raidz incorrect, i.e., are individual data blocks *not*
divided into a fragment on each and every spindle minus the raidz level
(number of parity dimensions)?

>>                              -----------------------
>>      I hoped to start with a minimal number of components and
>> eventually add more components to increase the space available in the
>> raid5 or raidz devices.  Increasing their sizes that way would also
>> increase the total percentage of space in the devices devoted to data
>> rather than parity, as well as improving the performance enhancement
>> of the striping.  For various reasons, having to replace all component
>> spindles with larger-capacity components is not a viable method of
>> increasing the size of the raid5 or raidz devices in my case.  That
>> would appear to rule out raidz.
>
>Yup.

     Bummer.  Oh, well.
>
>>      OTOH, the very large-capacity drives available in the last two or
>> three years appear not to be very reliable(*) compared to older drives
>> of 1 TB or smaller capacities.  gvinum's raid5 appears not to offer
>> good protection against, nor any repair of, damaged data blocks.
>
>Yup.  Unless you use ZFS plan on suffering silent data corruption due
>to the uncorrectable (and undetectable by the drive) error rate off of
>large drives.  All drives suffer uncorrectable errors, read errors that
>the drive itself does not realize are errors.  With traditional
>filesystems this bad data is returned to the OS and in some cases may
>cause a filesystem panic and in others just bad data returned to the
>application.  This is one of the HUGE benefits of ZFS, it catches those
>errors.
>
     I think you've convinced me right there.  Although RAID[3456]
offers protection against drive failures, it offers no protection
against silent data corruption, which seems to be common on the
large-capacity drives on the market for the last three or four years.
><snip>
>
>> Thanks to three failed external drives and
>> apparently not fully reliable replacements, compounded by a bad ports
>> update two or three months ago, I have no functioning X11 and no space
>> set up any longer in which to build ports to fix the X11 problem, so I
>> really want to get the disk situation settled ASAP.  Trying to keep
>> track of everything using only syscons and window(1) is wearing my
>> patience awfully thin.
>
>My home server is ZFS only and I have 2 drives mirrored for the OS and
>5 drives in a raidz2 for data with one hot spare.  I have suffered 3
>drive failures (all Seagate), two of which took the four drives in my
>external enclosure offline (damn sata port multipliers).  I have had NO
>data loss or corruption!

     Bravo, then.  Looks like ZFS raidz is what I need.  Unfortunately,
I only have four drives available for the raidz at present, so it looks
like I'll need to save up for at least one additional drive and
probably two for a raidz2 that doesn't sacrifice an unacceptably high
fraction of the total space to parity blocks. :-(  On my "budget" (ha!)
that could be several months or more, by which time three of the four I
currently have will be out of warranty.  I suppose more failures could
also occur during that time.  Sigh.
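
     In case it helps anyone else reading along, here is a rough sketch
of the sort of layout Paul describes, again with made-up device names;
a five-disk raidz2 built from 1 TB drives nets roughly 3 TB, since two
drives' worth of space goes to parity:

        # five-disk raidz2 vdev for data, plus a hot spare
        zpool create tank raidz2 ada0 ada1 ada2 ada3 ada4 spare ada5

        # a failed disk can also be swapped out by hand
        zpool replace tank ada2 ada6
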
>
>I started like you, wanting to have some drives and add more later.  I
>started with a pair of 1TB drives mirrored, then added a second pair to
>double my capacity.  The problem with 2-way mirrors is that the MTTDL
>(Mean Time To Data Loss) is much lower than with RAIDz2, with similar
>cost in spec for a 4 disk configuration.  After I had a drive fail in
>the mirror configuration, I ordered a replacement and crossed my
>fingers that the other half to *that* mirror would not fail (the pairs
>of drives in the mirrors were the same make / model bought at the same
>time -- not a good bet for reliability).  When I got the replacement
>drive(s) I took some time and rebuilt my configuration to better handle
>growth and reliability by going from a 4 disk 2-way mirror
>configuration to a 5 disk RAIDz2.  I went from net about 2TB to net
>about 3TB capacity and a hot spare.

     Yeah, the mirrors never did look to me to be as good an option
either.
>
>If being able to easily grow capacity is the primary goal I would go
>with a 2-way mirror configuration and always include a hot spare (so
>that *when* a drive fails it immediately starts resilvering (the ZFS
>term for syncing) the vdev).  Then you can simply add pairs of drives
>to add capacity.  Just make sure that the hot spare is at least as
>large as the largest drive in use.  When you buy drives, always buy
>from as many different manufacturers and models as you can.  I just
>bought four 2TB drives for my backup server.  One is a WD, the other 3
>are HGST but they are four different model drives, so that they did not
>come off the same production line on the same week as each other.  If I
>could have I would have gotten four different manufacturers.  I also
>only buy server class (rated for 24x7 operation with 5 year warranty)
>drives.  The additional cost has been offset by the savings due to
>being able to have a failed drive replaced under warranty.

     I'm not familiar with HGST, but I will look into their products.
Where does one find the server-class drives for sale?  What sort of
price difference is there between the server-class and the ordinary
drives?
     And yes, I did run across the silly term used in ZFS for
rebuilding a drive's contents. :-}
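
     And a similar sketch of the grow-by-mirror-pairs approach, again
with made-up device names; per Paul, the hot spare gets resilvered into
a mirror as soon as one of its drives fails:

        # two-disk mirror plus a hot spare at least as large as either disk
        zpool create tank mirror ada0 ada1 spare ada2

        # capacity is added later, one mirrored pair at a time
        zpool add tank mirror ada3 ada4
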
>
>> (*) [Last year I got two defective 3 TB drives in a row from Seagate.
>
>Wow, the only time I have seen that kind of failure rate was buying
>from Newegg when they were packing them badly.

     At that time, the shop that was getting them for me (to put into a
third-party case with certain interfaces I needed at the time) told me
that, of the 3 TB Seagate drives they had gotten for their own use and
also for sale to customers who wanted them, only roughly 50% survived
past their first 30 days of use, and that none of the Western Digital
3 TB drives had survived that long.  I concluded that the 3 TB drives
were not yet ready for prime time and should not have been marketed as
early as they were.  That was the reason for my insisting upon a 2 TB
Seagate to fill the third-party case.
>
>> I ended up settling for a 2 TB Seagate that is still running fine AFAIK.
>> While that process was going on, I bought three 2 TB Seagate drives in
>> external cases with USB 3.0 interfaces, two of which failed outright
>> after about 12 months and have been replaced with two refurbished
>> drives under warranty.
>
>Yup, they all replace failed drives with refurb.
>
>As a side note, on my home server I have had 6 Seagate ES.2 or ES.3
>drives, 2 HGST UltraStar drives, and 2 WD RE4 in service on my home
>server.  I have had 3 of the Seagates fail (and one of the Seagate
>replacements has failed, still under warranty).  I have not had any
>HGST or WD drives fail (and they both have better performance than the
>Seagates).  This does not mean that I do not buy Seagate drives.  I
>spread my purchases around, keeping to the 24x7 5 year warranty drives
>and follow up when I have a failure.

     I had a WD 1 TB drive fail last year.  It was just over three
years old at the time.
>
>> While waiting for those replacements to arrive, I bought
>> a 2 TB Samsung drive in an external case with a USB 3.0 interface.  I
>> discovered by chance that copying very large files to these drives is
>> an error-prone process.
>
>I would suspect the USB 3.0 layer problem, but that is just a guess.

     There has been no evidence to support that conjecture so far.
What the guy at Samsung/Seagate (they appear to be the same company
now) told me was that what I described did not mean that the drive was
bad, but instead was a common event with large-capacity drives.  He
seemed to think that the problems were associated with long-running
series of write operations, though he had no explanation for that.  It
seems to me that such errors being considered "normal" for these newer,
larger-capacity drives indicates the adoption of a drastically lowered
standard of quality, as compared to just a few years ago.  And if that
is to be the way of disks from now on, then self-correcting file
systems will soon become the only acceptable file systems for
production use outside of scratch areas.
>
>> A roughly 1.1 TB file on the one surviving external
>> Seagate drive from last year's purchase of three, when copied to the
>> Samsung drive, showed no I/O errors during the copy operation.
>> However, a comparison check using "cmp -l -z originalfile
>> copyoforiginal" shows quite a few places where the contents don't
>> match.
>
>ZFS would not tolerate those kinds of errors.  On reading the file ZFS
>would know via the checksum that the file was bad.

     And ZFS would attempt to rewrite the bad block(s) with the correct
contents?  If so, would it then read back what it had written to make
sure the errors had, in fact, been corrected on the disk(s)?
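
     In any case, for future reference, the checking and repair can
also be driven by hand with a scrub, which re-reads every block in the
pool and verifies it against its checksum; a minimal sketch, assuming a
pool named "tank":

        # walk all data in the pool, repairing anything that fails its
        # checksum wherever redundancy allows
        zpool scrub tank

        # afterward, show the per-device read/write/checksum error counts
        zpool status -v tank
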
>
>> The same procedure
>> applied to one of the refurbished Seagates gives similar results,
>> although the locations and numbers of differing bytes are different
>> from those on the Samsung drive.  The same procedure applied to the
>> other refurbished drive resulted in a good copy the first time, but a
>> later repetition ended up with a copied file that differed from the
>> original by a single bit in each of two widely separated places in the
>> files.  These problems have raised the priority of a self-healing RAID
>> device in my mind.
>
>Self healing RAID will be of little help? See more below

     Why would it be of little help?  What you wrote here seems to
suggest that it would be very helpful, at least for dealing with the
kind of trouble that caused me to start this thread.  Was the above
just a typo of some kind?
>
>> I have to say that these are new experiences to me.  The disk drives,
>> controllers, etc. that I grew up with all had parity checking in the
>> hardware, including the data encoded on the disks, so single-bit
>> errors anywhere in the process showed up as hardware I/O errors
>> instantly.  If the errors were not eliminated during a limited number
>> of retries, they ended up as permanent I/O errors that a human would
>> have to resolve at some point.
>
>What controllers and drives?  I have never seen a drive that does NOT
>have uncorrectable errors (these are undetectable by the drive).  I
>have also never seen a controller that checksums the data.  The
>controllers rely on the drive to report errors.  If the drive does not
>report an error, then the controller trusts the data.

     Hmm... [scratches head a moment]  Well, IBM 1311, 2305, 2311,
2314, 3330, 3350, 3380, third-party equivalents of those, DEC RA80,
Harris disks (model numbers forgotten), HP disks (numbers also
forgotten), Prime disks (ditto).  Maybe some others that escape me now.
Tape drives until the early 1990s that I worked with were all 9-track,
so each byte was written across 8 data tracks and 1 parity track.  Then
we got a cartridge-based system, and I *think* it may have been
10-track (i.e., 2 parity tracks).  Those computers and I/O subsystems
and media had a parity bit for each byte from the CPU and memory all
the way out to the oxide on the media.  Any time odd parity was broken,
the hardware detected it and passed an indication of the error back to
the operating system.
>
>The big difference is that with drives under 1TB the odds of running
>into an uncorrectable error over the life of the drive is very, very
>small.  The uncorrectable error rate does NOT scale down as the drives
>scale up.  It has been stable at 1 in 10^14 (for cheap drives) to 1 in
>10^15 (for good drives) for over the past 10 years (when I started
>looking at that drive spec).  So if the rate is not changing and the
>total amount of data written / read over the life of the drive goes up
>by, in some cases, orders of magnitude, the real world occurrence of
>such errors is increasing.

     Interesting.  I wondered if that were all there were to it, rather
than the rate per gigabyte increasing due to the increased recording
density at the larger capacities.  I do think that newer drives have a
shorter MTBF than the drives of a decade ago, however.  I know that
none of the ones I've seen fail had served anywhere near the 300,000+
hours that the manufacturers were citing as MTBF values for their
products.  One "feature" of many newer drives is an automatic spindown
whenever the drive has been inactive for a short time.  The
heating/cooling cycles that result from these "energy-saving" or
"standby" responses look to me like probable culprits for drive
failures.
     The spindowns also mean that such a drive cannot have a
paging/swapping area on it because the kernel will not wait five to ten
seconds (while the drive spins up) for a pagein to complete.  Instead,
it will log an error message on the console and will terminate the
process that needed the page.
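
     To put that 1 in 10^14 figure into everyday units (my arithmetic,
not Paul's, and assuming a drive that performs exactly to spec): 10^14
bits is about 12.5 TB, so a few end-to-end passes over a modern
multi-terabyte drive make hitting at least one such error more likely
than not.

        # terabytes read per expected unrecoverable error at 1 in 10^14 bits
        echo 'scale=2; 10^14 / 8 / 10^12' | bc    # prints 12.50
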
>
>>      FWIW, I also discovered that I cannot run two such multi-hour-long
>> copy operations in parallel using two separate pairs of drives.
>> Running them together seems to go okay for a while, but eventually
>> always results in a panic.  This is on 9.2-STABLE (r264339).  I know
>> that that is not up to date, but I can't do anything about that until
>> my disk hardware situation is settled.]
>
>I have had mixed luck with large copy operations via USB on FreeBSD
>9.x.  Under 9.1 I have found it to be completely unreliable.  With 9.2
>I have managed without too many errors.  USB really does not seem to be
>a good transport for large quantities of data at fast rates.  See my
>rant on USB hubs here: http://pk1048.com/usb-beware/

     I was referring to kernel panics, not I/O errors.  These very long
copy operations all complete normally when run serially.  The panics
occur only when I run two such copies in parallel.
>
     I took a look at that link.  I've had good luck with Dynex USB 2.0
hubs, both powered and unpowered, but I've only bought their 4-port
models, not the 7-port ones.  One of mine recently failed after at
least five years of service, possibly as long as seven years.
     However, the only hard drive I currently have connected via USB
2.0 is my oldest external drive, an 80 GB WD drive in an iomega case,
and I have yet to see any problems with it after nearly ten years of
mostly around-the-clock service.  The drives showing the errors I've
described in this thread are all connected via Connectland 4-port USB
3.0 hubs.
     I have some other ZFS questions, but this posting is very long
already, so I'll post them in a separate thread.
     Well, thank you very much for your reply.  I appreciate the
helpful information and perspectives from your actual experiences.
There are some capabilities that I would very much like to see added to
ZFS in the future, but I think I can live with what it can already do
right now, at least for a few years.  The protection against data
corruption, especially of the silent type, is something I really,
really want, and none of the standard RAID versions seems to offer it,
so I guess I'll have to go with raidz and deal with the performance hit
and the lack of a "grow" command for raidz for now.


                                  Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet:   bennett at sdf.org   *xor*   bennett at freeshell.org  *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good  *
* objection to the introduction of that bane of all free governments *
* -- a standing army."                                               *
*    -- Gov. John Hancock, New York Journal, 28 January 1790         *
**********************************************************************