Date: Fri, 20 May 2011 02:11:06 +0200
From: Per von Zweigbergk <pvz@itassistans.se>
To: Per von Zweigbergk <pvz@itassistans.se>
Cc: freebsd-fs@freebsd.org, Pawel Jakub Dawidek <pjd@freebsd.org>
Subject: Re: HAST + ZFS self healing? Hot spares?
Message-ID: <61D2B7A3-1778-4A42-8983-8C325D2F849E@itassistans.se>
In-Reply-To: <5B27EAAB-5D23-4844-B7C7-F83289BCABE7@itassistans.se>
References: <85EC77D3-116E-43B0-BFF1-AE1BD71B5CE9@itassistans.se> <20110519181436.GB2100@garage.freebsd.pl> <4DD5A1CF.70807@itassistans.se> <20110519230921.GF2100@garage.freebsd.pl> <BANLkTi=1psNnEOFxD1YEmuNAHRDyXBdBfw@mail.gmail.com> <5B27EAAB-5D23-4844-B7C7-F83289BCABE7@itassistans.se>
On 20 May 2011, at 01:27, Per von Zweigbergk wrote:

> You're describing taking the entire array offline while you perform work on it.

My apologies, I was a bit too quick reading what you (Freddie Cash) wrote.

What you're describing is relying on ZFS's own redundancy while you replace the failed disk, bringing down the entire HAST resource just to replace one of its two disks. The only reason the ZFS array continues to function is that it is redundant in ZFS itself.

Ideally, the HAST resource could remain operational while the failed disk was replaced. After all, it can remain operational while the primary disk has failed, and it can remain operational while the data is being resynchronized, so why would the resource need to be brought down just to transition between these two states?

I guess it's because HAST isn't quite "finished" yet feature-wise, and that particular feature does not yet exist.

Still, I suppose this is good enough. It shows that combining a bunch of HAST mirrors in a raidz solves one and a half of my operational problems: replacing failed drives (by momentarily downing the whole HAST resource while the work is being done) and providing checksumming capability (although not self-healing).

The setup described (a bunch of HAST mirrors in a raidz) will not self-heal entirely. Imagine a bit error occurring while writing to one of the secondary disks. Since that data is never read by ZFS or HAST, the error would remain undetected. To ensure data integrity on both the primary and secondary servers, you'd have to fail over the servers once every N days/weeks/months (depending on your operational requirements) and perform a zfs scrub on "both sides" of the HAST resource as part of regular maintenance. It'd probably even be scriptable, assuming you can live with a few seconds of scheduled downtime during the switchover.
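The failover-and-scrub maintenance described above could be sketched roughly like this. This is an untested outline, not a recipe: the pool name ("tank"), the HAST resource names ("disk0", "disk1") and the peer hostname ("peer") are all assumptions, and a real script would need error handling plus coordination with whatever failover mechanism (CARP, devd hooks, etc.) the cluster already uses. Setting DRY_RUN=1 only prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch: switch a raidz-of-HAST-mirrors over to the secondary node and
# scrub the pool there, so latent write errors on the secondary disks get
# detected. All names here are hypothetical placeholders.
# Set DRY_RUN=1 to print the commands instead of executing them.

: "${DRY_RUN:=0}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# scrub_on_peer <pool> <hast-resource> ...
scrub_on_peer() {
    pool="$1"; shift

    # Stop using the pool locally and demote our HAST resources.
    run zpool export "$pool"
    for res in "$@"; do
        run hastctl role secondary "$res"
    done

    # On the peer, promote the resources, import the pool and scrub it.
    # (ssh to the peer is an assumption; any remote-exec channel works.)
    run ssh peer "hastctl role primary $* && zpool import $pool && zpool scrub $pool"
}
```

Invoked as e.g. `scrub_on_peer tank disk0 disk1` from periodic(8) or cron on whatever schedule your requirements dictate; failing back and scrubbing the primary side afterwards is the mirror image of the same steps.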