Date: Wed, 18 May 2011 08:13:13 +0200
From: Per von Zweigbergk <pvz@itassistans.se>
To: freebsd-fs@freebsd.org
Subject: HAST + ZFS self healing? Hot spares?
Message-ID: <85EC77D3-116E-43B0-BFF1-AE1BD71B5CE9@itassistans.se>
I've been investigating HAST as a possibility for adding synchronous replication and failover to a set of two NFS servers backed by ZFS. The servers themselves contain quite a few disks. 20 of them (7200 RPM SAS disks), to be exact. (If I didn't lose count again...) Plus two quick but small SSDs for ZIL and two not-as-quick but larger SSDs for L2ARC.

These machines weren't originally designed with synchronous replication in mind - they were designed to be NFS file servers (used as VMware data stores) backed by ZFS. They contain LSI MegaRAID 9260 controllers (as an aside, these were perhaps not the best choice for ZFS since they lack a true JBOD mode; I have worked around this by making single-disk RAID-0 arrays, and then using those single-disk arrays to make up the zpool).

Now, I've been considering making an active/passive (or possibly active/passive + passive/active) synchronously replicated pair of servers out of these, and my eyes fall on HAST.

Initially, my thoughts land on simply creating HAST resources for the corresponding pairs of disks and SSDs in servers A and B, and then using these HAST resources to make up the ZFS pool.
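Concretely, I'm picturing something along these lines. (This is just an untested sketch; the hostnames filer-a/filer-b, the mfid device paths and the resource names are placeholders, not what the boxes are actually called.)

    # /etc/hast.conf - one resource per corresponding disk pair; repeated
    # for disk1..disk19 and for the ZIL and L2ARC SSDs
    resource disk0 {
            on filer-a {
                    local /dev/mfid0
                    remote filer-b
            }
            on filer-b {
                    local /dev/mfid0
                    remote filer-a
            }
    }

and then, on whichever node currently holds the primary role for everything:

    # after "hastctl create <resource>" for each resource on both nodes,
    # and with hastd running on both:
    hastctl role primary all
    zpool create tank \
        /dev/hast/disk0 /dev/hast/disk1 /dev/hast/disk2 /dev/hast/disk3 \
        log mirror /dev/hast/zil0 /dev/hast/zil1 \
        cache /dev/hast/l2arc0 /dev/hast/l2arc1

(with the remaining disk4..disk19 resources listed as well, left out here for brevity).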
But this raises two questions:

---

1. Hardware failure management. In case of a hardware failure, I'm not exactly sure what will happen, but I suspect the single-disk RAID-0 array containing the failed disk will simply fail. I assume it will still exist, but refuse to be read or written. In this situation I understand HAST will handle this by routing all I/O to the secondary server, in case the disk on the primary side dies, or simply by cutting off replication if the disk on the secondary server fails.

I have not seen any "hot spare" mechanism in HAST, but I would think that I could edit the cluster configuration file to manually configure a hot spare in case I receive an alert. Would I have to restart all of hastd to do this, though? Or is it sufficient to bring the resource into init and back into secondary using hastctl? (The kind of dance I have in mind is sketched at the end of this mail.)

Of course, it may just be infinitely simpler to configure spares on the ZFS level, keep entire spare HAST resources, and just do a zpool replace, replacing an entire array of two disks whenever one of the disks in an array fails. Still, it would be good to know what I can reconfigure on-the-fly with HAST itself.

---

2. ZFS self-healing. As far as I understand it, ZFS does self-healing, in that all data is checksummed, and if one disk in a mirror happens to contain corrupted data, ZFS will re-read the same data from the other disk in the ZFS mirror. I don't see any way this could work in a configuration where ZFS is not mirroring itself, but rather running on top of HAST. Am I wrong about this? Or is there some way to achieve this same self-healing effect while running on top of HAST?

---

So, what is it, do I have to give up ZFS's self-healing (one of the really neat features in ZFS) if I go for HAST? Of course, I could mirror the drives first with HAST, and then mirror the two HAST mirrors using a ZFS mirror, but that would be wasteful and a little silly. I might even be able to get away with using "copies=2" in this scenario. Or I could use raid-z on top of the mirrors, wasting less disk, but causing a performance hit.

I mean, ideally, ZFS would have a really neat synchronous replication feature built into it. Or ZFS could be HAST-aware, and know how to ask HAST to bring it a copy of a block of data on the remote block device in a HAST mirror in case the checksum on the local block device doesn't match. Or HAST could itself have some kind of block-level checksums, and do self-healing itself. (This would probably be the easiest to implement. The secondary site could even continually be reading the same data as the primary site is, merely to check the checksums on disk, not to send it over the wire. It's not like it's doing anything else useful with that untapped read performance.)

So, what's the current state of solving this problem? Is there any work being done in this area? Have I overlooked some technology I might use to achieve this goal?
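For reference, the kind of manual spare-swap I'm wondering about in question 1 would look roughly like this (untested; "disk7" as the failed resource and "spare0" as a standing-by spare resource are made-up names):

    # on the node whose disk died:
    hastctl role init disk7            # take the broken resource out of service
    # edit /etc/hast.conf so that disk7's "local" points at the replacement disk
    # (this is the part I'm unsure about: restart hastd, or can it pick this up on the fly?)
    hastctl create disk7               # re-initialise HAST metadata on the new disk
    hastctl role secondary disk7       # or primary, depending on which node this is

versus the ZFS-level alternative of keeping whole spare HAST resources around:

    zpool add tank spare /dev/hast/spare0
    zpool replace tank /dev/hast/disk7 /dev/hast/spare0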