Date: Fri, 1 Jul 2016 10:54:27 -0700
From: Jordan Hubbard <jkh@ixsystems.com>
To: Julien Cigar <julien@perdition.city>
Cc: Chris Watson <bsdunix44@gmail.com>, freebsd-fs@freebsd.org
Subject: Re: HAST + ZFS + NFS + CARP
Message-ID: <FD296976-0250-4DA7-BB56-68F43B62C19B@ixsystems.com>
In-Reply-To: <20160630185701.GD5695@mordor.lan>
References: <20160630144546.GB99997@mordor.lan> <71b8da1e-acb2-9d4e-5d11-20695aa5274a@internetx.com> <AD42D8FD-D07B-454E-B79D-028C1EC57381@gmail.com> <20160630153747.GB5695@mordor.lan> <63C07474-BDD5-42AA-BF4A-85A0E04D3CC2@gmail.com> <678321AB-A9F7-4890-A8C7-E20DFDC69137@gmail.com> <20160630185701.GD5695@mordor.lan>
> On Jun 30, 2016, at 11:57 AM, Julien Cigar <julien@perdition.city> wrote:
>
> It would be more than welcome indeed..! I have the feeling that HAST
> isn't that much used (but maybe I am wrong) and it's difficult to find
> information on its reliability and concrete long-term use cases...

This has been a long discussion so I'm not even sure where the right place to jump in is, but just speaking as a storage vendor (FreeNAS) I'll say that we've considered HAST many times and rejected it every time, for multiple reasons:

1. Blocks which ZFS finds to be corrupt (they fail their checksums) get replicated by HAST nonetheless, since HAST sits below that layer and has no idea. This means that both good data and corrupt data are replicated to the other pool. That isn't a fatal flaw, but it's a lot nicer to replicate only *good* data at a higher layer (the first sketch at the end of this message illustrates the difference).

2. When HAST systems go split-brain, it's apparently hilarious. I don't have any production experience with that, so I can't speak authoritatively about it, but the split-brain scenario has been mentioned by some of the folks working on clustered filesystems (glusterfs, ceph, etc.), and I can easily imagine how it might cause hilarity: ZFS has no idea its underlying block store is being replicated, and it likes to commit changes in transaction groups (TXGs), not individual block writes. Writing a partial TXG (or potentially multiple outstanding TXGs with varying degrees of completion) would Be Bad (the second sketch below walks through why).

3. HAST only works on a pair of machines in a MASTER/SLAVE relationship, which is pretty ghetto by today's standards. HDFS (Hadoop's filesystem) can do block replication across multiple nodes, as can DRBD (Distributed Replicated Block Device), so chasing HAST seems pretty retro and will immediately set you up for embarrassment when the inevitable question comes up: "OK, that pair of nodes is fine, but I'd like them both to be active, and I'd also like to add a 3rd node in this one scenario where I want even more fault-tolerance - other folks can do that, how about you?"

In short, the whole thing sounds kind of MEH, and that's why we've avoided putting any real time or energy into HAST. DRBD sounds much more interesting, though of course it's Linux-only; that wouldn't stop someone else from implementing a similar scheme in a clean-room fashion, of course.

And yes, one can of course layer additional things on top of iSCSI LUNs, just as one can punch through LUNs from older SAN fabrics and put ZFS pools on top of them (been there, done both of those things), though the additional indirection has performance and debugging ramifications of its own: when a pool goes sideways, you have more links in the failure chain to debug. ZFS really likes to "own the disks" - that is how it provides block-level fault tolerance and predictable performance characteristics for specific vdev topologies - and once you start abstracting the disks away from it, predicting IOPS for the pool becomes something of a "???" exercise.

- Jordan
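A toy Python sketch of point 1 (nothing here is real HAST or ZFS code; the Block layout, the checksum field, and both replicate functions are invented purely for illustration). A replicator below the filesystem copies whatever bytes it sees, checksum failures included, while a replicator that knows the checksums can refuse to ship corrupt blocks:

    import hashlib
    from dataclasses import dataclass

    @dataclass
    class Block:
        data: bytes
        checksum: str  # recorded when the block was first written

    def sha256(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def replicate_block_level(blocks, mirror):
        # HAST-style: below the filesystem, so every block is copied
        # verbatim - this layer cannot tell good data from bit rot.
        for i, blk in enumerate(blocks):
            mirror[i] = blk

    def replicate_fs_level(blocks, mirror):
        # Higher-layer style: the replicator knows each block's
        # checksum, so only blocks that still verify get shipped.
        for i, blk in enumerate(blocks):
            if sha256(blk.data) == blk.checksum:
                mirror[i] = blk
            else:
                print("block %d fails its checksum; not replicating" % i)

    good = Block(b"hello", sha256(b"hello"))
    rotten = Block(b"hell0", sha256(b"hello"))   # bit rot after the write
    mirror_a, mirror_b = {}, {}
    replicate_block_level([good, rotten], mirror_a)  # copies both, rot included
    replicate_fs_level([good, rotten], mirror_b)     # copies only the good block

This is essentially why replicating above the pool (where checksums are visible, as zfs send does when it reads blocks) is more comfortable than replicating below it.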
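And a toy sketch of the TXG point (again invented for illustration; the real ZFS transaction-group machinery is far more involved): a consistent replica has to apply a TXG entirely or not at all, so a block-level replication stream cut off mid-TXG leaves the replica in a state that never existed on the primary:

    # Toy model of ZFS-style transaction groups (TXGs): a TXG is a batch
    # of writes that must become visible atomically. Applying only part
    # of one produces a state no TXG ever described.

    def apply_txg_atomically(state, txg):
        # Stage every write, then swap the whole thing in: all or nothing.
        staged = dict(state)
        for block, value in txg:
            staged[block] = value
        return staged

    def apply_txg_blockwise(state, txg, cutoff):
        # What a block-level replica is left with if replication dies
        # after 'cutoff' of the TXG's writes have gone over the wire.
        for block, value in txg[:cutoff]:
            state[block] = value
        return state

    # One logical update spread over two blocks: move a file between dirs.
    before = {"dir_a": "has file", "dir_b": "empty"}
    txg = [("dir_a", "empty"), ("dir_b", "has file")]

    print(apply_txg_atomically(dict(before), txg))
    # {'dir_a': 'empty', 'dir_b': 'has file'}  - consistent

    print(apply_txg_blockwise(dict(before), txg, cutoff=1))
    # {'dir_a': 'empty', 'dir_b': 'empty'}     - the file vanished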