Date: Fri, 1 Jul 2016 10:54:27 -0700
From: Jordan Hubbard <jkh@ixsystems.com>
To: Julien Cigar <julien@perdition.city>
Cc: Chris Watson <bsdunix44@gmail.com>, freebsd-fs@freebsd.org
Subject: Re: HAST + ZFS + NFS + CARP
Message-ID: <FD296976-0250-4DA7-BB56-68F43B62C19B@ixsystems.com>
In-Reply-To: <20160630185701.GD5695@mordor.lan>
References: <20160630144546.GB99997@mordor.lan> <71b8da1e-acb2-9d4e-5d11-20695aa5274a@internetx.com> <AD42D8FD-D07B-454E-B79D-028C1EC57381@gmail.com> <20160630153747.GB5695@mordor.lan> <63C07474-BDD5-42AA-BF4A-85A0E04D3CC2@gmail.com> <678321AB-A9F7-4890-A8C7-E20DFDC69137@gmail.com> <20160630185701.GD5695@mordor.lan>
> On Jun 30, 2016, at 11:57 AM, Julien Cigar <julien@perdition.city> wrote:
>
> It would be more than welcome indeed..! I have the feeling that HAST
> isn't that much used (but maybe I am wrong) and it's difficult to find
> information on its reliability and concrete long-term use cases...

This has been a long discussion so I'm not even sure where the right place to jump in is, but just speaking as a storage vendor (FreeNAS) I'll say that we've considered HAST many times and rejected it every time, for multiple reasons:

1. Blocks which ZFS finds to be corrupt (they fail their checksums) get replicated by HAST nonetheless, since HAST sits below that layer and has no idea. This means that both good data and corrupt data are replicated to the other pool. That isn't a fatal flaw, but it's a lot nicer to replicate only *good* data at a higher layer (the first sketch at the end of this message illustrates the difference).

2. When HAST systems go split-brain, it's apparently hilarious. I don't have any production experience with that, so I can't speak authoritatively about it, but the split-brain scenario has been mentioned by some of the folks working on clustered filesystems (glusterfs, ceph, etc.), and I can easily imagine how it might cause hilarity: ZFS has no idea its underlying block store is being replicated, and it likes to commit changes in transaction groups (TXGs), not individual block writes. Writing a partial TXG (or potentially multiple outstanding TXGs with varying degrees of completion) would Be Bad (the second sketch below walks through why).

3. HAST only works on a pair of machines in a MASTER/SLAVE relationship, which is pretty ghetto by today's standards. HDFS (Hadoop's filesystem) can do block replication across multiple nodes, as can DRBD (Distributed Replicated Block Device), so chasing HAST seems pretty retro and will immediately set you up for embarrassment when the inevitable question comes up: "OK, that pair of nodes is fine, but I'd like them both to be active, and I'd also like to add a 3rd node in this one scenario where I want even more fault-tolerance - other folks can do that, how about you?"

In short, the whole thing sounds kind of MEH, and that's why we've avoided putting any real time or energy into HAST. DRBD sounds much more interesting, though of course it's Linux-only; that wouldn't stop someone else from implementing a similar scheme in a clean-room fashion, of course.

And yes, one can of course layer additional things on top of iSCSI LUNs, just as one can punch through LUNs from older SAN fabrics and put ZFS pools on top of them (been there, done both of those things), though the additional indirection has performance and debugging ramifications of its own: when a pool goes sideways, you have more links in the failure chain to debug. ZFS really likes to "own the disks" - that is how it provides block-level fault tolerance and predictable performance characteristics for specific vdev topologies - and once you start abstracting the disks away from it, predicting IOPS for the pool becomes something of a "???" exercise.

- Jordan
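A toy Python sketch of point 1 (nothing here is real HAST or ZFS code; the Block layout, the checksum field, and both replicate functions are invented purely for illustration). A replicator below the filesystem copies whatever bytes it sees, checksum failures included, while a replicator that knows the checksums can refuse to ship corrupt blocks:

    import hashlib
    from dataclasses import dataclass

    @dataclass
    class Block:
        data: bytes
        checksum: str  # recorded when the block was first written

    def sha256(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def replicate_block_level(blocks, mirror):
        # HAST-style: below the filesystem, so every block is copied
        # verbatim - this layer cannot tell good data from bit rot.
        for i, blk in enumerate(blocks):
            mirror[i] = blk

    def replicate_fs_level(blocks, mirror):
        # Higher-layer style: the replicator knows each block's
        # checksum, so only blocks that still verify get shipped.
        for i, blk in enumerate(blocks):
            if sha256(blk.data) == blk.checksum:
                mirror[i] = blk
            else:
                print("block %d fails its checksum; not replicating" % i)

    good = Block(b"hello", sha256(b"hello"))
    rotten = Block(b"hell0", sha256(b"hello"))   # bit rot after the write
    mirror_a, mirror_b = {}, {}
    replicate_block_level([good, rotten], mirror_a)  # copies both, rot included
    replicate_fs_level([good, rotten], mirror_b)     # copies only the good block

This is essentially why replicating above the pool (where checksums are visible, as zfs send does when it reads blocks) is more comfortable than replicating below it.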
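And a toy sketch of the TXG point (again invented for illustration; the real ZFS transaction-group machinery is far more involved): a consistent replica has to apply a TXG entirely or not at all, so a block-level replication stream cut off mid-TXG leaves the replica in a state that never existed on the primary:

    # Toy model of ZFS-style transaction groups (TXGs): a TXG is a batch
    # of writes that must become visible atomically. Applying only part
    # of one produces a state no TXG ever described.

    def apply_txg_atomically(state, txg):
        # Stage every write, then swap the whole thing in: all or nothing.
        staged = dict(state)
        for block, value in txg:
            staged[block] = value
        return staged

    def apply_txg_blockwise(state, txg, cutoff):
        # What a block-level replica is left with if replication dies
        # after 'cutoff' of the TXG's writes have gone over the wire.
        for block, value in txg[:cutoff]:
            state[block] = value
        return state

    # One logical update spread over two blocks: move a file between dirs.
    before = {"dir_a": "has file", "dir_b": "empty"}
    txg = [("dir_a", "empty"), ("dir_b", "has file")]

    print(apply_txg_atomically(dict(before), txg))
    # {'dir_a': 'empty', 'dir_b': 'has file'}  - consistent

    print(apply_txg_blockwise(dict(before), txg, cutoff=1))
    # {'dir_a': 'empty', 'dir_b': 'empty'}     - the file vanished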