From: Jordan Hubbard <jkh@ixsystems.com>
Subject: Re: HAST + ZFS + NFS + CARP
Date: Fri, 1 Jul 2016 10:54:27 -0700
To: Julien Cigar
Cc: Chris Watson, freebsd-fs@freebsd.org
In-Reply-To: <20160630185701.GD5695@mordor.lan>

> On Jun 30, 2016, at 11:57 AM, Julien Cigar wrote:
>
> It would be more than welcome indeed..! I have the feeling that HAST
> isn't that much used (but maybe I am wrong) and it's difficult to find
> information on its reliability and concrete long-term use cases...
This has been a long discussion so I'm not even sure where the right place to jump in is, but just speaking as a storage vendor (FreeNAS) I'll say that we've considered HAST many times but also rejected it many times, for multiple reasons:

1. Blocks which are found to be corrupt by ZFS (fail checksum) get replicated by HAST nonetheless, since it has no idea - it's below that layer. This means that both good data and corrupt data are replicated to the other pool, which isn't a fatal flaw, but it's a lot nicer to be replicating only *good* data at a higher layer.

2. When HAST systems go split-brain, it's apparently hilarious. I don't have any experience with that in production so I can't speak authoritatively about it, but the split-brain scenario has been mentioned by some of the folks who are working on clustered filesystems (glusterfs, ceph, etc.), and I can easily imagine how that might cause hilarity, given that ZFS has no idea its underlying block store is being replicated and also likes to commit changes in terms of transactions (TXGs), not just individual block writes, and writing a partial TXG (or potentially multiple outstanding TXGs with varying degrees of completion) would Be Bad.

3. HAST only works on a pair of machines with a MASTER/SLAVE relationship, which is pretty ghetto by today's standards. HDFS (Hadoop's filesystem) can do block replication across multiple nodes, as can DRBD (Distributed Replicated Block Device), so chasing HAST seems pretty retro and will immediately set you up for embarrassment when the inevitable "OK, that pair of nodes is fine, but I'd like them both to be active and I'd also like to add a 3rd node in this one scenario where I want even more fault-tolerance - other folks can do that, how about you?" question comes up.

In short, the whole thing sounds kind of MEH and that's why we've avoided putting any real time or energy into HAST. DRBD sounds much more interesting, though of course it's Linux-only. This wouldn't stop someone else from implementing a similar scheme in a clean-room fashion, of course.

And yes, of course one can layer additional things on top of iSCSI LUNs, just as one can punch through LUNs from older SAN fabrics and put ZFS pools on top of them (been there, done both of those things), though of course the additional indirection has performance and debugging ramifications of its own (when a pool goes sideways, you have additional things in the failure chain to debug). ZFS really likes to "own the disks" in terms of providing block-level fault tolerance and predictable performance characteristics given specific vdev topologies, and once you start abstracting the disks away from it, making statements about predicted IOPs for the pool becomes something of a "???" exercise.

- Jordan
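For readers who want to see the first point concretely, here is a toy sketch in Python. It is neither HAST nor ZFS code and every name in it is made up; it only illustrates why a replicator sitting below the checksum layer forwards corrupt blocks verbatim, while anything sitting above that layer can skip blocks that no longer verify:

    # Toy sketch only: made-up names, not HAST or ZFS internals.
    import zlib

    def write_block(pool, addr, data):
        # Store the payload together with the checksum a parent block would keep.
        pool[addr] = (data, zlib.crc32(data))

    def block_level_replicate(src, dst):
        # HAST-style: raw blocks are copied as-is; checksums are never consulted.
        for addr, (data, cksum) in src.items():
            dst[addr] = (data, cksum)

    def checksum_aware_replicate(src, dst):
        # Higher-layer style: only blocks that still verify get shipped.
        for addr, (data, cksum) in src.items():
            if zlib.crc32(data) == cksum:
                dst[addr] = (data, cksum)

    primary = {}
    write_block(primary, 0, b"good data")
    write_block(primary, 1, b"more good data")
    primary[1] = (b"bit-rotted", primary[1][1])  # simulate silent on-disk corruption

    mirror_a, mirror_b = {}, {}
    block_level_replicate(primary, mirror_a)     # mirror_a now holds the corrupt block too
    checksum_aware_replicate(primary, mirror_b)  # mirror_b only received the verified block

In the sketch, block_level_replicate faithfully mirrors the bit-rotted block because it never sees the checksum, while checksum_aware_replicate drops it - which is the "replicating only *good* data at a higher layer" behaviour described in point 1 above.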