From: Jordan Hubbard <jkh@ixsystems.com>
Subject: Re: HAST + ZFS + NFS + CARP
Date: Fri, 1 Jul 2016 10:54:27 -0700
To: Julien Cigar
Cc: Chris Watson, freebsd-fs@freebsd.org
In-Reply-To: <20160630185701.GD5695@mordor.lan>

> On Jun 30, 2016, at 11:57 AM, Julien Cigar wrote:
>
> It would be more than welcome indeed..! I have the feeling that HAST
> isn't that much used (but maybe I am wrong) and it's difficult to find
> information on its reliability and concrete long-term use cases...
This has been a long discussion so I'm not even sure where the right place to jump in is, but just speaking as a storage vendor (FreeNAS) I'll say that we've considered HAST many times but also rejected it many times, for multiple reasons:

1. Blocks which are found to be corrupt by ZFS (fail checksum) get replicated by HAST nonetheless, since it has no idea - it's below that layer. This means that both good data and corrupt data are replicated to the other pool, which isn't a fatal flaw, but it's a lot nicer to be replicating only *good* data at a higher layer.

2. When HAST systems go split-brain, it's apparently hilarious. I don't have any experience with that in production so I can't speak authoritatively about it, but the split-brain scenario has been mentioned by some of the folks who are working on clustered filesystems (glusterfs, ceph, etc.), and I can easily imagine how that might cause hilarity, given that ZFS has no idea its underlying block store is being replicated and also likes to commit changes in terms of transactions (TXGs), not just individual block writes, and writing a partial TXG (or potentially multiple outstanding TXGs with varying degrees of completion) would Be Bad.

3. HAST only works on a pair of machines with a MASTER/SLAVE relationship, which is pretty ghetto by today's standards. HDFS (Hadoop's filesystem) can do block replication across multiple nodes, as can DRBD (Distributed Replicated Block Device), so chasing HAST seems pretty retro and will immediately set you up for embarrassment when the inevitable "OK, that pair of nodes is fine, but I'd like them both to be active and I'd also like to add a 3rd node in this one scenario where I want even more fault-tolerance - other folks can do that, how about you?" question comes up.

In short, the whole thing sounds kind of MEH and that's why we've avoided putting any real time or energy into HAST. DRBD sounds much more interesting, though of course it's Linux-only. This wouldn't stop someone else from implementing a similar scheme in a clean-room fashion, of course.

And yes, of course one can layer additional things on top of iSCSI LUNs, just as one can punch through LUNs from older SAN fabrics and put ZFS pools on top of them (been there, done both of those things), though of course the additional indirection has performance and debugging ramifications of its own (when a pool goes sideways, you have additional things in the failure chain to debug). ZFS really likes to "own the disks" in terms of providing block-level fault tolerance and predictable performance characteristics given specific vdev topologies, and once you start abstracting the disks away from it, making statements about predicted IOPs for the pool becomes something of a "???" exercise.

- Jordan
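For readers who want to see the first point concretely, here is a toy sketch in Python. It is neither HAST nor ZFS code and every name in it is made up; it only illustrates why a replicator sitting below the checksum layer forwards corrupt blocks verbatim, while anything sitting above that layer can skip blocks that no longer verify:

    # Toy sketch only: made-up names, not HAST or ZFS internals.
    import zlib

    def write_block(pool, addr, data):
        # Store the payload together with the checksum a parent block would keep.
        pool[addr] = (data, zlib.crc32(data))

    def block_level_replicate(src, dst):
        # HAST-style: raw blocks are copied as-is; checksums are never consulted.
        for addr, (data, cksum) in src.items():
            dst[addr] = (data, cksum)

    def checksum_aware_replicate(src, dst):
        # Higher-layer style: only blocks that still verify get shipped.
        for addr, (data, cksum) in src.items():
            if zlib.crc32(data) == cksum:
                dst[addr] = (data, cksum)

    primary = {}
    write_block(primary, 0, b"good data")
    write_block(primary, 1, b"more good data")
    primary[1] = (b"bit-rotted", primary[1][1])  # simulate silent on-disk corruption

    mirror_a, mirror_b = {}, {}
    block_level_replicate(primary, mirror_a)     # mirror_a now holds the corrupt block too
    checksum_aware_replicate(primary, mirror_b)  # mirror_b only received the verified block

In the sketch, block_level_replicate faithfully mirrors the bit-rotted block because it never sees the checksum, while checksum_aware_replicate drops it - which is the "replicating only *good* data at a higher layer" behaviour described in point 1 above.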