From owner-freebsd-fs@freebsd.org Thu Jun 30 21:35:54 2016
Subject: Re: HAST + ZFS + NFS + CARP
From: Ben RUBSON
In-Reply-To: <20160630163541.GC5695@mordor.lan>
Date: Thu, 30 Jun 2016 23:35:49 +0200
Message-Id: <50BF1AEF-3ECC-4C30-B8E1-678E02735BB5@gmail.com>
References: <20160630144546.GB99997@mordor.lan> <71b8da1e-acb2-9d4e-5d11-20695aa5274a@internetx.com> <20160630153747.GB5695@mordor.lan> <63C07474-BDD5-42AA-BF4A-85A0E04D3CC2@gmail.com> <20160630163541.GC5695@mordor.lan>
To: freebsd-fs@freebsd.org

> On 30 Jun 2016, at 18:35, Julien Cigar wrote:
> 
> On Thu, Jun 30, 2016 at 05:42:04PM +0200, Ben RUBSON wrote:
>> 
>>> On 30 Jun 2016, at 17:37, Julien Cigar wrote:
>>> 
>>>> On Thu, Jun 30, 2016 at 05:28:41PM +0200, Ben RUBSON wrote:
>>>> 
>>>>> On 30 Jun 2016, at 17:14, InterNetX - Juergen Gotteswinter wrote:
>>>>> 
>>>>>> On 30.06.2016 at 16:45, Julien Cigar wrote:
>>>>>> Hello,
>>>>>> 
>>>>>> I'm still in the process of setting up a redundant low-cost storage for
>>>>>> our (small, ~30 people) team here.
>>>>>> 
>>>>>> I read quite a lot of articles/documentation/etc and I plan to use HAST
>>>>>> with ZFS for the storage, CARP for the failover and the "good old NFS"
>>>>>> to mount the shares on the clients.
>>>>>> 
>>>>>> The hardware is 2x HP ProLiant DL20 boxes with 2 dedicated disks for the
>>>>>> shared storage.
>>>>>> 
>>>>>> Assuming the following configuration:
>>>>>> - MASTER is the active node and BACKUP is the standby node.
>>>>>> - two disks in each machine: ada0 and ada1.
>>>>>> - two interfaces in each machine: em0 and em1
>>>>>> - em0 is the primary interface (with CARP set up)
>>>>>> - em1 is dedicated to the HAST traffic (crossover cable)
>>>>>> - FreeBSD is properly installed on each machine.
>>>>>> - a HAST resource "disk0" for ada0p2.
>>>>>> - a HAST resource "disk1" for ada1p2.
>>>>>> - a "zpool create zhast mirror /dev/hast/disk0 /dev/hast/disk1" is created
>>>>>> on MASTER
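
[For readers following the thread, a minimal /etc/hast.conf matching the
layout described above could look like the sketch below; "master" / "backup"
and the em1 addresses are only placeholders for the actual hostnames and
crossover-link IPs:]

resource disk0 {
    on master {
        local /dev/ada0p2
        remote 172.16.0.2
    }
    on backup {
        local /dev/ada0p2
        remote 172.16.0.1
    }
}

resource disk1 {
    on master {
        local /dev/ada1p2
        remote 172.16.0.2
    }
    on backup {
        local /dev/ada1p2
        remote 172.16.0.1
    }
}

[Each resource then has to be initialized with "hastctl create disk0" /
"hastctl create disk1" on both nodes before the zpool is created on
/dev/hast/*.]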
>>>>>> 
>>>>>> A couple of questions I am still wondering about:
>>>>>> - If a disk dies on the MASTER I guess that zpool will not see it and
>>>>>> will transparently use the one on BACKUP through the HAST resource..
>>>>> 
>>>>> that's right, as long as writes on $anything have been successful HAST is
>>>>> happy and won't start whining
>>>>> 
>>>>>> is it a problem?
>>>>> 
>>>>> imho yes, at least from a management point of view
>>>>> 
>>>>>> could this lead to some corruption?
>>>>> 
>>>>> probably, I never heard about anyone who has used that for a long time in
>>>>> production
>>>>> 
>>>>>> At this stage the
>>>>>> common sense would be to replace the disk quickly, but imagine the
>>>>>> worst case scenario where ada1 on MASTER dies, zpool will not see it
>>>>>> and will transparently use the one from the BACKUP node (through the
>>>>>> "disk1" HAST resource), later ada0 on MASTER dies, zpool will not
>>>>>> see it and will transparently use the one from the BACKUP node
>>>>>> (through the "disk0" HAST resource). At this point on MASTER the two
>>>>>> disks are broken but the pool is still considered healthy ... What if
>>>>>> after that we unplug the em0 network cable on BACKUP? Storage is
>>>>>> down..
>>>>>> - Under heavy I/O the MASTER box suddenly dies (for some reason),
>>>>>> thanks to CARP the BACKUP node will switch from standby -> active and
>>>>>> execute the failover script which does some "hastctl role primary" for
>>>>>> the resources and a zpool import. I wondered if there are any
>>>>>> situations where the pool couldn't be imported (= data corruption)?
>>>>>> For example what if the pool hasn't been exported on the MASTER before
>>>>>> it dies?
>>>>>> - Is it a problem if the NFS daemons are started at boot on the standby
>>>>>> node, or should they only be started in the failover script? What
>>>>>> about stale files and active connections on the clients?
>>>>> 
>>>>> sometimes stale mounts recover, sometimes not, sometimes clients even
>>>>> need reboots
>>>>> 
>>>>>> - A catastrophic power failure occurs and MASTER and BACKUP are suddenly
>>>>>> powered down. Later the power returns, is it possible that some
>>>>>> problem occurs (split-brain scenario?) regarding the order in which the
>>>>> 
>>>>> sure, you need an exact procedure to recover
>>>>> 
>>>>>> two machines boot up?
>>>>> 
>>>>> best practice should be to keep everything down after boot
>>>>> 
>>>>>> - Other things I have not thought of?
>>>>>> 
>>>>>> Thanks!
>>>>>> Julien
>>>>> 
>>>>> imho:
>>>>> 
>>>>> leave HAST where it is, go for ZFS replication. will save your butt,
>>>>> sooner or later, if you avoid this fragile combination
>>>> 
>>>> I was also replying, and finishing with this:
>>>> Why don't you set your slave as an iSCSI target and simply do ZFS mirroring?
>>> 
>>> Yes that's another option, so a zpool with two mirrors (local +
>>> exported iSCSI)?
>> 
>> Yes, you would then have a real time replication solution (as with HAST),
>> compared to ZFS send/receive which is not.
>> Depends on what you need :)
> 
> More a real time replication solution in fact ... :)
> Do you have any resource which summarizes all the pro(s) and con(s) of HAST
> vs iSCSI? I have found a lot of articles on ZFS + HAST but not that much
> with ZFS + iSCSI ..

# No resources, but some ideas :

- ZFS likes to see all the details of its underlying disks, which is possible
with local disks (of course) and with iSCSI disks, not with HAST.
- The iSCSI solution is simpler, you only have ZFS to manage, your replication
is made by ZFS itself, not by an additional stack.
- HAST does not seem to be really maintained (I may be wrong), at least
compared to DRBD, which HAST seems to be inspired from.
- You do not have to cross your fingers when you promote your slave to master
("will ZFS be happy with my HAST-replicated disks?"), ZFS mirrored the data by
itself, you only have to import [-f].
- (Auto)reconnection of iSCSI may not be as simple as with HAST, iSCSI may
require more administration after a disconnection. But this can easily be done
by a script.
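
To make the iSCSI option more concrete, a minimal setup could look roughly
like this (IQNs, IP addresses, pool and device names below are only examples,
adjust them to your own layout).

On the BACKUP node, export the two disks with ctld, /etc/ctl.conf :

portal-group pg0 {
    discovery-auth-group no-authentication
    listen 172.16.0.2
}

target iqn.2016-06.lan.backup:disk0 {
    auth-group no-authentication
    portal-group pg0
    lun 0 {
        path /dev/ada0p2
    }
}

target iqn.2016-06.lan.backup:disk1 {
    auth-group no-authentication
    portal-group pg0
    lun 0 {
        path /dev/ada1p2
    }
}

# sysrc ctld_enable=YES && service ctld start

On the MASTER node, attach the targets and build the pool (the iSCSI LUNs
show up as new da devices) :

# sysrc iscsid_enable=YES && service iscsid start
# iscsictl -A -p 172.16.0.2 -t iqn.2016-06.lan.backup:disk0
# iscsictl -A -p 172.16.0.2 -t iqn.2016-06.lan.backup:disk1
# zpool create zhast mirror ada0p2 da0 mirror ada1p2 da1

(Declare the targets in /etc/iscsi.conf too if you want them reattached at
boot.)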
# Some advice, based on my findings (I'm finishing my tests of such a
solution) :

Write performance will suffer from network latency, but as long as your 2
nodes are in the same room, that should be OK.
If you are over a long-distance link, you may add several ms to each write IO,
which, depending on the use case, may be a problem, and ZFS may also become
unresponsive.
Max throughput is also more difficult to achieve over a high-latency link.

You will have to choose network cards depending on the number of disks and
their throughput.
For example, if you need to resilver a SATA disk (180MB/s), then a simple
1Gb interface (120MB/s) will be a serious bottleneck.
Think about scrubs too.

You may have to perform some network tuning (TCP window size, jumbo frames...)
to reach your max bandwidth.
Trying to saturate the network link with (for example) iPerf before dealing
with iSCSI seems to be a good idea.

Here are some interesting sysctls so that ZFS will not hang (too long) in
case of an unreachable iSCSI disk :
kern.iscsi.ping_timeout=5
kern.iscsi.iscsid_timeout=5
kern.iscsi.login_timeout=5
kern.iscsi.fail_on_disconnection=1
(adjust the 5 seconds depending on your needs / on your network quality).

Take care when you (auto)replace disks, you may replace an iSCSI disk with a
local disk, which of course would work but would be wrong in terms of
master/slave redundancy.
Use nice labels on your disks so that if you have a lot of disks in your pool,
you quickly know which one is local and which one is remote.

# send/receive pro(s) :

In terms of data safety, one of the interests of ZFS send/receive is that you
have a totally different target pool, which can be interesting if you ever
have a disaster with your primary pool.
As a 3rd-node solution? On another site? (as send/receive does not suffer
from latency the way iSCSI would)

>>>> ZFS would then know as soon as a disk is failing.
>>>> And if the master fails, you only have to import (-f certainly, in case
>>>> of a master power failure) on the slave.
>>>> 
>>>> Ben
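
PS : to illustrate the 3rd-node / send-receive idea above, a minimal
replication cycle could be as simple as this (pool, snapshot and host names
are only examples) :

# zfs snapshot -r zhast@repl-1
# zfs send -R zhast@repl-1 | ssh thirdnode zfs receive -duF backuppool
(and then, periodically, incremental updates :)
# zfs snapshot -r zhast@repl-2
# zfs send -R -i @repl-1 zhast@repl-2 | ssh thirdnode zfs receive -duF backuppool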