Date:      Thu, 30 Jun 2016 23:35:49 +0200
From:      Ben RUBSON <ben.rubson@gmail.com>
To:        freebsd-fs@freebsd.org
Subject:   Re: HAST + ZFS + NFS + CARP
Message-ID:  <50BF1AEF-3ECC-4C30-B8E1-678E02735BB5@gmail.com>
In-Reply-To: <20160630163541.GC5695@mordor.lan>
References:  <20160630144546.GB99997@mordor.lan> <71b8da1e-acb2-9d4e-5d11-20695aa5274a@internetx.com> <AD42D8FD-D07B-454E-B79D-028C1EC57381@gmail.com> <20160630153747.GB5695@mordor.lan> <63C07474-BDD5-42AA-BF4A-85A0E04D3CC2@gmail.com> <20160630163541.GC5695@mordor.lan>


> On 30 Jun 2016, at 18:35, Julien Cigar <julien@perdition.city> wrote:
>
> On Thu, Jun 30, 2016 at 05:42:04PM +0200, Ben RUBSON wrote:
>>
>>
>>> On 30 Jun 2016, at 17:37, Julien Cigar <julien@perdition.city> wrote:
>>>
>>>> On Thu, Jun 30, 2016 at 05:28:41PM +0200, Ben RUBSON wrote:
>>>>
>>>>> On 30 Jun 2016, at 17:14, InterNetX - Juergen Gotteswinter <jg@internetx.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> On 30.06.2016 at 16:45, Julien Cigar wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I'm still in the process of setting up redundant low-cost storage for
>>>>>> our (small, ~30 people) team here.
>>>>>>
>>>>>> I read quite a lot of articles/documentation/etc. and I plan to use HAST
>>>>>> with ZFS for the storage, CARP for the failover and the "good old NFS"
>>>>>> to mount the shares on the clients.
>>>>>>
>>>>>> The hardware is 2x HP ProLiant DL20 boxes with 2 dedicated disks for the
>>>>>> shared storage.
>>>>>>
>>>>>> Assuming the following configuration:
>>>>>> - MASTER is the active node and BACKUP is the standby node.
>>>>>> - two disks in each machine: ada0 and ada1.
>>>>>> - two interfaces in each machine: em0 and em1
>>>>>> - em0 is the primary interface (with CARP setup)
>>>>>> - em1 is dedicated to the HAST traffic (crossover cable)
>>>>>> - FreeBSD is properly installed in each machine.
>>>>>> - a HAST resource "disk0" for ada0p2.
>>>>>> - a HAST resource "disk1" for ada1p2.
>>>>>> - a zpool create zhast mirror /dev/hast/disk0 /dev/hast/disk1 is created
>>>>>> on MASTER
>>>>>>
>>>>>> A couple of questions I am still wondering about:
>>>>>> - If a disk dies on the MASTER I guess that zpool will not see it and
>>>>>> will transparently use the one on BACKUP through the HAST resource...
>>>>>
>>>>> that's right, as long as writes to $anything have been successful, HAST is
>>>>> happy and won't start whining
>>>>>
>>>>>> is it a problem?
>>>>>
>>>>> imho yes, at least from a management point of view
>>>>>
>>>>>> could this lead to some corruption?
>>>>>
>>>>> probably, I never heard of anyone who has used that in production for a
>>>>> long time
>>>>>
>>>>>> At this stage the
>>>>>> common sense would be to replace the disk quickly, but imagine the
>>>>>> worst case scenario where ada1 on MASTER dies, zpool will not see it
>>>>>> and will transparently use the one from the BACKUP node (through the
>>>>>> "disk1" HAST resource), later ada0 on MASTER dies, zpool will not
>>>>>> see it and will transparently use the one from the BACKUP node
>>>>>> (through the "disk0" HAST resource). At this point on MASTER the two
>>>>>> disks are broken but the pool is still considered healthy ... What if
>>>>>> after that we unplug the em0 network cable on BACKUP? Storage is
>>>>>> down...
>>>>>> - Under heavy I/O the MASTER box suddenly dies (for some reason);
>>>>>> thanks to CARP the BACKUP node will switch from standby -> active and
>>>>>> execute the failover script which does some "hastctl role primary" for
>>>>>> the resources and a zpool import. I wondered if there are any
>>>>>> situations where the pool couldn't be imported (= data corruption)?
>>>>>> For example what if the pool hasn't been exported on the MASTER before
>>>>>> it dies?
>>>>>> - Is it a problem if the NFS daemons are started at boot on the standby
>>>>>> node, or should they only be started in the failover script? What
>>>>>> about stale files and active connections on the clients?
>>>>>
>>>>> sometimes stale mounts recover, sometimes not, sometimes clients even
>>>>> need reboots
>>>>>
>>>>>> - A catastrophic power failure occurs and MASTER and BACKUP are suddenly
>>>>>> powered down. Later the power returns, is it possible that some
>>>>>> problem occurs (split-brain scenario?) regarding the order in which the
>>>>>
>>>>> sure, you need an exact procedure to recover
>>>>>
>>>>>> two machines boot up?
>>>>>
>>>>> best practice should be to keep everything down after boot
>>>>>
>>>>>> - Other things I have not thought of?
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Thanks!
>>>>>> Julien
>>>>>>
>>>>>
>>>>>
>>>>> imho:
>>>>>
>>>>> leave hast where it is, go for zfs replication. will save your butt,
>>>>> sooner or later, if you avoid this fragile combination
>>>>
>>>> I was also replying, and finishing with this:
>>>> Why don't you set your slave as an iSCSI target and simply do ZFS
>>>> mirroring?
>>>
>>> Yes that's another option, so a zpool with two mirrors (local +
>>> exported iSCSI)?
>>
>> Yes, you would then have a real-time replication solution (as with HAST),
>> compared to ZFS send/receive which is not.
>> Depends on what you need :)
>
> More a real-time replication solution in fact ... :)
> Do you have any resource which summarizes all the pro(s) and con(s) of HAST
> vs iSCSI? I have found a lot of articles on ZFS + HAST but not that much
> with ZFS + iSCSI...

# No resources, but some ideas:

- ZFS likes to see all the details of its underlying disks, which is possible
with local disks (of course) and with iSCSI disks, but not with HAST.
- The iSCSI solution is simpler: you only have ZFS to manage, and the
replication is done by ZFS itself, not by an additional stack.
- HAST does not seem to be really maintained (I may be wrong), at least
compared to DRBD, which HAST seems to be inspired by.
- You do not have to cross your fingers when you promote your slave to
master ("will ZFS be happy with my HAST-replicated disks?"): ZFS mirrored
the data itself, you only have to import [-f] (a rough sketch follows below).
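
To make that concrete, here is a minimal sketch of the iSCSI variant,
following the disk layout quoted above. The pool name, and the da0/da1
device names (how I assume the BACKUP targets would show up on MASTER),
are only placeholders:

  # on MASTER: ada0p2/ada1p2 are local, da0/da1 are the iSCSI disks from BACKUP
  zpool create zmirror mirror /dev/ada0p2 /dev/da0 mirror /dev/ada1p2 /dev/da1

  # on failover, on BACKUP (which holds its own local copies of both disks)
  zpool import -f zmirror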

- (Auto)reconnection of iSCSI may not be as simple as with HAST; iSCSI could
require more administration after a disconnection. But this could easily be
handled by a script (a rough sketch follows below).
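
Something along these lines, for example (the IQN, the portal address and
the check itself are only illustrative assumptions, not a tested recipe):

  #!/bin/sh
  # re-establish the iSCSI session to BACKUP if it is no longer connected
  TARGET="iqn.2016-06.org.example:backup:disk0"   # placeholder IQN
  PORTAL="192.168.2.2"                            # placeholder BACKUP address on the crossover link

  if ! iscsictl -L | grep "$TARGET" | grep -q "Connected"; then
      iscsictl -R -t "$TARGET" 2>/dev/null        # drop the stale session, if any
      iscsictl -A -p "$PORTAL" -t "$TARGET"       # and add it again
  fi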

# Some "advices" based on my findings (I'm finishing my tests of such a =
solution) :

Write performance will suffer from network latency, but as long as your 2
nodes are in the same room, that should be OK.
If you are over a long-distance link, you may add several ms to each write
IO, which, depending on the use case, may be unacceptable, and ZFS may also
become unresponsive.
Max throughput is also more difficult to achieve over a high-latency link.

You will have to choose network cards depending on the number of disks and
their throughput.
For example, if you need to resilver a SATA disk (~180MB/s), then a simple
1Gb/s interface (~120MB/s) will be a serious bottleneck.
Think about scrubs too.

You will probably have to perform some network tuning (TCP window size, jumbo
frames...) to reach your maximum bandwidth.
Trying to saturate the network link with (for example) iperf before dealing
with iSCSI seems to be a good idea (see the example below).
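
For instance (interface names, addresses and buffer sizes are just examples
to adapt to your hardware):

  # jumbo frames on the dedicated replication link, on both nodes
  ifconfig em1 mtu 9000

  # larger socket buffers / TCP windows (example values)
  sysctl kern.ipc.maxsockbuf=16777216
  sysctl net.inet.tcp.sendbuf_max=16777216
  sysctl net.inet.tcp.recvbuf_max=16777216

  # then measure the raw link: 'iperf3 -s' on BACKUP, and on MASTER:
  iperf3 -c 192.168.2.2 -t 30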

Here are some interesting sysctls so that ZFS will not hang (too long) in
case of an unreachable iSCSI disk:
kern.iscsi.ping_timeout=5
kern.iscsi.iscsid_timeout=5
kern.iscsi.login_timeout=5
kern.iscsi.fail_on_disconnection=1
(adjust the 5 seconds depending on your needs / on your network quality).
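The whole kern.iscsi tree (and the current values) can be listed with:

  sysctl kern.iscsi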

Take care when you (auto)replace disks: you may replace an iSCSI disk with a
local disk, which of course would work but would be wrong in terms of
master/slave redundancy.
Use nice labels on your disks so that, if you have a lot of disks in your
pool, you quickly know which one is local and which one is remote (see the
example below).
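
For example with GPT labels (partition indexes and label names below are
just placeholders; whether the remote labels are visible on the initiator
depends on whether you export whole disks or only partitions):

  # on MASTER
  gpart modify -i 2 -l master-disk0 ada0
  gpart modify -i 2 -l master-disk1 ada1
  # on BACKUP
  gpart modify -i 2 -l backup-disk0 ada0
  gpart modify -i 2 -l backup-disk1 ada1

Building the pool on the /dev/gpt/* names rather than on the raw ada/da
devices then makes "zpool status" self-explanatory.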

# send/receive pro(s):

In terms of data safety, one of the benefits of ZFS send/receive is that you
have a totally different target pool, which can be valuable if you ever have
a disaster with your primary pool.
As a 3rd-node solution? On another site? (as send/receive does not suffer
from latency the way iSCSI would)
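
A minimal sketch of such a periodic replication to a third box (host, pool
and snapshot names are of course only placeholders):

  zfs snapshot -r zmirror@2016-06-30
  zfs send -R -i zmirror@2016-06-29 zmirror@2016-06-30 | \
      ssh thirdnode zfs receive -F backup/zmirror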

>>>> ZFS would then know as soon as a disk is failing.
>>>> And if the master fails, you only have to import (-f certainly, in
>>>> case of a master power failure) on the slave.
>>>>
>>>> Ben


