Date:      Fri, 1 Jul 2016 11:19:08 +0200
From:      Ben RUBSON <ben.rubson@gmail.com>
To:        freebsd-fs@freebsd.org
Subject:   Re: HAST + ZFS + NFS + CARP
Message-ID:  <26A31227-B71D-4854-B046-61CD3449E442@gmail.com>
In-Reply-To: <20160701084717.GE5695@mordor.lan>
References:  <20160630144546.GB99997@mordor.lan> <71b8da1e-acb2-9d4e-5d11-20695aa5274a@internetx.com> <AD42D8FD-D07B-454E-B79D-028C1EC57381@gmail.com> <20160630153747.GB5695@mordor.lan> <63C07474-BDD5-42AA-BF4A-85A0E04D3CC2@gmail.com> <20160630163541.GC5695@mordor.lan> <50BF1AEF-3ECC-4C30-B8E1-678E02735BB5@gmail.com> <20160701084717.GE5695@mordor.lan>


> On 01 Jul 2016, at 10:47, Julien Cigar <julien@perdition.city> wrote:
>
> On Thu, Jun 30, 2016 at 11:35:49PM +0200, Ben RUBSON wrote:
>>
>>> On 30 Jun 2016, at 18:35, Julien Cigar <julien@perdition.city> wrote:
>>>
>>> On Thu, Jun 30, 2016 at 05:42:04PM +0200, Ben RUBSON wrote:
>>>>
>>>>
>>>>> On 30 Jun 2016, at 17:37, Julien Cigar <julien@perdition.city> wrote:
>>>>>
>>>>>> On Thu, Jun 30, 2016 at 05:28:41PM +0200, Ben RUBSON wrote:
>>>>>>
>>>>>>> On 30 Jun 2016, at 17:14, InterNetX - Juergen Gotteswinter <jg@internetx.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On 30.06.2016 at 16:45, Julien Cigar wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I'm still in the process of setting up a redundant low-cost storage for
>>>>>>>> our (small, ~30 people) team here.
>>>>>>>>
>>>>>>>> I read quite a lot of articles/documentation/etc. and I plan to use HAST
>>>>>>>> with ZFS for the storage, CARP for the failover and the "good old NFS"
>>>>>>>> to mount the shares on the clients.
>>>>>>>>
>>>>>>>> The hardware is 2x HP ProLiant DL20 boxes with 2 dedicated disks for the
>>>>>>>> shared storage.
>>>>>>>>
>>>>>>>> Assuming the following configuration:
>>>>>>>> - MASTER is the active node and BACKUP is the standby node.
>>>>>>>> - two disks in each machine: ada0 and ada1.
>>>>>>>> - two interfaces in each machine: em0 and em1
>>>>>>>> - em0 is the primary interface (with CARP set up)
>>>>>>>> - em1 is dedicated to the HAST traffic (crossover cable)
>>>>>>>> - FreeBSD is properly installed on each machine.
>>>>>>>> - a HAST resource "disk0" for ada0p2.
>>>>>>>> - a HAST resource "disk1" for ada1p2.
>>>>>>>> - a zpool create zhast mirror /dev/hast/disk0 /dev/hast/disk1 is created
>>>>>>>> on MASTER
>>>>>>>>
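
(For reference, a minimal /etc/hast.conf sketch for the setup described
above; the node names "master"/"backup" and the 172.16.0.x addresses on
the em1 crossover link are assumptions, adjust them to your environment:

resource disk0 {
        on master {
                local /dev/ada0p2
                # assumed em1 address of BACKUP
                remote 172.16.0.2
        }
        on backup {
                local /dev/ada0p2
                # assumed em1 address of MASTER
                remote 172.16.0.1
        }
}

resource disk1 {
        on master {
                local /dev/ada1p2
                remote 172.16.0.2
        }
        on backup {
                local /dev/ada1p2
                remote 172.16.0.1
        }
}

followed by "hastctl create disk0" and "hastctl create disk1" on both
nodes, "service hastd start", and "hastctl role primary all" on MASTER
before the zpool create.)
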
>>>>>>>> A couple of questions I am still wondering about:
>>>>>>>> - If a disk dies on the MASTER I guess that zpool will not see it and
>>>>>>>> will transparently use the one on BACKUP through the HAST resource..
>>>>>>>
>>>>>>> that's right, as long as writes on $anything have been successful HAST is
>>>>>>> happy and won't start whining
>>>>>>>
>>>>>>>> is it a problem?
>>>>>>>
>>>>>>> imho yes, at least from a management point of view
>>>>>>>
>>>>>>>> could this lead to some corruption?
>>>>>>>
>>>>>>> probably, I never heard of anyone who uses that for a long time in
>>>>>>> production
>>>>>>>> At this stage the
>>>>>>>> common sense would be to replace the disk quickly, but imagine the
>>>>>>>> worst case scenario where ada1 on MASTER dies, zpool will not see it
>>>>>>>> and will transparently use the one from the BACKUP node (through the
>>>>>>>> "disk1" HAST resource), later ada0 on MASTER dies, zpool will not
>>>>>>>> see it and will transparently use the one from the BACKUP node
>>>>>>>> (through the "disk0" HAST resource). At this point on MASTER the two
>>>>>>>> disks are broken but the pool is still considered healthy ... What if
>>>>>>>> after that we unplug the em0 network cable on BACKUP? Storage is
>>>>>>>> down..
>>>>>>>> - Under heavy I/O the MASTER box suddenly dies (for some reason),
>>>>>>>> thanks to CARP the BACKUP node will switch from standby -> active and
>>>>>>>> execute the failover script which does some "hastctl role primary" for
>>>>>>>> the resources and a zpool import. I wondered if there are any
>>>>>>>> situations where the pool couldn't be imported (= data corruption)?
>>>>>>>> For example what if the pool hasn't been exported on the MASTER before
>>>>>>>> it dies?
>>>>>>>> - Is it a problem if the NFS daemons are started at boot on the standby
>>>>>>>> node, or should they only be started in the failover script? What
>>>>>>>> about stale files and active connections on the clients?
>>>>>>>
>>>>>>> sometimes stale mounts recover, sometimes not, sometimes clients even
>>>>>>> need reboots
>>>>>>>
>>>>>>>> - A catastrophic power failure occurs and MASTER and BACKUP are suddenly
>>>>>>>> powered down. Later the power returns, is it possible that some
>>>>>>>> problems occur (split-brain scenario?) regarding the order in which the
>>>>>>>
>>>>>>> sure, you need an exact procedure to recover
>>>>>>>
>>>>>>>> two machines boot up?
>>>>>>>
>>>>>>> best practice should be to keep everything down after boot
>>>>>>>
>>>>>>>> - Other things I have not thought of?
>>>>>>>>
>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Julien
>>>>>>>>
>>>>>>>
>>>>>>> imho:
>>>>>>>
>>>>>>> leave HAST where it is, go for ZFS replication. It will save your butt
>>>>>>> sooner or later if you avoid this fragile combination
>>>>>>
>>>>>> I was also replying, and finishing with this:
>>>>>> Why don't you set your slave as an iSCSI target and simply do ZFS mirroring?
>>>>>
>>>>> Yes that's another option, so a zpool with two mirrors (local +
>>>>> exported iSCSI)?
>>>>
>>>> Yes, you would then have a real-time replication solution (as HAST),
>>>> compared to ZFS send/receive which is not.
>>>> Depends on what you need :)
>>>
>>> More a real-time replication solution in fact ... :)
>>> Do you have any resource which summarizes all the pros and cons of HAST
>>> vs iSCSI? I have found a lot of articles on ZFS + HAST but not that much
>>> with ZFS + iSCSI ..
>>
>> # No resources, but some ideas:
>>
>> - ZFS likes to see all the details of its underlying disks, which is possible with local disks (of course) and iSCSI disks, not with HAST.
>> - The iSCSI solution is simpler: you only have ZFS to manage, and your replication is done by ZFS itself, not by an additional stack.
>> - HAST does not seem to be really maintained (I may be wrong), at least compared to DRBD, which HAST seems to be inspired from.
>> - You do not have to cross your fingers when you promote your slave to master ("will ZFS be happy with my HAST-replicated disks?"); ZFS mirrored the data by itself, you only have to import [-f].
>>
>> - (Auto)reconnection of iSCSI may not be as simple as with HAST; iSCSI could require more administration after a disconnection. But this could easily be done by a script.
>>
>> # Some advice based on my findings (I'm finishing my tests of such a solution):
>>
>> Write performance will suffer from network latency, but as long as your 2 nodes are in the same room, that should be OK.
>> If you are over a long-distance link, you may add several ms to each write IO, which, depending on the use case, may be a problem; ZFS may also be unresponsive.
>> Max throughput is also more difficult to achieve over a high-latency link.
>>
>> You will have to choose network cards depending on the number of disks and their throughput.
>> For example, if you need to resilver a SATA disk (180MB/s), then a simple 1Gb/s interface (~120MB/s) will be a serious bottleneck.
>> Think about scrub too.
>>
>> You may have to perform some network tuning (TCP window size, jumbo frames...) to reach your max bandwidth.
>> Trying to saturate the network link with (for example) iPerf before dealing with iSCSI seems to be a good idea.
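
A minimal sketch of such a test, assuming iperf (benchmarks/iperf) is
installed and 192.168.100.2 is the slave's replication-network address
(both are assumptions):

# on the slave (server side)
iperf -s
# on the master (client side): 30 second run, report every 5 seconds
iperf -c 192.168.100.2 -t 30 -i 5

If the reported bandwidth is far from the link's nominal speed, tune the
network first before blaming iSCSI.
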
>>
>> Here are some interesting sysctls so that ZFS will not hang (too long) in case of an unreachable iSCSI disk:
>> kern.iscsi.ping_timeout=5
>> kern.iscsi.iscsid_timeout=5
>> kern.iscsi.login_timeout=5
>> kern.iscsi.fail_on_disconnection=1
>> (adjust the 5 seconds depending on your needs / on your network quality).
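
A small sketch to make them persistent, with the same values as above
(adjust to your needs):

# append to /etc/sysctl.conf so the timeouts survive a reboot
kern.iscsi.ping_timeout=5
kern.iscsi.iscsid_timeout=5
kern.iscsi.login_timeout=5
kern.iscsi.fail_on_disconnection=1

They can also be applied on the fly, e.g. "sysctl kern.iscsi.ping_timeout=5".
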
>>
>> Take care when you (auto)replace disks: you may replace an iSCSI disk with a local disk, which of course would work but would be wrong in terms of master/slave redundancy.
>> Use nice labels on your disks so that if you have a lot of disks in your pool, you quickly know which one is local and which one is remote.
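
One possible way to do this is with GEOM labels; a sketch, where the
label names and the ada0p2/ada1p2 partitions are only examples:

# local disks on the master
glabel label local-disk0 ada0p2
glabel label local-disk1 ada1p2
# iSCSI disks exported by the slave (as seen on the master)
glabel label remote-disk0 da0
glabel label remote-disk1 da1

Then reference /dev/label/local-disk0, /dev/label/remote-disk0, etc. when
creating the pool, so zpool status shows at a glance which side each disk
belongs to.
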
>>
>> # send/receive pros:
>>
>> In terms of data safety, one of the advantages of ZFS send/receive is that you have a totally different target pool, which can be interesting if you ever have a disaster with your primary pool.
>> As a 3rd node solution? On another site? (as send/receive does not suffer from latency as iSCSI would)
>
> Thank you very much for this advice, it is much appreciated!
>
> I'll definitely go with iSCSI (with which I don't have that much
> experience) over HAST.
>
> Maybe a stupid question but, assuming on the MASTER with ada{0,1} the
> local disks and da{0,1} the exported iSCSI disks from the SLAVE, would
> you go with:
>
> $> zpool create storage mirror /dev/ada0s1 /dev/ada1s1 mirror /dev/da0
> /dev/da1

No, if you lose the connection with the slave node, your pool will go offline!

> or rather:
>
> $> zpool create storage mirror /dev/ada0s1 /dev/da0 mirror /dev/ada1s1
> /dev/da1

Yes, each master disk is mirrored with a slave disk.
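
On the FreeBSD side this could look like the following; a rough sketch
only, where the IQNs, the 192.168.100.x replication addresses and the
ada0p2/ada1p2 partitions are assumptions:

# /etc/ctl.conf on the SLAVE (with ctld_enable="YES" in rc.conf)
portal-group pg0 {
        discovery-auth-group no-authentication
        # assumed replication-network address of the slave
        listen 192.168.100.2
}
target iqn.2016-07.city.perdition:disk0 {
        auth-group no-authentication
        portal-group pg0
        lun 0 {
                path /dev/ada0p2
        }
}
target iqn.2016-07.city.perdition:disk1 {
        auth-group no-authentication
        portal-group pg0
        lun 0 {
                path /dev/ada1p2
        }
}

# on the MASTER (with iscsid_enable="YES" in rc.conf)
iscsictl -A -p 192.168.100.2 -t iqn.2016-07.city.perdition:disk0
iscsictl -A -p 192.168.100.2 -t iqn.2016-07.city.perdition:disk1
zpool create storage mirror /dev/ada0p2 /dev/da0 mirror /dev/ada1p2 /dev/da1
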

> I guess the former is better, but it's just to be sure .. (or maybe it's
> better to iSCSI export a ZVOL from the SLAVE?)
>
> Correct me if I'm wrong but, from a safety point of view this setup is
> also the safest as you'll get the "fullsync" equivalent mode of HAST
> (but it's also the slowest), so I can be 99.99% confident that the
> pool on the SLAVE will never be corrupted, even in the case where the
> MASTER suddenly dies (power outage, etc), and that a zpool import -f
> storage will always work?

The pool on the slave is the same as the pool on the master, as it uses the same disks :)
Only the physical host will change.
So yes, you can be confident.
There is still the case where any ZFS pool could be totally damaged (due to a bug for example).
It "should" not happen, but we never know :)
This is why I was talking about a third node / second pool made from a delayed send/receive.
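
A minimal sketch of such a delayed replication, run periodically (from
cron for example) on the master; the host name "third-node", the target
pool "backup" and the snapshot names are only examples:

zfs snapshot -r storage@repl-20160701
zfs send -R -i storage@repl-20160630 storage@repl-20160701 | \
        ssh third-node zfs receive -F backup/storage
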

> One last thing: this "storage" pool will be exported through NFS to the
> clients, and when a failover occurs they should, in theory, not notice
> it. I know that it's pretty hypothetical but I wondered if pfsync could
> play a role in this area (active connections)..?

There will certainly be some small timeouts due to the failover delay.
You should run some tests to analyze NFS behaviour depending on the failover delay.
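
A rough sketch of what the failover side could look like, assuming devd(8)
is used to catch the CARP transition; the vhid/interface "1@em0", the
script path and the pool name are assumptions:

# /etc/devd.conf on both nodes
notify 30 {
        match "system" "CARP";
        match "subsystem" "1@em0";
        match "type" "MASTER";
        action "/usr/local/sbin/failover-to-master";
};

# /usr/local/sbin/failover-to-master (sketch)
#!/bin/sh
# become the active storage head: grab the pool, then serve NFS
zpool import -f storage || exit 1
service rpcbind onestart
service mountd onestart
service nfsd onestart

The BACKUP transition would do the reverse (stop NFS, zpool export). Measure
how long the import takes under load, as that is most of your failover delay.
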

Good question regarding pfsync, I'm not so familiar with it :)



Of course, make a good POC before putting this into production.
Don't forget to test scrub, resilver, power failure, network failure...

And perhaps someone may have additional comments / ideas / reservations on this topic?



> Thanks!
> Julien
>
>>
>>>>>> ZFS would then know as soon as a disk is failing.
>>>>>> And if the master fails, you only have to import (-f certainly, in case
>>>>>> of a master power failure) on the slave.
>>>>>>
>>>>>> Ben
>
> --
> Julien Cigar
> Belgian Biodiversity Platform (http://www.biodiversity.be)
> PGP fingerprint: EEF9 F697 4B68 D275 7B11  6A25 B2BB 3710 A204 23C0
> No trees were killed in the creation of this message.
> However, many electrons were terribly inconvenienced.



