Date:      Fri, 1 Jul 2016 10:47:17 +0200
From:      Julien Cigar <julien@perdition.city>
To:        Ben RUBSON <ben.rubson@gmail.com>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: HAST + ZFS + NFS + CARP
Message-ID:  <20160701084717.GE5695@mordor.lan>
In-Reply-To: <50BF1AEF-3ECC-4C30-B8E1-678E02735BB5@gmail.com>
References:  <20160630144546.GB99997@mordor.lan> <71b8da1e-acb2-9d4e-5d11-20695aa5274a@internetx.com> <AD42D8FD-D07B-454E-B79D-028C1EC57381@gmail.com> <20160630153747.GB5695@mordor.lan> <63C07474-BDD5-42AA-BF4A-85A0E04D3CC2@gmail.com> <20160630163541.GC5695@mordor.lan> <50BF1AEF-3ECC-4C30-B8E1-678E02735BB5@gmail.com>

On Thu, Jun 30, 2016 at 11:35:49PM +0200, Ben RUBSON wrote:
>
> > On 30 Jun 2016, at 18:35, Julien Cigar <julien@perdition.city> wrote:
> >
> > On Thu, Jun 30, 2016 at 05:42:04PM +0200, Ben RUBSON wrote:
> >>
> >>
> >>> On 30 Jun 2016, at 17:37, Julien Cigar <julien@perdition.city> wrote:
> >>>
> >>>> On Thu, Jun 30, 2016 at 05:28:41PM +0200, Ben RUBSON wrote:
> >>>>
> >>>>> On 30 Jun 2016, at 17:14, InterNetX - Juergen Gotteswinter <jg@internetx.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On 30.06.2016 at 16:45, Julien Cigar wrote:
> >>>>>> Hello,
> >>>>>>
> >>>>>> I'm still in the process of setting up redundant low-cost storage
> >>>>>> for our (small, ~30 people) team here.
> >>>>>>
> >>>>>> I've read quite a lot of articles/documentation/etc. and I plan to
> >>>>>> use HAST with ZFS for the storage, CARP for the failover and the
> >>>>>> "good old NFS" to mount the shares on the clients.
> >>>>>>
> >>>>>> The hardware is 2x HP ProLiant DL20 boxes with 2 dedicated disks
> >>>>>> for the shared storage.
> >>>>>>
> >>>>>> Assuming the following configuration:
> >>>>>> - MASTER is the active node and BACKUP is the standby node.
> >>>>>> - two disks in each machine: ada0 and ada1.
> >>>>>> - two interfaces in each machine: em0 and em1
> >>>>>> - em0 is the primary interface (with CARP setup)
> >>>>>> - em1 is dedicated to the HAST traffic (crossover cable)
> >>>>>> - FreeBSD is properly installed in each machine.
> >>>>>> - a HAST resource "disk0" for ada0p2.
> >>>>>> - a HAST resource "disk1" for ada1p2.
> >>>>>> - a zpool create zhast mirror /dev/hast/disk0 /dev/hast/disk1 is
> >>>>>>   created on MASTER
> >>>>>>
> >>>>>> A couple of questions I am still wondering about:
> >>>>>> - If a disk dies on the MASTER I guess that zpool will not see it
> >>>>>>   and will transparently use the one on BACKUP through the HAST
> >>>>>>   resource..
> >>>>>
> >>>>> that's right, as long as writes on $anything have been successful
> >>>>> HAST is happy and won't start whining
> >>>>>
> >>>>>> is it a problem?
> >>>>>
> >>>>> imho yes, at least from a management point of view
> >>>>>
> >>>>>> could this lead to some corruption?
> >>>>>
> >>>>> probably, I never heard of anyone who has used that in production
> >>>>> for a long time
> >>>>>
> >>>>>> At this stage the common sense would be to replace the disk
> >>>>>> quickly, but imagine the worst case scenario where ada1 on MASTER
> >>>>>> dies, zpool will not see it and will transparently use the one
> >>>>>> from the BACKUP node (through the "disk1" HAST resource), later
> >>>>>> ada0 on MASTER dies, zpool will not see it and will transparently
> >>>>>> use the one from the BACKUP node (through the "disk0" HAST
> >>>>>> resource). At this point on MASTER the two disks are broken but
> >>>>>> the pool is still considered healthy ... What if after that we
> >>>>>> unplug the em0 network cable on BACKUP? Storage is down..
> >>>>>> - Under heavy I/O the MASTER box suddenly dies (for some reason);
> >>>>>>   thanks to CARP the BACKUP node will switch from standby -> active
> >>>>>>   and execute the failover script which does some "hastctl role
> >>>>>>   primary" for the resources and a zpool import. I wondered if
> >>>>>>   there are any situations where the pool couldn't be imported
> >>>>>>   (= data corruption)? For example what if the pool hasn't been
> >>>>>>   exported on the MASTER before it dies?
> >>>>>> - Is it a problem if the NFS daemons are started at boot on the
> >>>>>>   standby node, or should they only be started in the failover
> >>>>>>   script? What about stale files and active connections on the
> >>>>>>   clients?
> >>>>>
> >>>>> sometimes stale mounts recover, sometimes not, sometimes clients
> >>>>> even need reboots
> >>>>>
> >>>>>> - A catastrophic power failure occurs and MASTER and BACKUP are
> >>>>>>   suddenly powered down. Later the power returns; is it possible
> >>>>>>   that some problem occurs (split-brain scenario?) regarding the
> >>>>>>   order in which the
> >>>>>
> >>>>> sure, you need an exact procedure to recover
> >>>>>
> >>>>>> two machines boot up?
> >>>>>
> >>>>> best practice should be to keep everything down after boot
> >>>>>
> >>>>>> - Other things I have not thought of?
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>> Thanks!
> >>>>>> Julien
> >>>>>>
> >>>>>
> >>>>>
> >>>>> imho:
> >>>>>
> >>>>> leave HAST where it is, go for ZFS replication. It will save your
> >>>>> butt sooner or later if you avoid this fragile combination
> >>>>
> >>>> I was also replying, and was finishing with this:
> >>>> Why don't you set up your slave as an iSCSI target and simply do
> >>>> ZFS mirroring?
> >>>
> >>> Yes, that's another option, so a zpool with two mirrors (local +
> >>> exported iSCSI)?
> >>
> >> Yes, you would then have a real-time replication solution (as with
> >> HAST), compared to ZFS send/receive which is not.
> >> Depends on what you need :)
> >
> > More of a real-time replication solution in fact ... :)
> > Do you have any resource which summarizes all the pros and cons of
> > HAST vs iSCSI? I have found a lot of articles on ZFS + HAST but not
> > that much with ZFS + iSCSI ..
>
> # No resources, but some ideas:
>
> - ZFS likes to see all the details of its underlying disks, which is
>   possible with local disks (of course) and iSCSI disks, not with HAST.
> - The iSCSI solution is simpler, you only have ZFS to manage, your
>   replication is done by ZFS itself, not by an additional stack.
> - HAST does not seem to be really maintained (I may be wrong), at least
>   compared to DRBD, which HAST seems to be inspired from.
> - You do not have to cross your fingers when you promote your slave to
>   master ("will ZFS be happy with my HAST-replicated disks?"), ZFS
>   mirrored the data by itself, you only have to import [-f].
>
> - (auto)reconnection of iSCSI may not be as simple as with HAST; iSCSI
>   could require more administration after a disconnection. But this
>   could easily be done by a script.
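
Indeed, a small periodic check from cron should be enough. Something
rough along these lines is what I have in mind (completely untested; the
target name and portal address are made up, and I'm assuming iscsictl -L
reports a "Connected" state for healthy sessions):

#!/bin/sh
# Re-attach the SLAVE's iSCSI target if the session is gone.
PORTAL="192.168.100.2"                  # SLAVE, crossover interface
TARGET="iqn.2016-07.lan.mordor:disk0"   # example IQN

if ! iscsictl -L | grep "$TARGET" | grep -q Connected; then
    iscsictl -R -t "$TARGET" 2>/dev/null    # drop any stale session
    iscsictl -A -p "$PORTAL" -t "$TARGET"   # and re-add it
fi
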
>
> # Some advice based on my findings (I'm finishing my tests of such a
> solution):
>
> Write performance will suffer from network latency, but as long as your
> 2 nodes are in the same room, that should be OK.
> If you are over a long-distance link, you may add several ms to each
> write IO, which, depending on the use case, may be a problem; ZFS may
> also be unresponsive.
> Max throughput is also more difficult to achieve over a high-latency
> link.
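
Just to put rough numbers on that: if the link added, say, a 5 ms round
trip to each write, a single outstanding synchronous write could not do
better than about 1000 / 5 = 200 IOPS, whatever the disks themselves are
capable of. Over a crossover cable in the same room the added latency
should be well under a millisecond, so I guess I'm fine there.
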
>
> You will have to choose network cards depending on the number of disks
> and their throughput.
> For example, if you need to resilver a SATA disk (180MB/s), then a
> simple 1Gb interface (~120MB/s) will be a serious bottleneck.
> Think about scrub too.
>
> You will have to perform some network tuning (TCP window size, jumbo
> frames...) to reach your max bandwidth.
> Trying to saturate the network link with (for example) iPerf before
> dealing with iSCSI seems to be a good idea.
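
Good point, I'll benchmark the crossover link before going any further.
Something as simple as this should already tell me a lot (assuming
iperf3 from benchmarks/iperf3; the SLAVE address below is just an
example):

$> iperf3 -s                        # on the SLAVE
$> iperf3 -c 192.168.100.2 -P 4     # on the MASTER, 4 parallel streams
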
>
> Here are some interesting sysctls so that ZFS will not hang (too long)
> in case of an unreachable iSCSI disk:
> kern.iscsi.ping_timeout=5
> kern.iscsi.iscsid_timeout=5
> kern.iscsi.login_timeout=5
> kern.iscsi.fail_on_disconnection=1
> (adjust the 5 seconds depending on your needs / on your network quality).
>
> Take care when you (auto)replace disks: you may replace an iSCSI disk
> with a local disk, which of course would work but would be wrong in
> terms of master/slave redundancy.
> Use nice labels on your disks so that if you have a lot of disks in
> your pool, you quickly know which one is local, which one is remote.
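
Good point about the labels. I guess glabel would do the trick, something
along these lines (label names are just examples); if I'm not mistaken
the glabel metadata lives in the provider's last sector, so a label
written on the SLAVE's partition should also show up under /dev/label/
on the MASTER once the LUN is attached:

$> glabel label master-disk0 /dev/ada0p2   # on the MASTER, local disk
$> glabel label slave-disk0 /dev/ada0p2    # on the SLAVE, before exporting it

and then build the pool on the /dev/label/* devices rather than on the
raw ada/da names.
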
>
> # send/receive pro(s):
>
> In terms of data safety, one of the interesting points of ZFS
> send/receive is that you have a totally different target pool, which
> can be interesting if ever you have a disaster with your primary pool.
> As a 3rd-node solution? On another site? (as send/receive does not
> suffer from latency the way iSCSI would)
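
A 3rd box off-site is indeed tempting. A nightly snapshot + incremental
send over SSH is probably what I'd start with, something like this
(pool/host names are made up, untested):

$> # initial full copy
$> zfs snapshot -r storage@2016-06-30
$> zfs send -R storage@2016-06-30 | ssh offsite-host zfs receive -dF backup
$> # then periodic incrementals
$> zfs snapshot -r storage@2016-07-01
$> zfs send -R -i storage@2016-06-30 storage@2016-07-01 | \
     ssh offsite-host zfs receive -dF backup
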

Thank you very much for all this advice, it is much appreciated!

I'll definitely go with iSCSI (with which I don't have that much
experience) over HAST.

Maybe a stupid question, but assuming that on the MASTER ada{0,1} are the
local disks and da{0,1} the exported iSCSI disks from the SLAVE, would
you go with:

$> zpool create storage mirror /dev/ada0s1 /dev/ada1s1 \
     mirror /dev/da0 /dev/da1

or rather:

$> zpool create storage mirror /dev/ada0s1 /dev/da0 \
     mirror /dev/ada1s1 /dev/da1

I guess the former is better, but it's just to be sure .. (or maybe it's
better to iSCSI export a ZVOL from the SLAVE?)
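
For the export itself, what I have in mind on the SLAVE is a plain ctld
setup exporting the partitions (or ZVOLs), roughly like this (the IQNs
and the crossover address are made up, and I still have to test all of
this):

# /etc/ctl.conf on the SLAVE
portal-group pg0 {
        discovery-auth-group no-authentication
        listen 192.168.100.2
}

target iqn.2016-07.lan.mordor:disk0 {
        auth-group no-authentication
        portal-group pg0
        lun 0 {
                path /dev/ada0p2
                # or a ZVOL: path /dev/zvol/slavepool/disk0
        }
}
# (plus a second target for ada1p2 / disk1)

$> sysrc ctld_enable=YES && service ctld start      # on the SLAVE

and on the MASTER:

$> sysrc iscsid_enable=YES && service iscsid start
$> iscsictl -A -p 192.168.100.2 -t iqn.2016-07.lan.mordor:disk0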

Correct me if I'm wrong, but from a safety point of view this setup is
also the safest, as you get the equivalent of HAST's "fullsync" mode
(but it's also the slowest), so I can be 99.99% confident that the
pool on the SLAVE will never be corrupted, even in the case where the
MASTER suddenly dies (power outage, etc.), and that a zpool import -f
storage will always work?

One last thing: this "storage" pool will be exported through NFS to the
clients, and when a failover occurs they should, in theory, not notice
it. I know that it's pretty hypothetical, but I wondered if pfsync could
play a role in this area (active connections)..?

Thanks!
Julien

>
> >>>> ZFS would then know as soon as a disk is failing.
> >>>> And if the master fails, you only have to import (-f certainly, in
> >>>> case of a master power failure) on the slave.
> >>>>
> >>>> Ben
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"

-- 
Julien Cigar
Belgian Biodiversity Platform (http://www.biodiversity.be)
PGP fingerprint: EEF9 F697 4B68 D275 7B11  6A25 B2BB 3710 A204 23C0
No trees were killed in the creation of this message.
However, many electrons were terribly inconvenienced.



