From owner-freebsd-fs@freebsd.org  Fri Jul  1 08:59:30 2016
Date: Fri, 1 Jul 2016 10:47:17 +0200
From: Julien Cigar <julien@perdition.city>
To: Ben RUBSON
Cc: freebsd-fs@freebsd.org
Subject: Re: HAST + ZFS + NFS + CARP
Message-ID: <20160701084717.GE5695@mordor.lan>
In-Reply-To: <50BF1AEF-3ECC-4C30-B8E1-678E02735BB5@gmail.com>

On Thu, Jun 30, 2016 at 11:35:49PM +0200, Ben RUBSON wrote:
> 
> > On 30 Jun 2016, at 18:35, Julien Cigar wrote:
> > 
> > On Thu, Jun 30, 2016 at 05:42:04PM +0200, Ben RUBSON wrote:
> >> 
> >>> On 30 Jun 2016, at 17:37, Julien Cigar wrote:
> >>> 
> >>>> On Thu, Jun 30, 2016 at 05:28:41PM +0200, Ben RUBSON wrote:
> >>>> 
> >>>>> On 30 Jun 2016, at 17:14, InterNetX - Juergen Gotteswinter wrote:
> >>>>> 
> >>>>>> On 30.06.2016 at 16:45, Julien Cigar wrote:
> >>>>>> Hello,
> >>>>>> 
> >>>>>> I'm still in the process of setting up a redundant low-cost storage
> >>>>>> for our (small, ~30 people) team here.
> >>>>>> 
> >>>>>> I read quite a lot of articles/documentation/etc. and I plan to use
> >>>>>> HAST with ZFS for the storage, CARP for the failover and the "good
> >>>>>> old NFS" to mount the shares on the clients.
> >>>>>> 
> >>>>>> The hardware is 2x HP ProLiant DL20 boxes with 2 dedicated disks for
> >>>>>> the shared storage.
> >>>>>> 
> >>>>>> Assuming the following configuration:
> >>>>>> - MASTER is the active node and BACKUP is the standby node.
> >>>>>> - two disks in each machine: ada0 and ada1.
> >>>>>> - two interfaces in each machine: em0 and em1.
> >>>>>> - em0 is the primary interface (with CARP set up).
> >>>>>> - em1 is dedicated to the HAST traffic (crossover cable).
> >>>>>> - FreeBSD is properly installed on each machine.
> >>>>>> - a HAST resource "disk0" for ada0p2.
> >>>>>> - a HAST resource "disk1" for ada1p2.
> >>>>>> - a "zpool create zhast mirror /dev/hast/disk0 /dev/hast/disk1" is
> >>>>>> created on MASTER.
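(For reference, a minimal sketch of what the HAST side of that configuration
could look like. Only the resource names and partitions come from the list
above; the node hostnames "master"/"backup" and the 172.16.0.x addresses on
the em1 crossover link are assumptions.)

/etc/hast.conf, identical on both nodes:

    resource disk0 {
            on master {
                    local /dev/ada0p2
                    remote 172.16.0.2
            }
            on backup {
                    local /dev/ada0p2
                    remote 172.16.0.1
            }
    }
    resource disk1 {
            on master {
                    local /dev/ada1p2
                    remote 172.16.0.2
            }
            on backup {
                    local /dev/ada1p2
                    remote 172.16.0.1
            }
    }

Then, roughly:

    # on both nodes: initialize the HAST metadata and start hastd
    hastctl create disk0 && hastctl create disk1
    service hastd onestart

    # on BACKUP
    hastctl role secondary disk0 && hastctl role secondary disk1

    # on MASTER
    hastctl role primary disk0 && hastctl role primary disk1
    zpool create zhast mirror /dev/hast/disk0 /dev/hast/disk1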
> >>>>>> A couple of questions I am still wondering about:
> >>>>>> - If a disk dies on the MASTER I guess that zpool will not see it and
> >>>>>> will transparently use the one on BACKUP through the HAST resource..
> >>>>> 
> >>>>> thats right, as long as writes on $anything have been successful hast is
> >>>>> happy and wont start whining
> >>>>> 
> >>>>>> is it a problem?
> >>>>> 
> >>>>> imho yes, at least from a management view
> >>>>> 
> >>>>>> could this lead to some corruption?
> >>>>> 
> >>>>> probably, i never heard about anyone who used that for a long time in
> >>>>> production
> >>>>> 
> >>>>>> At this stage the common sense would be to replace the disk quickly, but
> >>>>>> imagine the worst-case scenario where ada1 on MASTER dies, zpool will not
> >>>>>> see it and will transparently use the one from the BACKUP node (through
> >>>>>> the "disk1" HAST resource); later ada0 on MASTER dies, zpool will not
> >>>>>> see it and will transparently use the one from the BACKUP node (through
> >>>>>> the "disk0" HAST resource). At this point on MASTER the two disks are
> >>>>>> broken but the pool is still considered healthy ... What if after that
> >>>>>> we unplug the em0 network cable on BACKUP? Storage is down..
> >>>>>> - Under heavy I/O the MASTER box suddenly dies (for some reason); thanks
> >>>>>> to CARP the BACKUP node will switch from standby -> active and execute
> >>>>>> the failover script which does some "hastctl role primary" for the
> >>>>>> resources and a zpool import. I wondered if there are any situations
> >>>>>> where the pool couldn't be imported (= data corruption)? For example,
> >>>>>> what if the pool hasn't been exported on the MASTER before it dies?
> >>>>>> - Is it a problem if the NFS daemons are started at boot on the standby
> >>>>>> node, or should they only be started in the failover script? What about
> >>>>>> stale files and active connections on the clients?
> >>>>> 
> >>>>> sometimes stale mounts recover, sometimes not, sometimes clients even
> >>>>> need reboots
> >>>>> 
> >>>>>> - A catastrophic power failure occurs and MASTER and BACKUP are suddenly
> >>>>>> powered down. Later the power returns; is it possible that some problem
> >>>>>> occurs (split-brain scenario?) regarding the order in which the two
> >>>>>> machines boot up?
> >>>>> 
> >>>>> sure, you need an exact procedure to recover;
> >>>>> best practice should be to keep everything down after boot
> >>>>> 
> >>>>>> - Other things I have not thought of?
> >>>>>> 
> >>>>>> Thanks!
> >>>>>> Julien
> >>>>> 
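(As an aside to the failover question above: the usual way to wire this up is
to have devd(8) react to the CARP state change on em0 and run a script which
promotes the HAST resources and imports the pool. A rough, untested sketch;
the vhid "1", the script path and the NFS restart at the end are assumptions,
only em0, the resource names and the pool name come from the thread.)

Addition to /etc/devd.conf on both nodes:

    notify 30 {
            match "system" "CARP";
            match "subsystem" "1@em0";
            match "type" "MASTER";
            action "/usr/local/sbin/hast-failover.sh";
    };

/usr/local/sbin/hast-failover.sh:

    #!/bin/sh
    # Promote the HAST resources on this node, then import the pool.
    for res in disk0 disk1; do
            hastctl role primary $res
    done
    # give /dev/hast/* a moment to appear
    sleep 2
    zpool import -f zhast
    # restart the NFS-related services so they pick up the imported datasets
    service mountd onerestart
    service nfsd onerestart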
> >>>>> imho:
> >>>>> 
> >>>>> leave hast where it is, go for zfs replication. will save your butt
> >>>>> sooner or later if you avoid this fragile combination
> >>>> 
> >>>> I was also replying, and finishing with this:
> >>>> Why don't you set your slave up as an iSCSI target and simply do ZFS
> >>>> mirroring?
> >>> 
> >>> Yes, that's another option, so a zpool with two mirrors (local +
> >>> exported iSCSI)?
> >> 
> >> Yes, you would then have a real-time replication solution (as with HAST),
> >> compared to ZFS send/receive which is not.
> >> Depends on what you need :)
> > 
> > More a real-time replication solution in fact ... :)
> > Do you have any resource which summarizes all the pro(s) and con(s) of
> > HAST vs iSCSI? I have found a lot of articles on ZFS + HAST but not that
> > much on ZFS + iSCSI ..
> 
> # No resources, but some ideas:
> 
> - ZFS likes to see all the details of its underlying disks, which is
> possible with local disks (of course) and iSCSI disks, not with HAST.
> - The iSCSI solution is simpler: you only have ZFS to manage, your
> replication is done by ZFS itself, not by an additional stack.
> - HAST does not seem to be really maintained (I may be wrong), at least
> compared to DRBD, which HAST seems to be inspired from.
> - You do not have to cross your fingers when you promote your slave to
> master ("will ZFS be happy with my HAST-replicated disks?"): ZFS mirrored
> the data by itself, you only have to import [-f].
> - (auto)reconnection of iSCSI may not be as simple as with HAST; iSCSI
> could require more administration after a disconnection. But this could
> easily be done by a script.
> 
> # Some "advice" based on my findings (I'm finishing my tests of such a
> solution):
> 
> Write performance will suffer from network latency, but as long as your 2
> nodes are in the same room, that should be OK.
> If you are over a long-distance link, you may add several ms to each write
> IO, which, depending on the use case, may be a problem; ZFS may also become
> unresponsive.
> Max throughput is also more difficult to achieve over a high-latency link.
> 
> You will have to choose network cards depending on the number of disks and
> their throughput.
> For example, if you need to resilver a SATA disk (180MB/s), then a simple
> 1Gb interface (~120MB/s) will be a serious bottleneck.
> Think about scrub too.
> 
> You will probably have to perform some network tuning (TCP window size,
> jumbo frames...) to reach your max bandwidth.
> Trying to saturate the network link with (for example) iperf before dealing
> with iSCSI seems to be a good idea.
> 
> Here are some interesting sysctls so that ZFS will not hang (too long) in
> case of an unreachable iSCSI disk:
> kern.iscsi.ping_timeout=5
> kern.iscsi.iscsid_timeout=5
> kern.iscsi.login_timeout=5
> kern.iscsi.fail_on_disconnection=1
> (adjust the 5 seconds depending on your needs / on your network quality).
> 
> Take care when you (auto)replace disks: you may replace an iSCSI disk with
> a local disk, which of course would work but would be wrong in terms of
> master/slave redundancy.
> Use nice labels on your disks so that if you have a lot of disks in your
> pool, you quickly know which one is local and which one is remote.
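(To make the iSCSI option above concrete: the SLAVE would export its two
disks with ctld(8) and the MASTER would attach them with iscsictl(8). A
minimal sketch; the portal address 172.16.0.2 on the crossover link, the
target names and the choice to export the raw partitions rather than zvols
are assumptions.)

/etc/ctl.conf on the SLAVE:

    portal-group pg0 {
            discovery-auth-group no-authentication
            listen 172.16.0.2
    }

    target iqn.2016-07.lan.storage:disk0 {
            auth-group no-authentication
            portal-group pg0
            lun 0 {
                    path /dev/ada0p2
            }
    }

    target iqn.2016-07.lan.storage:disk1 {
            auth-group no-authentication
            portal-group pg0
            lun 0 {
                    path /dev/ada1p2
            }
    }

Enable and start the target on the SLAVE, then attach from the MASTER:

    # SLAVE
    sysrc ctld_enable=YES && service ctld start

    # MASTER
    sysrc iscsid_enable=YES && service iscsid start
    iscsictl -A -p 172.16.0.2 -t iqn.2016-07.lan.storage:disk0
    iscsictl -A -p 172.16.0.2 -t iqn.2016-07.lan.storage:disk1

The two targets should then show up as da0/da1 on the MASTER and can be
mirrored against the local disks; the kern.iscsi.* sysctls above would go
into /etc/sysctl.conf.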
> 
> # send/receive pro(s):
> 
> In terms of data safety, one of the interests of ZFS send/receive is that
> you have a totally different target pool, which can be interesting if ever
> you have a disaster with your primary pool.
> As a 3rd-node solution? On another site? (as send/receive does not suffer
> from latency the way iSCSI would)

Thank you very much for all this advice, it is much appreciated!

I'll definitely go with iSCSI (with which I don't have that much
experience) over HAST.

Maybe a stupid question but, assuming that on the MASTER ada{0,1} are the
local disks and da{0,1} are the iSCSI disks exported from the SLAVE, would
you go with:

$> zpool create storage mirror /dev/ada0s1 /dev/ada1s1 mirror /dev/da0 /dev/da1

or rather:

$> zpool create storage mirror /dev/ada0s1 /dev/da0 mirror /dev/ada1s1 /dev/da1

I guess the latter is better (each local disk mirrored against a remote
one), but it's just to be sure .. (or maybe it's better to iSCSI-export a
ZVOL from the SLAVE?)

Correct me if I'm wrong but, from a safety point of view, this setup is
also the safest as you'll get the equivalent of HAST's "fullsync" mode
(but it's also the slowest), so I can be 99.99% confident that the pool on
the SLAVE will never be corrupted, even in the case where the MASTER
suddenly dies (power outage, etc), and that a "zpool import -f storage"
will always work?

One last thing: this "storage" pool will be exported through NFS to the
clients, and when a failover occurs they should, in theory, not notice it.
I know that it's pretty hypothetical but I wondered if pfsync could play a
role in this area (active connections)..?

Thanks!
Julien

> >>>> ZFS would then know as soon as a disk is failing.
> >>>> And if the master fails, you only have to import (-f certainly, in
> >>>> case of a master power failure) on the slave.
> >>>> 
> >>>> Ben

-- 
Julien Cigar
Belgian Biodiversity Platform (http://www.biodiversity.be)
PGP fingerprint: EEF9 F697 4B68 D275 7B11 6A25 B2BB 3710 A204 23C0

No trees were killed in the creation of this message.
However, many electrons were terribly inconvenienced.