Date: Fri, 1 Jul 2016 11:19:08 +0200
From: Ben RUBSON <ben.rubson@gmail.com>
To: freebsd-fs@freebsd.org
Subject: Re: HAST + ZFS + NFS + CARP
Message-ID: <26A31227-B71D-4854-B046-61CD3449E442@gmail.com>
In-Reply-To: <20160701084717.GE5695@mordor.lan>
References: <20160630144546.GB99997@mordor.lan> <71b8da1e-acb2-9d4e-5d11-20695aa5274a@internetx.com> <AD42D8FD-D07B-454E-B79D-028C1EC57381@gmail.com> <20160630153747.GB5695@mordor.lan> <63C07474-BDD5-42AA-BF4A-85A0E04D3CC2@gmail.com> <20160630163541.GC5695@mordor.lan> <50BF1AEF-3ECC-4C30-B8E1-678E02735BB5@gmail.com> <20160701084717.GE5695@mordor.lan>

> On 01 Jul 2016, at 10:47, Julien Cigar <julien@perdition.city> wrote:
>
> On Thu, Jun 30, 2016 at 11:35:49PM +0200, Ben RUBSON wrote:
>>
>>> On 30 Jun 2016, at 18:35, Julien Cigar <julien@perdition.city> wrote:
>>>
>>> On Thu, Jun 30, 2016 at 05:42:04PM +0200, Ben RUBSON wrote:
>>>>
>>>>> On 30 Jun 2016, at 17:37, Julien Cigar <julien@perdition.city> wrote:
>>>>>
>>>>>> On Thu, Jun 30, 2016 at 05:28:41PM +0200, Ben RUBSON wrote:
>>>>>>
>>>>>>> On 30 Jun 2016, at 17:14, InterNetX - Juergen Gotteswinter <jg@internetx.com> wrote:
>>>>>>>
>>>>>>>> On 30.06.2016 at 16:45, Julien Cigar wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I'm still in the process of setting up a redundant low-cost storage for our (small, ~30 people) team here.
>>>>>>>>
>>>>>>>> I read quite a lot of articles/documentation/etc and I plan to use HAST with ZFS for the storage, CARP for the failover and the "good old NFS" to mount the shares on the clients.
>>>>>>>>
>>>>>>>> The hardware is 2x HP ProLiant DL20 boxes with 2 dedicated disks for the shared storage.
>>>>>>>>
>>>>>>>> Assuming the following configuration:
>>>>>>>> - MASTER is the active node and BACKUP is the standby node.
>>>>>>>> - two disks in each machine: ada0 and ada1.
>>>>>>>> - two interfaces in each machine: em0 and em1
>>>>>>>> - em0 is the primary interface (with CARP setup)
>>>>>>>> - em1 is dedicated to the HAST traffic (crossover cable)
>>>>>>>> - FreeBSD is properly installed on each machine.
>>>>>>>> - a HAST resource "disk0" for ada0p2.
>>>>>>>> - a HAST resource "disk1" for ada1p2.
>>>>>>>> - a zpool create zhast mirror /dev/hast/disk0 /dev/hast/disk1 is created on MASTER
>>>>>>>>
>>>>>>>> A couple of questions I am still wondering about:
>>>>>>>> - If a disk dies on the MASTER I guess that zpool will not see it and will transparently use the one on BACKUP through the HAST resource..
>>>>>>>
>>>>>>> that's right, as long as writes on $anything have been successful HAST is happy and won't start whining
>>>>>>>
>>>>>>>> is it a problem?
>>>>>>>
>>>>>>> imho yes, at least from a management view
>>>>>>>
>>>>>>>> could this lead to some corruption?
>>>>>>>
>>>>>>> probably, I never heard about anyone who used that for a long time in production
>>>>>>>
>>>>>>>> At this stage the common sense would be to replace the disk quickly, but imagine the worst case scenario where ada1 on MASTER dies, zpool will not see it and will transparently use the one from the BACKUP node (through the "disk1" HAST resource), later ada0 on MASTER dies, zpool will not see it and will transparently use the one from the BACKUP node (through the "disk0" HAST resource). At this point on MASTER the two disks are broken but the pool is still considered healthy ... What if after that we unplug the em0 network cable on BACKUP? Storage is down..
>>>>>>>> - Under heavy I/O the MASTER box suddenly dies (for some reason), thanks to CARP the BACKUP node will switch from standby -> active and execute the failover script which does some "hastctl role primary" for the resources and a zpool import. I wondered if there are any situations where the pool couldn't be imported (= data corruption)? For example what if the pool hasn't been exported on the MASTER before it dies?
>>>>>>>> - Is it a problem if the NFS daemons are started at boot on the standby node, or should they only be started in the failover script? What about stale files and active connections on the clients?
>>>>>>>
>>>>>>> sometimes stale mounts recover, sometimes not, sometimes clients even need reboots
>>>>>>>
>>>>>>>> - A catastrophic power failure occurs and MASTER and BACKUP are suddenly powered down. Later the power returns, is it possible that some problems occur (split-brain scenario?) regarding the order in which the
>>>>>>>
>>>>>>> sure, you need an exact procedure to recover
>>>>>>>
>>>>>>>> two machines boot up?
>>>>>>>
>>>>>>> best practice should be to keep everything down after boot
>>>>>>>
>>>>>>>> - Other things I have not thought of?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Julien
>>>>>>>
>>>>>>> imho:
>>>>>>>
>>>>>>> leave hast where it is, go for zfs replication. will save your butt, sooner or later, if you avoid this fragile combination
>>>>>>
>>>>>> I was also replying, and finishing by this :
>>>>>> Why don't you set your slave as an iSCSI target and simply do ZFS mirroring?
>>>>>
>>>>> Yes that's another option, so a zpool with two mirrors (local + exported iSCSI)?
>>>>
>>>> Yes, you would then have a real time replication solution (as HAST), compared to ZFS send/receive which is not.
>>>> Depends on what you need :)
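To illustrate the difference, a delayed send/receive cycle would look more or less like this (just a sketch: the pool name "tank", the snapshot names, the destination pool "backup" and the host "backuphost" are placeholders, not something defined in this thread):

  # on the primary node, e.g. from cron, at whatever delay you can afford to lose
  zfs snapshot -r tank@2016-07-01
  zfs send -R -i tank@2016-06-30 tank@2016-07-01 | ssh backuphost zfs receive -d -u -F backup
  # (the very first run is a full send, without -i)

Replication then happens at snapshot granularity instead of per write, which is exactly the "not real time" trade-off mentioned above.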
>>>
>>> More a real time replication solution in fact ... :)
>>> Do you have any resource which summarizes all the pros and cons of HAST vs iSCSI? I have found a lot of articles on ZFS + HAST but not that much on ZFS + iSCSI ..
>>
>> # No resources, but some ideas :
>>
>> - ZFS likes to see all the details of its underlying disks, which is possible with local disks (of course) and iSCSI disks, not with HAST.
>> - The iSCSI solution is simpler, you only have ZFS to manage, your replication is made by ZFS itself, not by an additional stack.
>> - HAST does not seem to be really maintained (I may be wrong), at least compared to DRBD, which HAST seems to be inspired by.
>> - You do not have to cross your fingers when you promote your slave to master ("will ZFS be happy with my HAST-replicated disks?"), ZFS mirrors the data by itself, you only have to import [-f].
>>
>> - (auto)reconnection of iSCSI may not be as simple as with HAST, iSCSI could require more administration after a disconnection. But this could easily be done by a script.
>>
>> # Some advice based on my findings (I'm finishing my tests of such a solution) :
>>
>> Write performance will suffer from network latency, but while your 2 nodes are in the same room, that should be OK.
>> If you are over a long-distance link, you may add several ms to each write IO, which, depending on the use case, may be unacceptable; ZFS may also become unresponsive.
>> Max throughput is also more difficult to achieve over a high-latency link.
>>
>> You will have to choose network cards depending on the number of disks and their throughput.
>> For example, if you need to resilver a SATA disk (180MB/s), then a simple 1Gb/s interface (~120MB/s) will be a serious bottleneck.
>> Think about scrub too.
>>
>> You will probably have to perform some network tuning (TCP window size, jumbo frames...) to reach your max bandwidth.
>> Trying to saturate the network link with (for example) iperf before dealing with iSCSI seems to be a good thing.
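For example, on a dedicated replication link such as em1 this kind of tuning and testing could look like the following (a rough sketch: the MTU, the buffer sizes and the 172.16.0.2 address are assumptions to adapt to your own hardware and network):

  # jumbo frames on the replication interface (both nodes, and any switch in between, must agree)
  ifconfig em1 mtu 9000
  # allow larger TCP windows, useful as soon as there is some latency
  sysctl kern.ipc.maxsockbuf=16777216
  sysctl net.inet.tcp.sendbuf_max=16777216
  sysctl net.inet.tcp.recvbuf_max=16777216
  # measure raw TCP throughput before putting iSCSI on top
  iperf -s                      # on the slave
  iperf -c 172.16.0.2 -t 30     # on the master, towards the slave's replication IP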
>>
>> Here are some interesting sysctls so that ZFS will not hang (too long) in case of an unreachable iSCSI disk :
>> kern.iscsi.ping_timeout=5
>> kern.iscsi.iscsid_timeout=5
>> kern.iscsi.login_timeout=5
>> kern.iscsi.fail_on_disconnection=1
>> (adjust the 5 seconds depending on your needs / on your network quality).
>>
>> Take care when you (auto)replace disks, you may replace an iSCSI disk with a local disk, which of course would work but would be wrong in terms of master/slave redundancy.
>> Use nice labels on your disks so that, if you have a lot of disks in your pool, you quickly know which one is local and which one is remote.
>>
>> # send/receive pro(s) :
>>
>> In terms of data safety, one of the interests of ZFS send/receive is that you have a totally different target pool, which can be interesting if ever you have a disaster with your primary pool.
>> As a 3rd node solution? On another site? (as send/receive does not suffer from latency as iSCSI would)
>
> Thank you very much for all this advice, it is much appreciated!
>
> I'll definitely go with iSCSI (for which I don't have that much
> experience) over HAST.
>
> Maybe a stupid question but, assuming on the MASTER ada{0,1} are the
> local disks and da{0,1} are the exported iSCSI disks from the SLAVE, would
> you go with:
>
> $> zpool create storage mirror /dev/ada0s1 /dev/ada1s1 mirror /dev/da0 /dev/da1

No, if you lose connection with the slave node, your pool will go offline!

> or rather:
>
> $> zpool create storage mirror /dev/ada0s1 /dev/da0 mirror /dev/ada1s1 /dev/da1

Yes, each master disk is mirrored with a slave disk.
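To make this layout concrete, here is a minimal sketch of how the SLAVE could export its two disks with ctld and how the MASTER could then consume them (the IQNs, the 172.16.0.2 portal address, the /dev/ada?s1 paths and the resulting /dev/da? names are only examples, and authentication is omitted for brevity):

On the SLAVE, /etc/ctl.conf :

  portal-group pg0 {
          discovery-auth-group no-authentication
          listen 172.16.0.2
  }
  target iqn.2016-07.city.perdition:disk0 {
          auth-group no-authentication
          portal-group pg0
          lun 0 {
                  path /dev/ada0s1
          }
  }
  target iqn.2016-07.city.perdition:disk1 {
          auth-group no-authentication
          portal-group pg0
          lun 0 {
                  path /dev/ada1s1
          }
  }

then :

  sysrc ctld_enable=YES
  service ctld start

On the MASTER :

  sysrc iscsid_enable=YES
  service iscsid start
  iscsictl -A -p 172.16.0.2 -t iqn.2016-07.city.perdition:disk0
  iscsictl -A -p 172.16.0.2 -t iqn.2016-07.city.perdition:disk1
  # the two LUNs typically show up as /dev/da0 and /dev/da1
  zpool create storage mirror /dev/ada0s1 /dev/da0 mirror /dev/ada1s1 /dev/da1

With this layout each mirror keeps one local and one remote disk: if the slave disappears the pool stays online (degraded) on the master, and a failing disk is reported by ZFS itself immediately.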
> I guess the former is better, but it's just to be sure .. (or maybe it's
> better to iSCSI export a ZVOL from the SLAVE?)
>
> Correct me if I'm wrong but, from a safety point of view this setup is
> also the safest as you'll get the "fullsync" equivalent mode of HAST
> (but it's also the slowest), so I can be 99.99% confident that the
> pool on the SLAVE will never be corrupted, even in the case where the
> MASTER suddenly dies (power outage, etc), and that a zpool import -f
> storage will always work?

The pool on the slave is the same as the pool on the master, as it uses the same disks :)
Only the physical host changes. So yes, you can be confident.
There is still the case where any ZFS pool could be totally damaged (due to a bug for example). It "should" not happen, but we never know :)
This is why I was talking about a third node / second pool made from a delayed send/receive.

> One last thing: this "storage" pool will be exported through NFS to the
> clients, and when a failover occurs they should, in theory, not notice
> it. I know that it's pretty hypothetical but I wondered if pfsync could
> play a role in this area (active connections)..?

There will certainly be some small timeouts due to the failover delay.
You should make some tests to analyze NFS behaviour depending on the failover delay.
Good question regarding pfsync, I'm not so familiar with it :)

Of course, make a good POC before going with this into production.
Don't forget to test scrub, resilver, power failure, network failure...

And perhaps others will have additional comments / ideas / reservations on this topic?

> Thanks!
> Julien
>
>>
>>>>>> ZFS would then know as soon as a disk is failing.
>>>>>> And if the master fails, you only have to import (-f certainly, in case of a master power failure) on the slave.
>>>>>>
>>>>>> Ben
>> _______________________________________________
>> freebsd-fs@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
>> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
>
> --
> Julien Cigar
> Belgian Biodiversity Platform (http://www.biodiversity.be)
> PGP fingerprint: EEF9 F697 4B68 D275 7B11 6A25 B2BB 3710 A204 23C0
> No trees were killed in the creation of this message.
> However, many electrons were terribly inconvenienced.