Date: Thu, 30 Jun 2016 23:35:49 +0200
From: Ben RUBSON <ben.rubson@gmail.com>
To: freebsd-fs@freebsd.org
Subject: Re: HAST + ZFS + NFS + CARP
Message-ID: <50BF1AEF-3ECC-4C30-B8E1-678E02735BB5@gmail.com>
In-Reply-To: <20160630163541.GC5695@mordor.lan>
References: <20160630144546.GB99997@mordor.lan> <71b8da1e-acb2-9d4e-5d11-20695aa5274a@internetx.com> <AD42D8FD-D07B-454E-B79D-028C1EC57381@gmail.com> <20160630153747.GB5695@mordor.lan> <63C07474-BDD5-42AA-BF4A-85A0E04D3CC2@gmail.com> <20160630163541.GC5695@mordor.lan>
> On 30 Jun 2016, at 18:35, Julien Cigar <julien@perdition.city> wrote:
> 
> On Thu, Jun 30, 2016 at 05:42:04PM +0200, Ben RUBSON wrote:
>> 
>>> On 30 Jun 2016, at 17:37, Julien Cigar <julien@perdition.city> wrote:
>>> 
>>>> On Thu, Jun 30, 2016 at 05:28:41PM +0200, Ben RUBSON wrote:
>>>> 
>>>>> On 30 Jun 2016, at 17:14, InterNetX - Juergen Gotteswinter <jg@internetx.com> wrote:
>>>>> 
>>>>>> Am 30.06.2016 um 16:45 schrieb Julien Cigar:
>>>>>> Hello,
>>>>>> 
>>>>>> I'm still in the process of setting up redundant low-cost storage for
>>>>>> our (small, ~30 people) team here.
>>>>>> 
>>>>>> I read quite a lot of articles/documentation/etc. and I plan to use HAST
>>>>>> with ZFS for the storage, CARP for the failover and the "good old NFS"
>>>>>> to mount the shares on the clients.
>>>>>> 
>>>>>> The hardware is 2x HP ProLiant DL20 boxes with 2 dedicated disks for the
>>>>>> shared storage.
>>>>>> 
>>>>>> Assuming the following configuration:
>>>>>> - MASTER is the active node and BACKUP is the standby node.
>>>>>> - two disks in each machine: ada0 and ada1.
>>>>>> - two interfaces in each machine: em0 and em1
>>>>>> - em0 is the primary interface (with CARP set up)
>>>>>> - em1 is dedicated to the HAST traffic (crossover cable)
>>>>>> - FreeBSD is properly installed on each machine.
>>>>>> - a HAST resource "disk0" for ada0p2.
>>>>>> - a HAST resource "disk1" for ada1p2.
>>>>>> - a zpool create zhast mirror /dev/hast/disk0 /dev/hast/disk1 is created
>>>>>> on MASTER
>>>>>> 
>>>>>> A couple of questions I am still wondering about:
>>>>>> - If a disk dies on the MASTER I guess that zpool will not see it and
>>>>>> will transparently use the one on BACKUP through the HAST resource..
>>>>> 
>>>>> that's right, as long as writes on $anything have been successful hast is
>>>>> happy and won't start whining
>>>>> 
>>>>>> is it a problem?
>>>>> 
>>>>> imho yes, at least from a management point of view
>>>>> 
>>>>>> could this lead to some corruption?
>>>>> 
>>>>> probably, i never heard about anyone who used that for a long time in
>>>>> production
>>>>> 
>>>>>> At this stage the
>>>>>> common sense would be to replace the disk quickly, but imagine the
>>>>>> worst case scenario where ada1 on MASTER dies, zpool will not see it
>>>>>> and will transparently use the one from the BACKUP node (through the
>>>>>> "disk1" HAST resource), later ada0 on MASTER dies, zpool will not
>>>>>> see it and will transparently use the one from the BACKUP node
>>>>>> (through the "disk0" HAST resource). At this point on MASTER the two
>>>>>> disks are broken but the pool is still considered healthy ... What if
>>>>>> after that we unplug the em0 network cable on BACKUP? Storage is
>>>>>> down..
>>>>>> - Under heavy I/O the MASTER box suddenly dies (for some reason),
>>>>>> thanks to CARP the BACKUP node will switch from standby -> active and
>>>>>> execute the failover script which does some "hastctl role primary" for
>>>>>> the resources and a zpool import. I wondered if there are any
>>>>>> situations where the pool couldn't be imported (= data corruption)?
>>>>>> For example what if the pool hasn't been exported on the MASTER before
>>>>>> it dies?
>>>>>> - Is it a problem if the NFS daemons are started at boot on the standby
>>>>>> node, or should they only be started in the failover script? What
>>>>>> about stale files and active connections on the clients?
>>>>> 
>>>>> sometimes stale mounts recover, sometimes not, sometimes clients even
>>>>> need reboots
>>>>> 
>>>>>> - A catastrophic power failure occurs and MASTER and BACKUP are suddenly
>>>>>> powered down. Later the power returns, is it possible that some
>>>>>> problem occurs (split-brain scenario?) regarding the order in which the
>>>>> 
>>>>> sure, you need an exact procedure to recover
>>>>> 
>>>>>> two machines boot up?
>>>>> 
>>>>> best practice should be to keep everything down after boot
>>>>> 
>>>>>> - Other things I have not thought of?
>>>>>> 
>>>>>> Thanks!
>>>>>> Julien
>>>>> 
>>>>> imho:
>>>>> 
>>>>> leave hast where it is, go for zfs replication. will save your butt,
>>>>> sooner or later, if you avoid this fragile combination
>>>> 
>>>> I was also replying, and finishing with this:
>>>> Why don't you set your slave up as an iSCSI target and simply do ZFS mirroring?
>>> 
>>> Yes, that's another option, so a zpool with two mirrors (local +
>>> exported iSCSI)?
>> 
>> Yes, you would then have a real-time replication solution (as with HAST),
>> compared to ZFS send/receive which is not.
>> Depends on what you need :)
> 
> More a real-time replication solution in fact ... :)
> Do you have any resource which sums up all the pro(s) and con(s) of HAST
> vs iSCSI? I have found a lot of articles on ZFS + HAST but not that many
> with ZFS + iSCSI ..

# No resources, but some ideas :

- ZFS likes to see all the details of its underlying disks, which is possible
with local disks (of course) and with iSCSI disks, not with HAST.
- The iSCSI solution is simpler, you only have ZFS to manage, your replication
is done by ZFS itself, not by an additional stack.
- HAST does not seem to be really maintained (I may be wrong), at least
compared to DRBD, which HAST seems to be inspired by.
- You do not have to cross your fingers when you promote your slave to master
("will ZFS be happy with my HAST-replicated disks?"), ZFS mirrored the data
by itself, you only have to import [-f].
- (Auto)reconnection of iSCSI may not be as simple as with HAST, iSCSI could
require more administration after a disconnection. But this could easily be
done by a script.

# Some "advice" based on my findings (I'm finishing my tests of such a solution) :

Write performance will suffer from network latency, but as long as your 2
nodes are in the same room, that should be OK.
If you are over a long-distance link, you may add several ms to each write IO,
which, depending on the use case, may be a problem, and ZFS may also become
unresponsive.
Max throughput is also more difficult to achieve over a high-latency link.

You will have to choose network cards depending on the number of disks and
their throughput.
For example, if you need to resilver a SATA disk (180MB/s), then a simple
1Gb interface (120MB/s) will be a serious bottleneck. Think about scrubs too.

You may have to perform some network tuning (TCP window size, jumbo frames...)
to reach your maximum bandwidth.
Trying to saturate the network link with (for example) iPerf before dealing
with iSCSI seems to be a good idea.

Here are some interesting sysctls so that ZFS will not hang (too long) in
case of an unreachable iSCSI disk :
kern.iscsi.ping_timeout=5
kern.iscsi.iscsid_timeout=5
kern.iscsi.login_timeout=5
kern.iscsi.fail_on_disconnection=1
(adjust the 5 seconds depending on your needs / on your network quality).
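
To make this a bit more concrete, here is roughly what such a setup could
look like with the FreeBSD native ctld / iscsid stack. The IP address, IQNs,
device names and pool name below are only examples (not a tested recipe),
adjust them to your boxes.

On BACKUP, /etc/ctl.conf could export the two data partitions :

portal-group pg0 {
        discovery-auth-group no-authentication
        # em1, the dedicated crossover link
        listen 192.168.100.2
}
target iqn.2016-06.lan.backup:disk0 {
        auth-group no-authentication
        portal-group pg0
        lun 0 {
                path /dev/ada0p2
        }
}
target iqn.2016-06.lan.backup:disk1 {
        auth-group no-authentication
        portal-group pg0
        lun 0 {
                path /dev/ada1p2
        }
}

then :

sysrc ctld_enable=YES && service ctld start

On MASTER, attach the two LUNs with the in-kernel initiator :

sysrc iscsid_enable=YES && service iscsid start
iscsictl -A -p 192.168.100.2 -t iqn.2016-06.lan.backup:disk0
iscsictl -A -p 192.168.100.2 -t iqn.2016-06.lan.backup:disk1

The remote LUNs show up as new da(4) devices (say da0 and da1), so the pool
becomes two mirrors, each local partition paired with its remote counterpart :

zpool create tank mirror ada0p2 da0 mirror ada1p2 da1

You would probably also want to declare the two targets in /etc/iscsi.conf so
that they can be re-attached automatically. With such a layout each vdev keeps
one replica on each node, so either box can run the pool alone (degraded) if
the other one dies.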
Take care when you (auto)replace disks: you might replace an iSCSI disk with
a local disk, which of course would work, but would be wrong in terms of
master/slave redundancy.
Use nice labels on your disks so that if you have a lot of disks in your
pool, you quickly know which one is local and which one is remote.

# send/receive pro(s) :

In terms of data safety, one of the benefits of ZFS send/receive is that you
have a totally different target pool, which can be interesting if you ever
have a disaster with your primary pool.
As a 3rd-node solution? On another site? (as send/receive does not suffer
from latency the way iSCSI would)

>>>> ZFS would then know as soon as a disk is failing.
>>>> And if the master fails, you only have to import (-f certainly, in case
>>>> of a master power failure) on the slave.
>>>> 
>>>> Ben
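
And to be complete on that last point: with the example layout above, the
failover on the slave is then essentially the following (make sure the master
is really dead first, the pool must never be imported on both nodes at the
same time) :

zpool import -f tank    # -f because the dead master could not export the pool
zpool status tank       # should report DEGRADED, running on the slave's
                        # local disks only

Once the old master is back, reverse the iSCSI setup (the old master then
exporting its disks) and ZFS will resilver the missing halves of the mirrors.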