From owner-freebsd-fs@freebsd.org Thu Jun 30 21:35:54 2016
Subject: Re: HAST + ZFS + NFS + CARP
From: Ben RUBSON
In-Reply-To: <20160630163541.GC5695@mordor.lan>
Date: Thu, 30 Jun 2016 23:35:49 +0200
Message-Id: <50BF1AEF-3ECC-4C30-B8E1-678E02735BB5@gmail.com>
References: <20160630144546.GB99997@mordor.lan> <71b8da1e-acb2-9d4e-5d11-20695aa5274a@internetx.com> <20160630153747.GB5695@mordor.lan> <63C07474-BDD5-42AA-BF4A-85A0E04D3CC2@gmail.com> <20160630163541.GC5695@mordor.lan>
To: freebsd-fs@freebsd.org

> On 30 Jun 2016, at 18:35, Julien Cigar wrote:
> 
> On Thu, Jun 30, 2016 at 05:42:04PM +0200, Ben RUBSON wrote:
>> 
>>> On 30 Jun 2016, at 17:37, Julien Cigar wrote:
>>> 
>>>> On Thu, Jun 30, 2016 at 05:28:41PM +0200, Ben RUBSON wrote:
>>>> 
>>>>> On 30 Jun 2016, at 17:14, InterNetX - Juergen Gotteswinter wrote:
>>>>> 
>>>>>> On 30.06.2016 at 16:45, Julien Cigar wrote:
>>>>>> Hello,
>>>>>> 
>>>>>> I'm still in the process of setting up a redundant low-cost storage for
>>>>>> our (small, ~30 people) team here.
>>>>>> 
>>>>>> I read quite a lot of articles/documentation/etc and I plan to use HAST
>>>>>> with ZFS for the storage, CARP for the failover and the "good old NFS"
>>>>>> to mount the shares on the clients.
>>>>>> 
>>>>>> The hardware is 2x HP ProLiant DL20 boxes with 2 dedicated disks for the
>>>>>> shared storage.
>>>>>> 
>>>>>> Assuming the following configuration:
>>>>>> - MASTER is the active node and BACKUP is the standby node.
>>>>>> - two disks in each machine: ada0 and ada1.
>>>>>> - two interfaces in each machine: em0 and em1
>>>>>> - em0 is the primary interface (with CARP set up)
>>>>>> - em1 is dedicated to the HAST traffic (crossover cable)
>>>>>> - FreeBSD is properly installed on each machine.
>>>>>> - a HAST resource "disk0" for ada0p2.
>>>>>> - a HAST resource "disk1" for ada1p2.
>>>>>> - a "zpool create zhast mirror /dev/hast/disk0 /dev/hast/disk1" is created
>>>>>> on MASTER
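
[For readers following the thread, a minimal /etc/hast.conf matching the
layout described above could look like the sketch below; "master" / "backup"
and the em1 addresses are only placeholders for the actual hostnames and
crossover-link IPs:]

resource disk0 {
    on master {
        local /dev/ada0p2
        remote 172.16.0.2
    }
    on backup {
        local /dev/ada0p2
        remote 172.16.0.1
    }
}

resource disk1 {
    on master {
        local /dev/ada1p2
        remote 172.16.0.2
    }
    on backup {
        local /dev/ada1p2
        remote 172.16.0.1
    }
}

[Each resource then has to be initialized with "hastctl create disk0" /
"hastctl create disk1" on both nodes before the zpool is created on
/dev/hast/*.]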
>>>>>> 
>>>>>> A couple of questions I am still wondering about:
>>>>>> - If a disk dies on the MASTER I guess that zpool will not see it and
>>>>>> will transparently use the one on BACKUP through the HAST resource..
>>>>> 
>>>>> that's right, as long as writes on $anything have been successful HAST is
>>>>> happy and won't start whining
>>>>> 
>>>>>> is it a problem?
>>>>> 
>>>>> imho yes, at least from a management point of view
>>>>> 
>>>>>> could this lead to some corruption?
>>>>> 
>>>>> probably, I never heard about anyone who has used that for a long time in
>>>>> production
>>>>> 
>>>>>> At this stage the
>>>>>> common sense would be to replace the disk quickly, but imagine the
>>>>>> worst case scenario where ada1 on MASTER dies, zpool will not see it
>>>>>> and will transparently use the one from the BACKUP node (through the
>>>>>> "disk1" HAST resource), later ada0 on MASTER dies, zpool will not
>>>>>> see it and will transparently use the one from the BACKUP node
>>>>>> (through the "disk0" HAST resource). At this point on MASTER the two
>>>>>> disks are broken but the pool is still considered healthy ... What if
>>>>>> after that we unplug the em0 network cable on BACKUP? Storage is
>>>>>> down..
>>>>>> - Under heavy I/O the MASTER box suddenly dies (for some reason),
>>>>>> thanks to CARP the BACKUP node will switch from standby -> active and
>>>>>> execute the failover script which does some "hastctl role primary" for
>>>>>> the resources and a zpool import. I wondered if there are any
>>>>>> situations where the pool couldn't be imported (= data corruption)?
>>>>>> For example what if the pool hasn't been exported on the MASTER before
>>>>>> it dies?
>>>>>> - Is it a problem if the NFS daemons are started at boot on the standby
>>>>>> node, or should they only be started in the failover script? What
>>>>>> about stale files and active connections on the clients?
>>>>> 
>>>>> sometimes stale mounts recover, sometimes not, sometimes clients even
>>>>> need reboots
>>>>> 
>>>>>> - A catastrophic power failure occurs and MASTER and BACKUP are suddenly
>>>>>> powered down. Later the power returns, is it possible that some
>>>>>> problem occurs (split-brain scenario?) regarding the order in which the
>>>>> 
>>>>> sure, you need an exact procedure to recover
>>>>> 
>>>>>> two machines boot up?
>>>>> 
>>>>> best practice should be to keep everything down after boot
>>>>> 
>>>>>> - Other things I have not thought of?
>>>>>> 
>>>>>> Thanks!
>>>>>> Julien
>>>>> 
>>>>> imho:
>>>>> 
>>>>> leave HAST where it is, go for ZFS replication. will save your butt,
>>>>> sooner or later, if you avoid this fragile combination
>>>> 
>>>> I was also replying, and finishing with this:
>>>> Why don't you set your slave as an iSCSI target and simply do ZFS mirroring?
>>> 
>>> Yes that's another option, so a zpool with two mirrors (local +
>>> exported iSCSI)?
>> 
>> Yes, you would then have a real time replication solution (as with HAST),
>> compared to ZFS send/receive which is not.
>> Depends on what you need :)
> 
> More a real time replication solution in fact ... :)
> Do you have any resource which summarizes all the pro(s) and con(s) of HAST
> vs iSCSI? I have found a lot of articles on ZFS + HAST but not that much
> with ZFS + iSCSI ..

# No resources, but some ideas :

- ZFS likes to see all the details of its underlying disks, which is possible
with local disks (of course) and with iSCSI disks, not with HAST.
- The iSCSI solution is simpler, you only have ZFS to manage, your replication
is made by ZFS itself, not by an additional stack.
- HAST does not seem to be really maintained (I may be wrong), at least
compared to DRBD, which HAST seems to be inspired from.
- You do not have to cross your fingers when you promote your slave to master
("will ZFS be happy with my HAST-replicated disks?"), ZFS mirrored the data by
itself, you only have to import [-f].
- (Auto)reconnection of iSCSI may not be as simple as with HAST, iSCSI may
require more administration after a disconnection. But this can easily be done
by a script.
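
To make the iSCSI option more concrete, a minimal setup could look roughly
like this (IQNs, IP addresses, pool and device names below are only examples,
adjust them to your own layout).

On the BACKUP node, export the two disks with ctld, /etc/ctl.conf :

portal-group pg0 {
    discovery-auth-group no-authentication
    listen 172.16.0.2
}

target iqn.2016-06.lan.backup:disk0 {
    auth-group no-authentication
    portal-group pg0
    lun 0 {
        path /dev/ada0p2
    }
}

target iqn.2016-06.lan.backup:disk1 {
    auth-group no-authentication
    portal-group pg0
    lun 0 {
        path /dev/ada1p2
    }
}

# sysrc ctld_enable=YES && service ctld start

On the MASTER node, attach the targets and build the pool (the iSCSI LUNs
show up as new da devices) :

# sysrc iscsid_enable=YES && service iscsid start
# iscsictl -A -p 172.16.0.2 -t iqn.2016-06.lan.backup:disk0
# iscsictl -A -p 172.16.0.2 -t iqn.2016-06.lan.backup:disk1
# zpool create zhast mirror ada0p2 da0 mirror ada1p2 da1

(Declare the targets in /etc/iscsi.conf too if you want them reattached at
boot.)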
# Some advice, based on my findings (I'm finishing my tests of such a
solution) :

Write performance will suffer from network latency, but as long as your 2
nodes are in the same room, that should be OK.
If you are over a long-distance link, you may add several ms to each write IO,
which, depending on the use case, may be a problem, and ZFS may also become
unresponsive.
Max throughput is also more difficult to achieve over a high-latency link.

You will have to choose network cards depending on the number of disks and
their throughput.
For example, if you need to resilver a SATA disk (180MB/s), then a simple
1Gb interface (120MB/s) will be a serious bottleneck.
Think about scrubs too.

You may have to perform some network tuning (TCP window size, jumbo frames...)
to reach your max bandwidth.
Trying to saturate the network link with (for example) iPerf before dealing
with iSCSI seems to be a good idea.

Here are some interesting sysctls so that ZFS will not hang (too long) in
case of an unreachable iSCSI disk :
kern.iscsi.ping_timeout=5
kern.iscsi.iscsid_timeout=5
kern.iscsi.login_timeout=5
kern.iscsi.fail_on_disconnection=1
(adjust the 5 seconds depending on your needs / on your network quality).

Take care when you (auto)replace disks, you may replace an iSCSI disk with a
local disk, which of course would work but would be wrong in terms of
master/slave redundancy.
Use nice labels on your disks so that if you have a lot of disks in your pool,
you quickly know which one is local and which one is remote.

# send/receive pro(s) :

In terms of data safety, one of the interests of ZFS send/receive is that you
have a totally different target pool, which can be interesting if you ever
have a disaster with your primary pool.
As a 3rd-node solution? On another site? (as send/receive does not suffer
from latency the way iSCSI would)

>>>> ZFS would then know as soon as a disk is failing.
>>>> And if the master fails, you only have to import (-f certainly, in case
>>>> of a master power failure) on the slave.
>>>> 
>>>> Ben
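
PS : to illustrate the 3rd-node / send-receive idea above, a minimal
replication cycle could be as simple as this (pool, snapshot and host names
are only examples) :

# zfs snapshot -r zhast@repl-1
# zfs send -R zhast@repl-1 | ssh thirdnode zfs receive -duF backuppool
(and then, periodically, incremental updates :)
# zfs snapshot -r zhast@repl-2
# zfs send -R -i @repl-1 zhast@repl-2 | ssh thirdnode zfs receive -duF backuppool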