From: Michael DeMan <freebsd@deman.com>
Subject: Re: FreeBSD & no single point of failure file service
Date: Sat, 16 Mar 2013 18:00:03 -0700
To: J David, freebsd-fs@freebsd.org

Errata...

--- by 'out of band', for my case, simply another ethernet link that by
convention is physically separate from the 'primary' storage ethernet.
Good enough for my use case.

--- on (F) below - I meant 'if neither head unit can decide whether it
should be the master or not' - then they both deny services.  Better
that bugs cause outages than data loss?

- Mike

On Mar 16, 2013, at 5:48 PM, Michael DeMan <freebsd@deman.com> wrote:

> Hi David,
> 
> We are looking at the exact same thing - let me know what you find out.
> 
> I think it is pretty obvious that ixsystems.com has this figured out,
> along with all the tricky details - but for the particular company I
> am looking to implement this for, vendors that won't show prices for
> their products are vendors we have to stay away from, since not
> showing pricing usually means it starts at $100K minimum plus giant
> annual support fees.  In all honesty some kind of 3rd-party-designed
> solution with only minimal support would be fine for us, but I don't
> think that is their regular market.
> 
> I was thinking to maybe test something out like:
> 
> #1. A couple of old Dell 2970 head units with LSI cards.
> #2. One dual-port SAS chassis.
> #3. Figure out what needs to happen with devd+carp in order for the
> head end units to RELIABLY know when to export/import ZFS and when to
> advertise NFS/iSCSI, etc.  (Rough devd hook sketched below.)
> 
> One catch with this, of course, is that for #3 there could be some
> kind of unexpected heartbeat failure between the two head end units
> where each decides the other is gone and both become masters - which
> would probably result in catastrophic corruption of the file system.
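> 
> As a strawman for the devd side, something like this (untested, and
> the event names depend on the CARP flavor - the old carp(4) interfaces
> show up as link state changes on carpN, while the newer vhid-based
> CARP reports system "CARP" with type MASTER/BACKUP; the takeover
> script name is made up):
> 
>     # /etc/devd/failover.conf - untested sketch, old-style carp(4)
>     notify 10 {
>         match "system"    "IFNET";
>         match "subsystem" "carp0";
>         match "type"      "LINK_UP";
>         # we just became CARP master on the storage link; decide
>         # whether it is safe to import the pool and start serving
>         action "/usr/local/sbin/takeover.sh";
>     };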
> 
> SuperMicro does have that one chassis that accepts lots of drives and
> two custom motherboards that are linked internally via 10GbE - I think
> ixsystems uses that.  So in theory the edge case of the accidental
> 'master/master' configuration is helped by the hardware.  By the same
> token, I am skeptical of having both head end units in a single
> chassis.  Pardon me for being paranoid.
> 
> So the conclusion I came to on #3 for a home-brew design was that
> devd+carp is great overall, but there needs to be an additional
> out-of-band confirmation between the two head end units.
> 
> 
> Scenario is:
> 
> #1-#2 above.
> 
> The head units are wired up such that they are providing storage and
> also running (hsrp/carp/vrrp) on the main link that they vend their
> storage resources from to the network.
> 
> They are also connected via another channel - this could be an x-over
> ethernet link or a serial cable - or, in my case, simply re-using the
> dedicated ethernet port that is used for management-only access to the
> servers and is already out of band.
> 
> If a network engineer comes along and tweaks the NFS/iSCSI switches or
> something else, makes a mistake, and the link between the two head end
> units is broken - both machines are going to want to be masters, and
> write directly to whatever shared physical storage they have?
> 
> This is where the additional link between the head units comes in.
> The storage delivery side of things has 'split brain' - the head end
> units cannot talk to each other, but may be able to talk to some (or
> all) clients that use their services.  With the current design (ZFS
> v28) there can be only one master using the physically attached
> storage from the head ends - otherwise a small problem that would have
> been better handled by just having an outage turns into a potential
> loss of all the data everywhere?
> 
> So basically failover between the head units works as follows (rough
> script after this list):
> 
> A) I am the secondary on the big storage ethernet link and the primary
> has timed out on telling me it is still alive.
> B) Confirm on the out-of-band link whether the primary is still up or
> not, and what it thinks the state of affairs may be.  (Optimize by
> starting this check the first time a primary heartbeat is missed - not
> after the full timeout?)
> C) If the primary thinks it has lost connectivity to the clients, then
> confirm it is also no longer acting as primary for the physical
> storage, after which I should attach the storage and try to become the
> primary.
> D) ??? If the primary thinks it can still connect to the clients, then
> what?
> E) From (C) above - let's be sure to avoid a flapping situation.
> F) If neither head end unit can decide which one should be the
> 'master' (vending NFS/iSCSI and also handling the physical storage),
> then both units should deny services?
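> 
> A minimal sketch of what the takeover script might look like
> (untested; PEER_OOB and the pool name are placeholders, and the iSCSI
> target service would be istgt or ctld depending on what you run):
> 
>     #!/bin/sh
>     # /usr/local/sbin/takeover.sh - untested sketch, run by devd when
>     # the storage-side heartbeat from the primary is lost.
>     PEER_OOB="192.0.2.1"  # peer head unit on the management-only link
>     POOL="tank"
>     
>     # (B) Confirm over the out-of-band link that the primary is gone.
>     if ping -c 3 -t 5 "$PEER_OOB" > /dev/null 2>&1; then
>         # (D/F) Peer still answers out of band, so the heartbeat loss
>         # is ambiguous - deny service rather than risk two masters.
>         logger "failover: peer alive on OOB link, not taking over"
>         exit 1
>     fi
>     
>     # (C) Primary unreachable on both links: take the shared storage.
>     # -f because the pool was last imported under the peer's hostid.
>     if zpool import -f "$POOL"; then
>         service nfsd onestart
>         logger "failover: imported $POOL, now acting as master"
>     else
>         logger "failover: zpool import failed, staying down"
>         exit 1
>     fi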
> 
> Longer e-mail than I expected.  Thanks for the post - it made me think
> about things.  Probably there are huge problems in my synopsis above.
> The hard work is always in the details, not the design?
> 
> - Mike
> 
> 
> On Mar 9, 2013, at 3:40 PM, J David wrote:
> 
>> Hello,
>> 
>> I would like to build a file server with no single point of failure,
>> and I would like to use FreeBSD and ZFS to do it.
>> 
>> The hardware configuration we're looking at would be two servers with
>> 4x SAS connectors and two SAS JBOD shelves.  Both servers would have
>> dual connections to each shelf.
>> 
>> The disks would be configured in mirrored pairs, with one disk from
>> each pair in each shelf.  One pair for ZIL, one or two pairs for
>> L2ARC, and the rest for ZFS data.
>> 
>> We would be shooting for an active/standby configuration where the
>> standby system is booted up but doesn't touch the bus unless/until it
>> detects CARP failover from the master via devd, then it does a zpool
>> import.  (Even so, all TCP sessions for NFS and iSCSI will get reset,
>> which seems unavoidable but recoverable.)
>> 
>> This will be really expensive to test, so I would be very interested
>> if anyone has feedback on how FreeBSD will handle this type of
>> shared-SAS hardware configuration.
>> 
>> Thanks for any advice!
>> _______________________________________________
>> freebsd-fs@freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
>> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"