From: Josh Paetzel
Date: Fri, 30 Oct 2015 08:06:55 -0500
Subject: Re: iSCSI/ZFS strangeness
To: Jan Bramkamp
Cc: freebsd-fs@freebsd.org
Message-Id: <9D4FE448-28EC-45F6-B525-E660E3AF57B0@tcbug.org>
In-Reply-To: <563262C4.1040706@rlwinm.de>
References: <20151029015721.GA95057@mail.michaelwlucas.com> <563262C4.1040706@rlwinm.de>

> On Oct 29, 2015, at 1:17 PM, Jan Bramkamp wrote:
>
>> On 29/10/15 02:57, Michael W. Lucas wrote:
>> The initiators can both access the iSCSI-based pool--not
>> simultaneously, of course. But CARP, devd, and some shell scripting
>> should get me a highly available pool that can withstand the demise of
>> any one iSCSI server and any one initiator.
>>
>> The hope is that the pool would continue to work even if an iSCSI host
>> shuts down. When the downed iSCSI host returns, the initiators should
>> log back in and the pool auto-resilver.
>
> I would recommend against using CARP for this because CARP is prone to split-brain situations and in this case they could destroy your whole storage pool. If the current head node fails the replacement has to `zpool import -f` the pool and in the case of a split-brain situation both head nodes would continue writing to the iSCSI targets.
>
> I would move the leader election to an external service like consul, etcd or zookeeper. This is one case where the added complexity is worth it. If you can't run an external service for this e.g. it would exceed the scope of the chapter you're writing please simplify the setup with more reliable hardware, good monitoring and manual failover for maintenance. CARP isn't designed to implement reliable (enough) master election for your storage cluster.
>
> Adding iSCSI to your storage stack adds complexity and overhead. For setups which still fit inside a single rack SAS (with geom_multipath) is normally faster and cheaper.
> On the other hand you can't spread out SAS storage far enough to implement disaster tolerance should you really need it and it certainly is an setup.

I'll impart some wisdom here.

1) HA with two nodes is impossible to do right. You need a third system to achieve quorum.

2) You can do SAS over optical these days. Perfect for having mirrored JBODs in different fire suppression zones of a datacenter.

3) I've seen a LOT of "cobbled together with shell script" HA rigs. They mostly get disabled eventually, once people realize that they go split-brain in the edge cases and destroy the storage. What we did was go passive/passive (both heads stand down rather than risk both writing) and then address each of those cases as "how could we have avoided going passive/passive". It took two years.

4) Leverage mav@'s ALUA support. For block access this will make your life much easier.

5) Give me a call. I type slow and tend to leave things out, but would happily do one or more brain dump sessions.
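
To make points 1 and 3 a bit more concrete, here is a minimal sketch of the "majority or go passive" rule. It is an illustration only, not code from any real HA product: the names `Role`, `has_quorum` and `decide_role` are made up, and `holds_lease` is a hypothetical stand-in for whatever an external election service (consul, etcd, zookeeper) hands to the elected head.

```python
from enum import Enum


class Role(Enum):
    ACTIVE = "active"
    PASSIVE = "passive"


def has_quorum(reachable_voters: int, total_voters: int) -> bool:
    """True when this node can see a strict majority of voters, itself included."""
    return reachable_voters > total_voters // 2


def decide_role(reachable_voters: int, total_voters: int, holds_lease: bool) -> Role:
    """Fail safe: only a head with quorum AND the leader lease may import the pool.

    `holds_lease` is a hypothetical flag standing in for an external election
    service; in any doubtful case the node goes passive.
    """
    if has_quorum(reachable_voters, total_voters) and holds_lease:
        return Role.ACTIVE
    return Role.PASSIVE


# Two heads, no witness: a partition leaves each side with 1 of 2 votes,
# so neither side has a majority and both stand down (passive/passive).
two_node_partition = decide_role(1, 2, holds_lease=True)

# Two heads plus a third voter: the side that still reaches the witness
# has 2 of 3 votes and may stay active; the isolated side must go passive.
majority_side = decide_role(2, 3, holds_lease=True)
isolated_side = decide_role(1, 3, holds_lease=True)

print(two_node_partition, majority_side, isolated_side)
```

The point of the sketch is the asymmetry: with two voters a partition can never produce a majority, so the only safe outcome is nobody serving the pool, whereas a third voter lets exactly one side keep going.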