From: J David
Sender: jdavidlists@gmail.com
To: Michael DeMan
Cc: freebsd-fs@freebsd.org
Date: Tue, 19 Mar 2013 20:36:15 -0400
Subject: Re: FreeBSD & no single point of failure file service
In-Reply-To: <6B3D0B04-9DCE-47A4-A582-08DD640E5676@deman.com>

On Sat, Mar 16, 2013 at 8:48 PM, Michael DeMan wrote:

> I was thinking to maybe test something out like:
>
> #1. A couple old Dell 2970s head units with LSI cards.
> #2. One dual-port SAS chassis.
> #3. Figure out what needs to happen with devd+carp in order for the head
> end units to RELIABLY know when to export/import ZFS and when to advertise
> NFS/iSCSI, etc.

I was trying to figure out if it could be tested with a couple of virtual machines pointed at the same shared disk image. :)

> A couple catches with this of course is that for #3 there could be some
> kind of unexpected heartbeat failure between the two head end units where
> they both decide the other is gone and both become masters - which would
> probably result in catastrophic corruption on the file system.

I think you almost need three (or more) participants, rather than two. Then, the participants elect a master, and if you don't have a majority (e.g. two out of three votes), you didn't win. Only two of them need to be connected to the actual disks. The additional voter(s) could be one or more consumers of the filesystem services, which would tend to help keep the available one winning the master role in a split-brain scenario.

That probably needs to be complicated slightly, as the export/import process isn't anything like instant. So if you get that scenario where the master loses connectivity to the clients but not the FS and you still need to promote a new master -- or you want to do manual failover for maintenance reasons -- you do need to make sure "export" finishes before "import" starts. You could wait until you've been master for X seconds before starting your import (where maybe X ~= 30), and the whole world will wait with you.
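To make the majority rule and the "wait X seconds" idea concrete, here is a rough sketch in Python. It is only an illustration under assumptions: the voter names, the `elect_master` and `may_import` helpers, and the 30-second default are all made up for the example, and it says nothing about how votes would actually be collected over the network.

```python
from typing import Dict, Optional

IMPORT_DELAY_SECONDS = 30.0  # the "X ~= 30" from the text; an assumed value


def has_quorum(votes: int, total_voters: int) -> bool:
    """True only for a strict majority of ALL configured voters."""
    return votes > total_voters // 2


def elect_master(votes: Dict[str, str], total_voters: int) -> Optional[str]:
    """votes maps voter name -> the head unit it votes for.

    Voters that cannot be reached simply do not appear in the dict, so an
    isolated head unit can never collect a majority on its own.
    """
    tally: Dict[str, int] = {}
    for candidate in votes.values():
        tally[candidate] = tally.get(candidate, 0) + 1
    for candidate, count in tally.items():
        if has_quorum(count, total_voters):
            return candidate
    return None  # nobody won; nobody touches the pool


def may_import(became_master_at: float, now: float,
               delay: float = IMPORT_DELAY_SECONDS) -> bool:
    """Give the old master time to finish its export before we import."""
    return now - became_master_at >= delay


if __name__ == "__main__":
    # Three voters: two head units plus one client of the file service.
    print(elect_master({"headA": "headA", "client1": "headA"}, 3))  # headA
    # Split brain: each head only hears itself, so neither wins.
    print(elect_master({"headA": "headA", "headB": "headB"}, 3))    # None
```

Counting against the configured number of voters, rather than the number that happened to answer, is what keeps two isolated heads from both winning.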

Another alternative would be some sort of shared permanent storage, like a non-ZFS partition or drive upon which the master writes a timestamp, and the slave reads it. You don't touch the drives until either the timestamp says it's all clear or the timestamp is X seconds old. But then you run into all those goofy shared disk read caching issues, and I'm not at all sure you can peek at one partition of a SAS drive while another partition is mounted on another system. (The alternative being to dedicate two drives to that purpose, and using two drives to share one 512-byte sector sounds terribly wasteful.)

The third possibility would be to do it without shared storage: a machine could just broadcast "I'm touching the drives!" every second and a newly-elected master would have to wait until those messages stop for X seconds or until it sees "I'm not touching the drives!" before proceeding. That would be a little less reliable if the newly-elected master rebooted unless each machine keeps a persistent copy in local storage.

In that scheme, you would just have to make sure you started/stopped things in the right order.

Start:
1. Start greedy shouter.
2. Import ZFS pool.
3. ifup service interface. (Arguably doesn't even need CARP at this point.)
4. Start NFS/iSCSI.

Stop:
1. Stop NFS/iSCSI.
2. ifdown service interface.
3. Export ZFS pool.
4. Stop greedy shouter.

CARP loses a lot of value because it's not like TCP sessions for NFS or iSCSI can live migrate between machines anyway, but it might still be useful to make sure the interface IPs have the same MAC address. Either way, I think the interface in question should be explicitly marked up/down rather than utilizing CARP for automatic interface failover. I don't think it's a good idea for a service IP to jump to a machine if it's 100% certain that that machine won't be ready. That is particularly true in the case of a previously-down master returning to service alongside a working new master.
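For what it's worth, the "greedy shouter" wait rule and the start/stop ordering could be sketched roughly like this in Python. Everything here is an assumption for illustration: the message constants, the `safe_to_touch_drives` and `run_steps` helpers, and the 30-second quiet period are invented, and the step names are just labels rather than real commands.

```python
from typing import Callable, List, Optional

QUIET_SECONDS = 30.0          # the "X" from the text; an assumed value
TOUCHING = "touching"         # stands for "I'm touching the drives!"
RELEASED = "released"         # stands for "I'm not touching the drives!"

START_STEPS = ["start greedy shouter", "import ZFS pool",
               "ifup service interface", "start NFS/iSCSI"]
STOP_STEPS = ["stop NFS/iSCSI", "ifdown service interface",
              "export ZFS pool", "stop greedy shouter"]


def safe_to_touch_drives(now: float, listening_since: float,
                         last_msg: Optional[str] = None,
                         last_msg_time: Optional[float] = None,
                         quiet_seconds: float = QUIET_SECONDS) -> bool:
    """Decide whether a newly elected master may import the pool.

    Safe if the peer explicitly released the drives, or if we have been
    listening for quiet_seconds without hearing a TOUCHING message.
    """
    if last_msg == RELEASED:
        return True
    # Silence only counts from when we started listening (e.g. after a reboot).
    reference = listening_since
    if last_msg_time is not None:
        reference = max(listening_since, last_msg_time)
    return now - reference >= quiet_seconds


def run_steps(steps: List[str], run: Callable[[str], bool]) -> List[str]:
    """Run steps strictly in order; stop at the first failure.

    Returns the steps that completed, so the caller knows how far it got.
    """
    done: List[str] = []
    for step in steps:
        if not run(step):
            break
        done.append(step)
    return done


if __name__ == "__main__":
    # Peer shouted at t=100; at t=120 it is still too early, at t=130 it is not.
    print(safe_to_touch_drives(120.0, 90.0, TOUCHING, 100.0))  # False
    print(safe_to_touch_drives(130.0, 90.0, TOUCHING, 100.0))  # True
    print(run_steps(START_STEPS, lambda step: True))
```

Stopping at the first failed step matters most on the way up: if the import fails, the service interface never comes up and nothing gets advertised.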

Of course the simplest solution of all is just to not implement automated failover right away. If the machines are there and configured and there is 24x7 admin, just make sure they always boot up in standby mode and have to be manually promoted to master. The time it would take for an admin to log in to the standby server and type "the_student_is_now_the_master.sh" is still probably a huge improvement over whatever the present state of affairs is. :)

That would allow some time to examine real-world failure cases in a bit more detail, observe the decisions the admin makes about when to fail over, and maybe come up with a more resilient design that better models those decisions.

> SuperMicro does have that one chassis that accepts lots of drives and two
> custom motherboards that are linked internally via 10GB - I think ixsystems
> uses that. So in theory the edge case of the accidental 'master/master'
> configuration is helped by hardware. By the same token I am skeptical of
> having both head end units in a single chassis. Pardon me for being
> paranoid.

I tried to convince myself "it's OK as long as the only common part is sheet metal." But yes, I've seen that and as cool as it looks, it makes me nervous too.

> The hard work is always in the details, not the design?

Too right. Of course there's a whole other category of problems, like those where ZFS can run with a failed cache dev but sometimes won't import without it. Hopefully those types of problems are mostly behind us. I know I still read a lot of stuff on this list about ZFS that makes me even more nervous than putting all my eggs in one sheet metal basket.

Thanks!