From owner-freebsd-fs@FreeBSD.ORG  Thu Jun 27 09:43:19 2013
Date: Thu, 27 Jun 2013 17:43:17 +0800
From: Marcelo Araujo
Reply-To: araujo@FreeBSD.org
To: mxb
Cc: "freebsd-fs@freebsd.org"
Subject: Re: zpool export/import on failover - The pool metadata is corrupted
List-Id: Filesystems

For this failover solution, did you create a heartbeat or something like
that?  How do you avoid split-brain?

Best Regards.

2013/6/27 mxb

>
> A note for the archives.
>
> So far I have not experienced any problems with either the local (per
> head unit) or the external (on the disk enclosure) caches while
> importing and exporting my pool.  The disks I use on both nodes are
> identical: same manufacturer, size and model.
>
> da1, da2   - local
> da32, da33 - external
>
> Export/import is done WITHOUT removing/adding the local disks.
>
> root@nfs1:/root # zpool status
>   pool: jbod
>  state: ONLINE
>   scan: scrub repaired 0 in 0h0m with 0 errors on Wed Jun 26 13:14:55 2013
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         jbod          ONLINE       0     0     0
>           raidz3-0    ONLINE       0     0     0
>             da10      ONLINE       0     0     0
>             da11      ONLINE       0     0     0
>             da12      ONLINE       0     0     0
>             da13      ONLINE       0     0     0
>             da14      ONLINE       0     0     0
>             da15      ONLINE       0     0     0
>             da16      ONLINE       0     0     0
>             da17      ONLINE       0     0     0
>             da18      ONLINE       0     0     0
>             da19      ONLINE       0     0     0
>         logs
>           mirror-1    ONLINE       0     0     0
>             da32s1    ONLINE       0     0     0
>             da33s1    ONLINE       0     0     0
>         cache
>           da32s2      ONLINE       0     0     0
>           da33s2      ONLINE       0     0     0
>           da1         ONLINE       0     0     0
>           da2         ONLINE       0     0     0
>
> On 25 jun 2013, at 21:22, mxb wrote:
>
> >
> > I think I've found the root of this issue.
> > It looks like "wiring down" the disks the same way on both nodes (as
> > suggested) fixes this issue.
> >
> > //mxb
> >
> > On 20 jun 2013, at 12:30, mxb wrote:
> >
> >>
> >> Well,
> >>
> >> I'm back to square one.
> >>
> >> After some uptime and successful import/export from one node to
> >> another, I eventually got 'metadata corruption'.
> >> I previously had no problem with import/export while, for example,
> >> rebooting the master node (nfs1), but not THIS time.
> >> Did the metadata get corrupted while rebooting the master node??
> >>
> >> Any ideas?
> >>
> >> [root@nfs1 ~]# zpool import
> >>    pool: jbod
> >>      id: 7663925948774378610
> >>   state: FAULTED
> >>  status: The pool metadata is corrupted.
> >>  action: The pool cannot be imported due to damaged devices or data.
> >>     see: http://illumos.org/msg/ZFS-8000-72
> >>  config:
> >>
> >>         jbod        FAULTED  corrupted data
> >>           raidz3-0  ONLINE
> >>             da3     ONLINE
> >>             da4     ONLINE
> >>             da5     ONLINE
> >>             da6     ONLINE
> >>             da7     ONLINE
> >>             da8     ONLINE
> >>             da9     ONLINE
> >>             da10    ONLINE
> >>             da11    ONLINE
> >>             da12    ONLINE
> >>         cache
> >>           da13s2
> >>           da14s2
> >>         logs
> >>           mirror-1  ONLINE
> >>             da13s1  ONLINE
> >>             da14s1  ONLINE
> >> [root@nfs1 ~]# zpool import jbod
> >> cannot import 'jbod': I/O error
> >>         Destroy and re-create the pool from
> >>         a backup source.
> >> [root@nfs1 ~]#
> >>
> >> On 11 jun 2013, at 10:46, mxb wrote:
> >>
> >>>
> >>> Thanks to everyone who replied.
> >>> Removing the local L2ARC cache disks (da1, da2) did indeed turn out
> >>> to be the cure for my problem.
> >>>
> >>> Next is to test add/remove around import/export, as Jeremy suggested.
> >>>
> >>> //mxb
> >>>
> >>> On 7 jun 2013, at 01:34, Jeremy Chadwick wrote:
> >>>
> >>>> On Fri, Jun 07, 2013 at 12:51:14AM +0200, mxb wrote:
> >>>>>
> >>>>> Sure, the script is not perfect yet and does not handle a lot of
> >>>>> cases, but shifting the spotlight from zpool import/export onto the
> >>>>> script itself is not that clever, as this works most of the time.
> >>>>>
> >>>>> The question is WHY ZFS corrupts metadata when it should not.
> >>>>> Sometimes.  I've seen the zpool go stale when manually
> >>>>> importing/exporting the pool.
> >>>>>
> >>>>> On 7 jun 2013, at 00:39, Jeremy Chadwick wrote:
> >>>>>
> >>>>>> On Fri, Jun 07, 2013 at 12:12:39AM +0200, mxb wrote:
> >>>>>>>
> >>>>>>> When the MASTER goes down, CARP on the second node becomes MASTER
> >>>>>>> (devd.conf, and the script that does the lifting):
> >>>>>>>
> >>>>>>> root@nfs2:/root # cat /etc/devd.conf
> >>>>>>>
> >>>>>>> notify 30 {
> >>>>>>>        match "system"          "IFNET";
> >>>>>>>        match "subsystem"       "carp0";
> >>>>>>>        match "type"            "LINK_UP";
> >>>>>>>        action "/etc/zfs_switch.sh active";
> >>>>>>> };
> >>>>>>>
> >>>>>>> notify 30 {
> >>>>>>>        match "system"          "IFNET";
> >>>>>>>        match "subsystem"       "carp0";
> >>>>>>>        match "type"            "LINK_DOWN";
> >>>>>>>        action "/etc/zfs_switch.sh backup";
> >>>>>>> };
> >>>>>>>
> >>>>>>> root@nfs2:/root # cat /etc/zfs_switch.sh
> >>>>>>> #!/bin/sh
> >>>>>>>
> >>>>>>> DATE=`date +%Y%m%d`
> >>>>>>> HOSTNAME=`hostname`
> >>>>>>>
> >>>>>>> ZFS_POOL="jbod"
> >>>>>>>
> >>>>>>> case $1 in
> >>>>>>>     active)
> >>>>>>>         echo "Switching to ACTIVE and importing ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to ACTIVE' root
> >>>>>>>         sleep 10
> >>>>>>>         /sbin/zpool import -f jbod
> >>>>>>>         /etc/rc.d/mountd restart
> >>>>>>>         /etc/rc.d/nfsd restart
> >>>>>>>         ;;
> >>>>>>>     backup)
> >>>>>>>         echo "Switching to BACKUP and exporting ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to BACKUP' root
> >>>>>>>         /sbin/zpool export jbod
> >>>>>>>         /etc/rc.d/mountd restart
> >>>>>>>         /etc/rc.d/nfsd restart
> >>>>>>>         ;;
> >>>>>>>     *)
> >>>>>>>         exit 0
> >>>>>>>         ;;
> >>>>>>> esac
> >>>>>>>
> >>>>>>> This works most of the time, but sometimes I'm forced to re-create
> >>>>>>> the pool.  These machines are supposed to go into production.
> >>>>>>> Losing the pool (and the data inside it) stops me from deploying
> >>>>>>> this setup.
> >>>>>>
> >>>>>> This script looks highly error-prone.  Hasty hasty... :-)
> >>>>>>
> >>>>>> This script assumes that the "zpool" commands (import and export)
> >>>>>> always work/succeed; there is no exit code ($?) checking being used.
> >>>>>>
> >>>>>> Since this is run from within devd(8): where does stdout/stderr go
> >>>>>> when running a program/script under devd(8)?  Does it effectively go
> >>>>>> to the bit bucket (/dev/null)?  If so, you'd never know if the import
> >>>>>> or export actually succeeded or not (the export sounds more likely to
> >>>>>> be the problem point).
> >>>>>>
> >>>>>> I imagine there would be some situations where the export would fail
> >>>>>> (some files on filesystems under pool "jbod" still in use), yet CARP
> >>>>>> is already blindly assuming everything will be fantastic.  Surprise.
> >>>>>>
> >>>>>> I also do not know if devd.conf(5) "action" commands spawn a
> >>>>>> sub-shell (/bin/sh) or not.  If they don't, you won't be able to use
> >>>>>> things like 'action "/etc/zfs_switch.sh active >> /var/log/failover.log";'.
> >>>>>> You would then need to implement the equivalent of logging within
> >>>>>> your zfs_switch.sh script.
> >>>>>>
> >>>>>> You may want to consider the -f flag to zpool import/export
> >>>>>> (particularly export).  However, there are risks involved -- userland
> >>>>>> applications which have an fd/fh open on a file which is stored on a
> >>>>>> filesystem that has now completely disappeared can sometimes crash
> >>>>>> (segfault) or behave very oddly (100% CPU usage, etc.) depending on
> >>>>>> how they're designed.
> >>>>>>
> >>>>>> Basically what I'm trying to say is that devd(8) being used as a form
> >>>>>> of HA (high availability) and load balancing is not always possible.
> >>>>>> Real/true HA (especially with SANs) is often done very differently
> >>>>>> (now you know why it's often proprietary. :-) )
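A minimal sketch of what the exit-status checking and in-script logging
described above might look like in zfs_switch.sh.  None of this is from the
thread: the log file path, the notify() helper and the messages are
illustrative assumptions only.

#!/bin/sh
#
# Hypothetical revision of /etc/zfs_switch.sh: check the exit status of
# each zpool command and log from inside the script, since output from a
# devd(8) action may never be seen.

DATE=`date +%Y%m%d`
HOSTNAME=`hostname`

ZFS_POOL="jbod"
LOG="/var/log/zfs_switch.log"          # illustrative location

log() {
        echo "`date '+%Y-%m-%d %H:%M:%S'` $*" >> ${LOG}
}

notify() {
        # $1 = subject suffix, $2 = body; mails root like the original script
        echo "$2" | mail -s "${DATE}: ${HOSTNAME} $1" root
}

case $1 in
active)
        log "CARP MASTER: importing ${ZFS_POOL}"
        notify "switching to ACTIVE" "Switching to ACTIVE and importing ZFS"
        sleep 10
        /sbin/zpool import -f ${ZFS_POOL} >> ${LOG} 2>&1
        rc=$?
        if [ ${rc} -ne 0 ]; then
                log "zpool import ${ZFS_POOL} failed, exit code ${rc}"
                notify "zpool import FAILED" "zpool import ${ZFS_POOL} exited ${rc}"
                exit 1
        fi
        /etc/rc.d/mountd restart >> ${LOG} 2>&1
        /etc/rc.d/nfsd restart >> ${LOG} 2>&1
        log "now ACTIVE"
        ;;
backup)
        log "CARP BACKUP: exporting ${ZFS_POOL}"
        notify "switching to BACKUP" "Switching to BACKUP and exporting ZFS"
        /sbin/zpool export ${ZFS_POOL} >> ${LOG} 2>&1
        rc=$?
        if [ ${rc} -ne 0 ]; then
                log "zpool export ${ZFS_POOL} failed, exit code ${rc}"
                notify "zpool export FAILED" "zpool export ${ZFS_POOL} exited ${rc}"
                exit 1
        fi
        /etc/rc.d/mountd restart >> ${LOG} 2>&1
        /etc/rc.d/nfsd restart >> ${LOG} 2>&1
        log "now BACKUP"
        ;;
*)
        exit 0
        ;;
esac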
> >>>>
> >>>> Add error checking to your script.  That's my first and foremost
> >>>> recommendation.  It's not hard to do, really. :-)
> >>>>
> >>>> After you do that and still experience the issue (e.g. you see no
> >>>> actual errors/issues during the export/import phases), I recommend
> >>>> removing the "cache" devices which are "independent" on each system
> >>>> from the pool entirely.  Quoting you (for readers, since I snipped it
> >>>> from my previous reply):
> >>>>
> >>>>>>> Note that the ZIL (mirrored) resides on the external enclosure.
> >>>>>>> Only the L2ARC is both local and external - da1, da2, da13s2, da14s2
> >>>>
> >>>> I interpret this to mean the primary and backup nodes (physical
> >>>> systems) have actual disks which are not part of the "external
> >>>> enclosure".  If that's the case -- those disks are always going to
> >>>> vary in their contents and metadata.  Those are never going to be 100%
> >>>> identical all the time (is this not obvious?).  I'm surprised your
> >>>> stuff has worked at all using that model, honestly.
> >>>>
> >>>> ZFS is going to bitch/cry if it cannot verify the integrity of certain
> >>>> things, all the way down to the L2ARC.  That's my understanding of it
> >>>> at least, meaning there must always be "some" kind of metadata that
> >>>> has to be kept/maintained there.
> >>>>
> >>>> Alternately you could try doing this:
> >>>>
> >>>>   zpool remove jbod cache daX daY ...
> >>>>   zpool export jbod
> >>>>
> >>>> Then on the other system:
> >>>>
> >>>>   zpool import jbod
> >>>>   zpool add jbod cache daX daY ...
> >>>>
> >>>> Where daX and daY are the disks which are independent to each system
> >>>> (not on the "external enclosure").
> >>>>
> >>>> Finally, it would also be useful/worthwhile if you would provide
> >>>> "dmesg" from both systems and for you to explain the physical wiring,
> >>>> along with what device (e.g. daX) correlates with what exact thing on
> >>>> each system.  (We right now have no knowledge of that, and your terse
> >>>> explanations imply we do -- we need to know more.)
> >>>>
> >>>> --
> >>>> | Jeremy Chadwick                                   jdc@koitsu.org |
> >>>> | UNIX Systems Administrator                http://jdc.koitsu.org/ |
> >>>> | Making life hard for others since 1977.             PGP 4BD6C0CB |
> >>>>
> >>>
> >>
> >
>
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
>

--
Marcelo Araujo
araujo@FreeBSD.org
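Jeremy's suggested remove/export and import/add sequence above, written out
as a small shell sketch with basic exit-status checks.  The pool and device
names follow the thread (jbod, da1/da2 as the node-local cache disks);
everything else is an illustrative assumption.  Note that zpool remove takes
the device names directly, while zpool add needs the "cache" keyword.

#!/bin/sh
# Illustrative only: drop the node-local L2ARC devices before exporting,
# and re-add the new node's local devices after importing.

POOL="jbod"
LOCAL_CACHE="da1 da2"        # this node's local cache disks (placeholder)

case $1 in
release)
        # on the node giving the pool away
        zpool remove ${POOL} ${LOCAL_CACHE} || exit 1
        zpool export ${POOL}                || exit 1
        ;;
takeover)
        # on the node taking the pool over
        zpool import ${POOL}                    || exit 1
        zpool add ${POOL} cache ${LOCAL_CACHE}  || exit 1
        ;;
esac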
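On the "wiring down" fix mxb mentions in the 25 Jun message: on FreeBSD this
is normally done with CAM hints in /boot/device.hints, so that the same
enclosure slot always probes as the same daN on both heads.  The controller,
bus, target and unit numbers below are placeholders and would have to match
each node's real topology.

# /boot/device.hints (kept identical on both nodes) - placeholder numbers
hint.scbus.3.at="mps0"       # pin SCSI bus 3 to the external HBA instance
hint.da.10.at="scbus3"       # this target/LUN always becomes da10
hint.da.10.target="0"
hint.da.10.unit="0"
hint.da.11.at="scbus3"
hint.da.11.target="1"
hint.da.11.unit="0"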