Date: Thu, 27 Jun 2013 17:43:17 +0800
From: Marcelo Araujo <araujobsdport@gmail.com>
To: mxb <mxb@alumni.chalmers.se>
Cc: "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject: Re: zpool export/import on failover - The pool metadata is corrupted
Message-ID: <CAOfEmZj=12VOEv6RRQUAmRtm6Mp%2BxHo47DwT%2BwmUDqmRyQJU3w@mail.gmail.com>
In-Reply-To: <47B6A89F-6444-485A-88DD-69A9A93D9B3F@alumni.chalmers.se>
References: <D7F099CB-855F-43F8-ACB5-094B93201B4B@alumni.chalmers.se>
 <CAKYr3zyPLpLau8xsv3fCkYrpJVzS0tXkyMn4E2aLz29EMBF9cA@mail.gmail.com>
 <016B635E-4EDC-4CDF-AC58-82AC39CBFF56@alumni.chalmers.se>
 <20130606223911.GA45807@icarus.home.lan>
 <C3FC39B3-D09F-4E73-9476-3BFC8B817278@alumni.chalmers.se>
 <20130606233417.GA46506@icarus.home.lan>
 <61E414CF-FCD3-42BB-9533-A40EA934DB99@alumni.chalmers.se>
 <09717048-12BE-474B-9B20-F5E72D00152E@alumni.chalmers.se>
 <5A26ABDE-C7F2-41CC-A3D1-69310AB6BC36@alumni.chalmers.se>
 <47B6A89F-6444-485A-88DD-69A9A93D9B3F@alumni.chalmers.se>
For this failover solution, did you create a heartbeat or something similar?
How do you avoid split-brain?

Best Regards.

2013/6/27 mxb <mxb@alumni.chalmers.se>

>
> Notation for archives.
>
> I have so far not experienced any problems with both local (per head unit)
> and external (on disk enclosure) caches while importing and exporting my
> pool. The disks I use on both nodes are identical - manufacturer, size,
> model.
>
> da1,da2   - local
> da32,da33 - external
>
> Export/import is done WITHOUT removing/adding local disks.
>
> root@nfs1:/root # zpool status
>   pool: jbod
>  state: ONLINE
>   scan: scrub repaired 0 in 0h0m with 0 errors on Wed Jun 26 13:14:55 2013
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         jbod          ONLINE       0     0     0
>           raidz3-0    ONLINE       0     0     0
>             da10      ONLINE       0     0     0
>             da11      ONLINE       0     0     0
>             da12      ONLINE       0     0     0
>             da13      ONLINE       0     0     0
>             da14      ONLINE       0     0     0
>             da15      ONLINE       0     0     0
>             da16      ONLINE       0     0     0
>             da17      ONLINE       0     0     0
>             da18      ONLINE       0     0     0
>             da19      ONLINE       0     0     0
>         logs
>           mirror-1    ONLINE       0     0     0
>             da32s1    ONLINE       0     0     0
>             da33s1    ONLINE       0     0     0
>         cache
>           da32s2      ONLINE       0     0     0
>           da33s2      ONLINE       0     0     0
>           da1         ONLINE       0     0     0
>           da2         ONLINE       0     0     0
>
> On 25 jun 2013, at 21:22, mxb <mxb@alumni.chalmers.se> wrote:
>
> >
> > I think I've found the root of this issue.
> > Looks like "wiring down" disks the same way on both nodes (as suggested)
> > fixes this issue.
> >
> > //mxb
> >
> > On 20 jun 2013, at 12:30, mxb <mxb@alumni.chalmers.se> wrote:
> >
> >>
> >> Well,
> >>
> >> I'm back to square one.
> >>
> >> After some uptime and successful import/export from one node to
> >> another, I eventually got 'metadata corruption'.
> >> I had no problem with import/export while, for example, rebooting the
> >> master node (nfs1), but not THIS time.
> >> Metadata got corrupted while rebooting the master node??
> >>
> >> Any ideas?
> >>
> >> [root@nfs1 ~]# zpool import
> >>    pool: jbod
> >>      id: 7663925948774378610
> >>   state: FAULTED
> >>  status: The pool metadata is corrupted.
> >>  action: The pool cannot be imported due to damaged devices or data.
> >>    see: http://illumos.org/msg/ZFS-8000-72
> >>  config:
> >>
> >>         jbod          FAULTED  corrupted data
> >>           raidz3-0    ONLINE
> >>             da3       ONLINE
> >>             da4       ONLINE
> >>             da5       ONLINE
> >>             da6       ONLINE
> >>             da7       ONLINE
> >>             da8       ONLINE
> >>             da9       ONLINE
> >>             da10      ONLINE
> >>             da11      ONLINE
> >>             da12      ONLINE
> >>         cache
> >>           da13s2
> >>           da14s2
> >>         logs
> >>           mirror-1    ONLINE
> >>             da13s1    ONLINE
> >>             da14s1    ONLINE
> >> [root@nfs1 ~]# zpool import jbod
> >> cannot import 'jbod': I/O error
> >>         Destroy and re-create the pool from
> >>         a backup source.
> >> [root@nfs1 ~]#
> >>
> >> On 11 jun 2013, at 10:46, mxb <mxb@alumni.chalmers.se> wrote:
> >>
> >>>
> >>> Thanks to everyone who replied.
> >>> Removing the local L2ARC cache disks (da1, da2) indeed turned out to
> >>> be the cure for my problem.
> >>>
> >>> Next is to test with add/remove after import/export, as Jeremy
> >>> suggested.
> >>>
> >>> //mxb
> >>>
> >>> On 7 jun 2013, at 01:34, Jeremy Chadwick <jdc@koitsu.org> wrote:
> >>>
> >>>> On Fri, Jun 07, 2013 at 12:51:14AM +0200, mxb wrote:
> >>>>>
> >>>>> Sure, the script is not perfect yet and does not handle many things,
> >>>>> but shifting the focus from zpool import/export to the script itself
> >>>>> is not that clever, as this works most of the time.
> >>>>>
> >>>>> The question is WHY ZFS corrupts metadata when it should not.
> >>>>> Sometimes. I've seen the zpool go stale when manually
> >>>>> importing/exporting the pool.
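[Editor's note, not from the thread: for readers hitting the same "pool
metadata is corrupted" / "cannot import: I/O error" state, ZFS pool
recovery (the -F flag to zpool import, available since pool version 28)
can sometimes make such a pool importable again by discarding the last
few transactions. The pool name below matches the thread; whether this
would have helped in this specific case is unknown.]

```shell
# Dry run (-n with -F): report whether discarding the last few
# transactions would make the pool importable, without importing it.
zpool import -Fn jbod

# If the dry run looks sane, attempt the rewind import for real;
# a few seconds of the most recent writes may be lost.
zpool import -F jbod
```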
> >>>>>
> >>>>> On 7 jun 2013, at 00:39, Jeremy Chadwick <jdc@koitsu.org> wrote:
> >>>>>
> >>>>>> On Fri, Jun 07, 2013 at 12:12:39AM +0200, mxb wrote:
> >>>>>>>
> >>>>>>> When the MASTER goes down, CARP on the second node becomes MASTER
> >>>>>>> (devd.conf, and a script for lifting):
> >>>>>>>
> >>>>>>> root@nfs2:/root # cat /etc/devd.conf
> >>>>>>>
> >>>>>>> notify 30 {
> >>>>>>>         match "system"          "IFNET";
> >>>>>>>         match "subsystem"       "carp0";
> >>>>>>>         match "type"            "LINK_UP";
> >>>>>>>         action "/etc/zfs_switch.sh active";
> >>>>>>> };
> >>>>>>>
> >>>>>>> notify 30 {
> >>>>>>>         match "system"          "IFNET";
> >>>>>>>         match "subsystem"       "carp0";
> >>>>>>>         match "type"            "LINK_DOWN";
> >>>>>>>         action "/etc/zfs_switch.sh backup";
> >>>>>>> };
> >>>>>>>
> >>>>>>> root@nfs2:/root # cat /etc/zfs_switch.sh
> >>>>>>> #!/bin/sh
> >>>>>>>
> >>>>>>> DATE=`date +%Y%m%d`
> >>>>>>> HOSTNAME=`hostname`
> >>>>>>>
> >>>>>>> ZFS_POOL="jbod"
> >>>>>>>
> >>>>>>> case $1 in
> >>>>>>>     active)
> >>>>>>>         echo "Switching to ACTIVE and importing ZFS" | \
> >>>>>>>             mail -s ''$DATE': '$HOSTNAME' switching to ACTIVE' root
> >>>>>>>         sleep 10
> >>>>>>>         /sbin/zpool import -f jbod
> >>>>>>>         /etc/rc.d/mountd restart
> >>>>>>>         /etc/rc.d/nfsd restart
> >>>>>>>         ;;
> >>>>>>>     backup)
> >>>>>>>         echo "Switching to BACKUP and exporting ZFS" | \
> >>>>>>>             mail -s ''$DATE': '$HOSTNAME' switching to BACKUP' root
> >>>>>>>         /sbin/zpool export jbod
> >>>>>>>         /etc/rc.d/mountd restart
> >>>>>>>         /etc/rc.d/nfsd restart
> >>>>>>>         ;;
> >>>>>>>     *)
> >>>>>>>         exit 0
> >>>>>>>         ;;
> >>>>>>> esac
> >>>>>>>
> >>>>>>> This works most of the time, but sometimes I'm forced to re-create
> >>>>>>> the pool. Those machines are supposed to go into production.
> >>>>>>> Losing the pool (and the data inside it) stops me from deploying
> >>>>>>> this setup.
> >>>>>>
> >>>>>> This script looks highly error-prone. Hasty hasty... :-)
> >>>>>>
> >>>>>> This script assumes that the "zpool" commands (import and export)
> >>>>>> always work/succeed; there is no exit code ($?) checking being used.
> >>>>>>
> >>>>>> Since this is run from within devd(8): where does stdout/stderr go
> >>>>>> when running a program/script under devd(8)? Does it effectively go
> >>>>>> to the bit bucket (/dev/null)? If so, you'd never know if the
> >>>>>> import or export actually succeeded or not (the export sounds more
> >>>>>> likely to be the problem point).
> >>>>>>
> >>>>>> I imagine there would be some situations where the export would
> >>>>>> fail (some files on filesystems under pool "jbod" still in use),
> >>>>>> yet CARP is already blindly assuming everything will be fantastic.
> >>>>>> Surprise.
> >>>>>>
> >>>>>> I also do not know if devd.conf(5) "action" commands spawn a
> >>>>>> sub-shell (/bin/sh) or not. If they don't, you won't be able to use
> >>>>>> things like
> >>>>>> 'action "/etc/zfs_switch.sh active >> /var/log/failover.log";'.
> >>>>>> You would then need to implement the equivalent of logging within
> >>>>>> your zfs_switch.sh script.
> >>>>>>
> >>>>>> You may want to consider the -f flag to zpool import/export
> >>>>>> (particularly export). However, there are risks involved -- userland
> >>>>>> applications which have an fd/fh open on a file which is stored on a
> >>>>>> filesystem that has now completely disappeared can sometimes crash
> >>>>>> (segfault) or behave very oddly (100% CPU usage, etc.) depending on
> >>>>>> how they're designed.
> >>>>>>
> >>>>>> Basically what I'm trying to say is that devd(8) being used as a
> >>>>>> form of HA (high availability) and load balancing is not always
> >>>>>> possible. Real/true HA (especially with SANs) is often done very
> >>>>>> differently (now you know why it's often proprietary. :-) )
> >>>>
> >>>> Add error checking to your script. That's my first and foremost
> >>>> recommendation. It's not hard to do, really. :-)
> >>>>
> >>>> After you do that and still experience the issue (e.g. you see no
> >>>> actual errors/issues during the export/import phases), I recommend
> >>>> removing the "cache" devices which are "independent" on each system
> >>>> from the pool entirely. Quoting you (for readers, since I snipped it
> >>>> from my previous reply):
> >>>>
> >>>>>>> Note that the ZIL (mirrored) resides on the external enclosure.
> >>>>>>> Only the L2ARC is both local and external - da1, da2, da13s2,
> >>>>>>> da14s2
> >>>>
> >>>> I interpret this to mean the primary and backup nodes (physical
> >>>> systems) have actual disks which are not part of the "external
> >>>> enclosure". If that's the case -- those disks are always going to
> >>>> vary in their contents and metadata. Those are never going to be
> >>>> 100% identical all the time (is this not obvious?). I'm surprised
> >>>> your stuff has worked at all using that model, honestly.
> >>>>
> >>>> ZFS is going to bitch/cry if it cannot verify the integrity of
> >>>> certain things, all the way down to the L2ARC. That's my
> >>>> understanding of it at least, meaning there must always be "some"
> >>>> kind of metadata that has to be kept/maintained there.
> >>>>
> >>>> Alternately you could try doing this:
> >>>>
> >>>>   zpool remove jbod cache daX daY ...
> >>>>   zpool export jbod
> >>>>
> >>>> Then on the other system:
> >>>>
> >>>>   zpool import jbod
> >>>>   zpool add jbod cache daX daY ...
> >>>>
> >>>> Where daX and daY are the disks which are independent to each system
> >>>> (not on the "external enclosure").
> >>>>
> >>>> Finally, it would also be useful/worthwhile if you would provide
> >>>> "dmesg" from both systems and for you to explain the physical wiring
> >>>> along with what device (e.g. daX) correlates with what exact thing
> >>>> on each system.
> >>>> (We right now have no knowledge of that, and your terse explanations
> >>>> imply we do -- we need to know more.)
> >>>>
> >>>> --
> >>>> | Jeremy Chadwick                                   jdc@koitsu.org |
> >>>> | UNIX Systems Administrator                http://jdc.koitsu.org/ |
> >>>> | Making life hard for others since 1977.             PGP 4BD6C0CB |
> >>>>
> >>>
> >>
> >
>

--
Marcelo Araujo
araujo@FreeBSD.org
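[Editor's note: the remove/export and import/add sequence Jeremy outlines
can be sketched as a pair of helpers. The device names, the DRYRUN knob,
and the function names are assumptions for illustration; note also that
`zpool remove` takes the devices directly, with no `cache` keyword,
while `zpool add` does use the `cache` vdev-type keyword.]

```shell
#!/bin/sh
# Sketch of the suggested failover sequence: drop the node-local L2ARC
# devices before export, re-add them after import on the other node.
POOL="jbod"
CACHE_DEVS="da1 da2"    # this head unit's local cache disks (assumed)

# Run a command, aborting on failure; with DRYRUN set, only print it.
run() {
    if [ -n "$DRYRUN" ]; then
        echo "would run: $*"
        return 0
    fi
    "$@" || { echo "FAILED: $*" >&2; exit 1; }
}

release_pool() {    # on the node giving the pool up
    run zpool remove "$POOL" $CACHE_DEVS       # detach local cache disks
    run zpool export "$POOL"
}

take_pool() {       # on the node taking the pool over
    run zpool import "$POOL"
    run zpool add "$POOL" cache $CACHE_DEVS    # re-add this node's cache
}
```

Keeping the per-node cache disks out of the pool across the handover is
exactly what the thread found avoided the metadata-corruption symptom.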