Date: Thu, 27 Jun 2013 12:22:32 +0200
From: mxb <mxb@alumni.chalmers.se>
To: araujo@FreeBSD.org
Cc: "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject: Re: zpool export/import on failover - The pool metadata is corrupted
This solution is built on top of CARP. One of the nodes is the preferred
master (by way of advskew). The trigger chain is:
CARP -> devd -> failover_script.sh (zpool import/export).
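
For completeness, the CARP side is plain rc.conf. A rough sketch -- the
interface, VHID, password and address below are made up; the only part
that matters is the advskew difference, which makes nfs1 preferred:

# /etc/rc.conf on nfs1 (preferred master: lower advskew)
cloned_interfaces="carp0"
ifconfig_carp0="vhid 1 advskew 0 pass examplepass 192.168.10.100/24"

# /etc/rc.conf on nfs2 (backup: higher advskew)
cloned_interfaces="carp0"
ifconfig_carp0="vhid 1 advskew 100 pass examplepass 192.168.10.100/24"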

On 27 jun 2013, at 11:43, Marcelo Araujo <araujobsdport@gmail.com> wrote:

> For this failover solution, did you create a heartbeat or something
> like that? How do you avoid split-brain?
>
> Best Regards.
>
>
> 2013/6/27 mxb <mxb@alumni.chalmers.se>
>
> A note for the archives.
>
> I have so far not experienced any problems with either the local (per
> head unit) or the external (on the disk enclosure) caches while
> importing and exporting my pool. The disks I use in both nodes are
> identical - same manufacturer, size and model.
>
> da1,da2 - local
> da32,da33 - external
>
> Export/import is done WITHOUT removing/adding the local disks.
>
> root@nfs1:/root # zpool status
>   pool: jbod
>  state: ONLINE
>   scan: scrub repaired 0 in 0h0m with 0 errors on Wed Jun 26 13:14:55 2013
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         jbod          ONLINE       0     0     0
>           raidz3-0    ONLINE       0     0     0
>             da10      ONLINE       0     0     0
>             da11      ONLINE       0     0     0
>             da12      ONLINE       0     0     0
>             da13      ONLINE       0     0     0
>             da14      ONLINE       0     0     0
>             da15      ONLINE       0     0     0
>             da16      ONLINE       0     0     0
>             da17      ONLINE       0     0     0
>             da18      ONLINE       0     0     0
>             da19      ONLINE       0     0     0
>         logs
>           mirror-1    ONLINE       0     0     0
>             da32s1    ONLINE       0     0     0
>             da33s1    ONLINE       0     0     0
>         cache
>           da32s2      ONLINE       0     0     0
>           da33s2      ONLINE       0     0     0
>           da1         ONLINE       0     0     0
>           da2         ONLINE       0     0     0
>
> On 25 jun 2013, at 21:22, mxb <mxb@alumni.chalmers.se> wrote:
>
> >
> > I think I've found the root of this issue.
> > "Wiring down" the disks the same way on both nodes (as suggested)
> > appears to fix it.
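> >
> > For the archives: "wiring down" here means pinning the da unit
> > numbers in /boot/device.hints so they cannot float between boots or
> > between nodes. A rough, untested sketch -- the controller, bus and
> > target numbers are only examples and must match the real topology:
> >
> > hint.scbus.0.at="mps0"         # HBA attached to the enclosure
> > hint.da.32.at="scbus0"         # pin da32 to bus 0, target 10
> > hint.da.32.target="10"
> > hint.da.32.unit="0"
> > hint.da.33.at="scbus0"         # pin da33 to bus 0, target 11
> > hint.da.33.target="11"
> > hint.da.33.unit="0"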
> >
> > //mxb
> >
> > On 20 jun 2013, at 12:30, mxb <mxb@alumni.chalmers.se> wrote:
> >
> >>
> >> Well,
> >>
> >> I'm back to square one.
> >>
> >> After some uptime and successful import/export from one node to
> >> another, I eventually got 'metadata corruption'.
> >> I had no problem with import/export while, for example, rebooting
> >> the master node (nfs1), but not THIS time.
> >> Metadata got corrupted while rebooting the master node??
> >>
> >> Any ideas?
> >>
> >> [root@nfs1 ~]# zpool import
> >>    pool: jbod
> >>      id: 7663925948774378610
> >>   state: FAULTED
> >>  status: The pool metadata is corrupted.
> >>  action: The pool cannot be imported due to damaged devices or data.
> >>     see: http://illumos.org/msg/ZFS-8000-72
> >>  config:
> >>
> >>         jbod        FAULTED  corrupted data
> >>           raidz3-0  ONLINE
> >>             da3     ONLINE
> >>             da4     ONLINE
> >>             da5     ONLINE
> >>             da6     ONLINE
> >>             da7     ONLINE
> >>             da8     ONLINE
> >>             da9     ONLINE
> >>             da10    ONLINE
> >>             da11    ONLINE
> >>             da12    ONLINE
> >>         cache
> >>           da13s2
> >>           da14s2
> >>         logs
> >>           mirror-1  ONLINE
> >>             da13s1  ONLINE
> >>             da14s1  ONLINE
> >> [root@nfs1 ~]# zpool import jbod
> >> cannot import 'jbod': I/O error
> >>         Destroy and re-create the pool from
> >>         a backup source.
> >> [root@nfs1 ~]#
> >>
> >> On 11 jun 2013, at 10:46, mxb <mxb@alumni.chalmers.se> wrote:
> >>
> >>>
> >>> Thanks to everyone who replied.
> >>> Removing the local L2ARC cache disks (da1,da2) indeed turned out
> >>> to cure my problem.
> >>>
> >>> Next is to test add/remove around import/export, as Jeremy
> >>> suggested.
> >>>
> >>> //mxb
> >>>
> >>> On 7 jun 2013, at 01:34, Jeremy Chadwick <jdc@koitsu.org> wrote:
> >>>
> >>>> On Fri, Jun 07, 2013 at 12:51:14AM +0200, mxb wrote:
> >>>>>
> >>>>> Sure, the script is not perfect yet and does not handle many
> >>>>> corner cases, but shifting the spotlight from zpool
> >>>>> import/export to the script itself is not that clever, as the
> >>>>> script works most of the time.
> >>>>>
> >>>>> The question is WHY ZFS corrupts metadata when it should not.
> >>>>> Sometimes. I've also seen the zpool go stale when manually
> >>>>> importing/exporting the pool.
> >>>>>
> >>>>>
> >>>>> On 7 jun 2013, at 00:39, Jeremy Chadwick <jdc@koitsu.org> wrote:
> >>>>>
> >>>>>> On Fri, Jun 07, 2013 at 12:12:39AM +0200, mxb wrote:
> >>>>>>>
> >>>>>>> When the MASTER goes down, CARP on the second node becomes
> >>>>>>> MASTER (devd.conf, and a script for lifting the service):
> >>>>>>>
> >>>>>>> root@nfs2:/root # cat /etc/devd.conf
> >>>>>>>
> >>>>>>> notify 30 {
> >>>>>>>     match "system"          "IFNET";
> >>>>>>>     match "subsystem"       "carp0";
> >>>>>>>     match "type"            "LINK_UP";
> >>>>>>>     action "/etc/zfs_switch.sh active";
> >>>>>>> };
> >>>>>>>
> >>>>>>> notify 30 {
> >>>>>>>     match "system"          "IFNET";
> >>>>>>>     match "subsystem"       "carp0";
> >>>>>>>     match "type"            "LINK_DOWN";
> >>>>>>>     action "/etc/zfs_switch.sh backup";
> >>>>>>> };
> >>>>>>>
> >>>>>>> root@nfs2:/root # cat /etc/zfs_switch.sh
> >>>>>>> #!/bin/sh
> >>>>>>>
> >>>>>>> DATE=`date +%Y%m%d`
> >>>>>>> HOSTNAME=`hostname`
> >>>>>>>
> >>>>>>> ZFS_POOL="jbod"
> >>>>>>>
> >>>>>>> case $1 in
> >>>>>>>     active)
> >>>>>>>         # CARP went MASTER: take over the pool, restart NFS
> >>>>>>>         echo "Switching to ACTIVE and importing ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to ACTIVE' root
> >>>>>>>         sleep 10
> >>>>>>>         /sbin/zpool import -f jbod
> >>>>>>>         /etc/rc.d/mountd restart
> >>>>>>>         /etc/rc.d/nfsd restart
> >>>>>>>         ;;
> >>>>>>>     backup)
> >>>>>>>         # CARP went BACKUP: hand the pool over
> >>>>>>>         echo "Switching to BACKUP and exporting ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to BACKUP' root
> >>>>>>>         /sbin/zpool export jbod
> >>>>>>>         /etc/rc.d/mountd restart
> >>>>>>>         /etc/rc.d/nfsd restart
> >>>>>>>         ;;
> >>>>>>>     *)
> >>>>>>>         exit 0
> >>>>>>>         ;;
> >>>>>>> esac
> >>>>>>>
> >>>>>>> This works most of the time, but sometimes I'm forced to
> >>>>>>> re-create the pool. These machines are supposed to go into
> >>>>>>> production. Losing the pool (and the data inside it) stops me
> >>>>>>> from deploying this setup.
> >>>>>>
> >>>>>> This script looks highly error-prone. Hasty hasty... :-)
> >>>>>>
> >>>>>> This script assumes that the "zpool" commands (import and
> >>>>>> export) always work/succeed; there is no exit code ($?)
> >>>>>> checking being used.
> >>>>>>
> >>>>>> Since this is run from within devd(8): where does stdout/stderr
> >>>>>> go to when running a program/script under devd(8)? Does it
> >>>>>> effectively go to the bit bucket (/dev/null)? If so, you'd
> >>>>>> never know if the import or export actually succeeded or not
> >>>>>> (the export sounds more likely to be the problem point).
> >>>>>>
> >>>>>> I imagine there would be some situations where the export would
> >>>>>> fail (some files on filesystems under pool "jbod" still in
> >>>>>> use), yet CARP is already blindly assuming everything will be
> >>>>>> fantastic. Surprise.
> >>>>>>
> >>>>>> I also do not know if devd.conf(5) "action" commands spawn a
> >>>>>> sub-shell (/bin/sh) or not. If they don't, you won't be able to
> >>>>>> use things like
> >>>>>> 'action "/etc/zfs_switch.sh active >> /var/log/failover.log";'.
> >>>>>> You would then need to implement the equivalent of logging
> >>>>>> within your zfs_switch.sh script.
> >>>>>>
> >>>>>> You may want to consider the -f flag to zpool import/export
> >>>>>> (particularly export). However, there are risks involved --
> >>>>>> userland applications which have an fd/fh open on a file which
> >>>>>> is stored on a filesystem that has now completely disappeared
> >>>>>> can sometimes crash (segfault) or behave very oddly (100% CPU
> >>>>>> usage, etc.) depending on how they're designed.
> >>>>>>
> >>>>>> Basically what I'm trying to say is that devd(8) being used as
> >>>>>> a form of HA (high availability) and load balancing is not
> >>>>>> always possible. Real/true HA (especially with SANs) is often
> >>>>>> done very differently (now you know why it's often
> >>>>>> proprietary. :-) )
> >>>>
> >>>> Add error checking to your script. That's my first and foremost
> >>>> recommendation. It's not hard to do, really. :-)
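> >>>>
> >>>> For example, the export branch could look something like this
> >>>> (an untested sketch -- the log path and mail wording are just
> >>>> placeholders):
> >>>>
> >>>> LOG="/var/log/zfs_switch.log"
> >>>>
> >>>>     backup)
> >>>>         # Only report success if the export actually succeeded
> >>>>         if /sbin/zpool export jbod >> $LOG 2>&1; then
> >>>>             echo "`date`: export of jbod OK" >> $LOG
> >>>>         else
> >>>>             echo "`date`: export of jbod FAILED, rc=$?" >> $LOG
> >>>>             echo "zpool export jbod failed on `hostname`" | \
> >>>>                 mail -s 'FAILOVER ERROR: export failed' root
> >>>>             exit 1
> >>>>         fi
> >>>>         ;;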
> >>>>
> >>>> After you do that and still experience the issue (e.g. you see no
> >>>> actual errors/issues during the export/import phases), I
> >>>> recommend removing the "cache" devices which are "independent" on
> >>>> each system from the pool entirely. Quoting you (for readers,
> >>>> since I snipped it from my previous reply):
> >>>>
> >>>>>>> Note, that ZIL(mirrored) resides on external enclosure. Only L2ARC
> >>>>>>> is both local and external - da1,da2, da13s2, da14s2
> >>>>
> >>>> I interpret this to mean the primary and backup nodes (physical
> >>>> systems) have actual disks which are not part of the "external
> >>>> enclosure". If that's the case -- those disks are always going to
> >>>> vary in their contents and metadata. Those are never going to be
> >>>> 100% identical all the time (is this not obvious?). I'm surprised
> >>>> your stuff has worked at all using that model, honestly.
> >>>>
> >>>> ZFS is going to bitch/cry if it cannot verify the integrity of
> >>>> certain things, all the way down to the L2ARC. That's my
> >>>> understanding of it at least, meaning there must always be "some"
> >>>> kind of metadata that has to be kept/maintained there.
> >>>>
> >>>> Alternately you could try doing this:
> >>>>
> >>>> zpool remove jbod cache daX daY ...
> >>>> zpool export jbod
> >>>>
> >>>> Then on the other system:
> >>>>
> >>>> zpool import jbod
> >>>> zpool add jbod cache daX daY ...
> >>>>
> >>>> Where daX and daY are the disks which are independent to each
> >>>> system (not on the "external enclosure").
> >>>>
> >>>> Finally, it would also be useful/worthwhile if you would provide
> >>>> "dmesg" from both systems and explain the physical wiring, along
> >>>> with which device (e.g. daX) correlates with which exact thing on
> >>>> each system. (Right now we have no knowledge of that, and your
> >>>> terse explanations imply we do -- we need to know more.)
> >>>>
> >>>> --
> >>>> | Jeremy Chadwick                                   jdc@koitsu.org |
> >>>> | UNIX Systems Administrator                http://jdc.koitsu.org/ |
> >>>> | Making life hard for others since 1977.             PGP 4BD6C0CB |
> >>>>
>
> --
> Marcelo Araujo
> araujo@FreeBSD.org