From owner-freebsd-fs@FreeBSD.ORG Thu Jun 27 09:36:02 2013
From: mxb
To: Jeremy Chadwick
Cc: "freebsd-fs@freebsd.org"
Date: Thu, 27 Jun 2013 10:34:22 +0200
Subject: Re: zpool export/import on failover - The pool metadata is corrupted
Message-Id: <47B6A89F-6444-485A-88DD-69A9A93D9B3F@alumni.chalmers.se>
In-Reply-To: <5A26ABDE-C7F2-41CC-A3D1-69310AB6BC36@alumni.chalmers.se>
References: <016B635E-4EDC-4CDF-AC58-82AC39CBFF56@alumni.chalmers.se>
 <20130606223911.GA45807@icarus.home.lan>
 <20130606233417.GA46506@icarus.home.lan>
 <61E414CF-FCD3-42BB-9533-A40EA934DB99@alumni.chalmers.se>
 <09717048-12BE-474B-9B20-F5E72D00152E@alumni.chalmers.se>
 <5A26ABDE-C7F2-41CC-A3D1-69310AB6BC36@alumni.chalmers.se>

A note for the archives.

So far I have not experienced any problems with both local (per head unit)
and external (in the disk enclosure) caches while importing and exporting
my pool. The disks I use on both nodes are identical: same manufacturer,
size and model.
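For reference, the "wiring down" mentioned further down is done in
/boot/device.hints, pinning each da(4) unit number to a fixed
controller/bus/target so that both heads always see the same device names.
Only a rough sketch -- the controller name (mps0) and the bus/target numbers
below are placeholders, the real values have to be taken from each box's
dmesg:

hint.scbus.0.at="mps0"      # tie scbus0 to the HBA (placeholder driver/unit)
hint.da.32.at="scbus0"      # always attach this disk as da32 ...
hint.da.32.target="5"       # ... at target 5 on that bus (placeholder)
hint.da.32.unit="0"
hint.da.33.at="scbus0"
hint.da.33.target="6"       # placeholder target
hint.da.33.unit="0"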
da1,da2   - local
da32,da33 - external

Export/import is done WITHOUT removing/adding the local disks.

root@nfs1:/root # zpool status
  pool: jbod
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Wed Jun 26 13:14:55 2013
config:

        NAME          STATE     READ WRITE CKSUM
        jbod          ONLINE       0     0     0
          raidz3-0    ONLINE       0     0     0
            da10      ONLINE       0     0     0
            da11      ONLINE       0     0     0
            da12      ONLINE       0     0     0
            da13      ONLINE       0     0     0
            da14      ONLINE       0     0     0
            da15      ONLINE       0     0     0
            da16      ONLINE       0     0     0
            da17      ONLINE       0     0     0
            da18      ONLINE       0     0     0
            da19      ONLINE       0     0     0
        logs
          mirror-1    ONLINE       0     0     0
            da32s1    ONLINE       0     0     0
            da33s1    ONLINE       0     0     0
        cache
          da32s2      ONLINE       0     0     0
          da33s2      ONLINE       0     0     0
          da1         ONLINE       0     0     0
          da2         ONLINE       0     0     0

On 25 jun 2013, at 21:22, mxb wrote:

> I think I've found the root of this issue.
> Looks like "wiring down" disks the same way on both nodes (as suggested)
> fixes this issue.
>
> //mxb
>
> On 20 jun 2013, at 12:30, mxb wrote:
>
>> Well,
>>
>> I'm back to square one.
>>
>> After some uptime and successful import/export from one node to another,
>> I eventually got 'metadata corruption'.
>> I had no problem with import/export while e.g. rebooting the master node
>> (nfs1), but not THIS time.
>> Metadata got corrupted while rebooting the master node??
>>
>> Any ideas?
>>
>> [root@nfs1 ~]# zpool import
>>    pool: jbod
>>      id: 7663925948774378610
>>   state: FAULTED
>>  status: The pool metadata is corrupted.
>>  action: The pool cannot be imported due to damaged devices or data.
>>    see: http://illumos.org/msg/ZFS-8000-72
>>  config:
>>
>>         jbod          FAULTED  corrupted data
>>           raidz3-0    ONLINE
>>             da3       ONLINE
>>             da4       ONLINE
>>             da5       ONLINE
>>             da6       ONLINE
>>             da7       ONLINE
>>             da8       ONLINE
>>             da9       ONLINE
>>             da10      ONLINE
>>             da11      ONLINE
>>             da12      ONLINE
>>         cache
>>           da13s2
>>           da14s2
>>         logs
>>           mirror-1    ONLINE
>>             da13s1    ONLINE
>>             da14s1    ONLINE
>> [root@nfs1 ~]# zpool import jbod
>> cannot import 'jbod': I/O error
>>         Destroy and re-create the pool from
>>         a backup source.
>> [root@nfs1 ~]#
>>
>> On 11 jun 2013, at 10:46, mxb wrote:
>>
>>> Thanks to everyone who replied.
>>> Removing the local L2ARC cache disks (da1,da2) did indeed turn out to be
>>> a cure for my problem.
>>>
>>> Next is to test add/remove after import/export, as Jeremy suggested.
>>>
>>> //mxb
>>>
>>> On 7 jun 2013, at 01:34, Jeremy Chadwick wrote:
>>>
>>>> On Fri, Jun 07, 2013 at 12:51:14AM +0200, mxb wrote:
>>>>>
>>>>> Sure, the script is not perfect yet and does not handle a lot of
>>>>> things, but shifting the spotlight from zpool import/export to the
>>>>> script itself is not that clever, as this works most of the time.
>>>>>
>>>>> The question is WHY ZFS corrupts metadata when it should not. Sometimes.
>>>>> I've seen a stale zpool when manually importing/exporting the pool.
>>>>>
>>>>> On 7 jun 2013, at 00:39, Jeremy Chadwick wrote:
>>>>>
>>>>>> On Fri, Jun 07, 2013 at 12:12:39AM +0200, mxb wrote:
>>>>>>>
>>>>>>> Then the MASTER goes down and CARP on the second node becomes MASTER
>>>>>>> (devd.conf, and a script that does the lifting):
>>>>>>>
>>>>>>> root@nfs2:/root # cat /etc/devd.conf
>>>>>>>
>>>>>>> notify 30 {
>>>>>>>     match "system"    "IFNET";
>>>>>>>     match "subsystem" "carp0";
>>>>>>>     match "type"      "LINK_UP";
>>>>>>>     action "/etc/zfs_switch.sh active";
>>>>>>> };
>>>>>>>
>>>>>>> notify 30 {
>>>>>>>     match "system"    "IFNET";
>>>>>>>     match "subsystem" "carp0";
>>>>>>>     match "type"      "LINK_DOWN";
>>>>>>>     action "/etc/zfs_switch.sh backup";
>>>>>>> };
>>>>>>>
>>>>>>> root@nfs2:/root # cat /etc/zfs_switch.sh
>>>>>>> #!/bin/sh
>>>>>>>
>>>>>>> DATE=`date +%Y%m%d`
>>>>>>> HOSTNAME=`hostname`
>>>>>>>
>>>>>>> ZFS_POOL="jbod"
>>>>>>>
>>>>>>> case $1 in
>>>>>>>     active)
>>>>>>>         echo "Switching to ACTIVE and importing ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to ACTIVE' root
>>>>>>>         sleep 10
>>>>>>>         /sbin/zpool import -f jbod
>>>>>>>         /etc/rc.d/mountd restart
>>>>>>>         /etc/rc.d/nfsd restart
>>>>>>>         ;;
>>>>>>>     backup)
>>>>>>>         echo "Switching to BACKUP and exporting ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to BACKUP' root
>>>>>>>         /sbin/zpool export jbod
>>>>>>>         /etc/rc.d/mountd restart
>>>>>>>         /etc/rc.d/nfsd restart
>>>>>>>         ;;
>>>>>>>     *)
>>>>>>>         exit 0
>>>>>>>         ;;
>>>>>>> esac
>>>>>>>
>>>>>>> This works most of the time, but sometimes I'm forced to re-create
>>>>>>> the pool. Those machines are supposed to go into production.
>>>>>>> Losing the pool (and the data inside it) stops me from deploying this
>>>>>>> setup.
>>>>>>
>>>>>> This script looks highly error-prone.  Hasty hasty... :-)
>>>>>>
>>>>>> This script assumes that the "zpool" commands (import and export) always
>>>>>> work/succeed; there is no exit code ($?) checking being used.
>>>>>>
>>>>>> Since this is run from within devd(8): where does stdout/stderr go to
>>>>>> when running a program/script under devd(8)?  Does it effectively go
>>>>>> to the bit bucket (/dev/null)?  If so, you'd never know if the import or
>>>>>> export actually succeeded or not (the export sounds more likely to be
>>>>>> the problem point).
>>>>>>
>>>>>> I imagine there would be some situations where the export would fail
>>>>>> (some files on filesystems under pool "jbod" still in use), yet CARP is
>>>>>> already blindly assuming everything will be fantastic.  Surprise.
>>>>>>
>>>>>> I also do not know if devd.conf(5) "action" commands spawn a sub-shell
>>>>>> (/bin/sh) or not.  If they don't, you won't be able to use things like
>>>>>> 'action "/etc/zfs_switch.sh active >> /var/log/failover.log";'.  You
>>>>>> would then need to implement the equivalent of logging within your
>>>>>> zfs_switch.sh script.
>>>>>>
>>>>>> You may want to consider the -f flag to zpool import/export
>>>>>> (particularly export).  However, there are risks involved -- userland
>>>>>> applications which have an fd/fh open on a file which is stored on a
>>>>>> filesystem that has now completely disappeared can sometimes crash
>>>>>> (segfault) or behave very oddly (100% CPU usage, etc.) depending on how
>>>>>> they're designed.
>>>>>>
>>>>>> Basically what I'm trying to say is that devd(8) being used as a form of
>>>>>> HA (high availability) and load balancing is not always possible.
>>>>>> Real/true HA (especially with SANs) is often done very differently
>>>>>> (now you know why it's often proprietary. :-) )
>>>>
>>>> Add error checking to your script.  That's my first and foremost
>>>> recommendation.  It's not hard to do, really. :-)
>>>>
>>>> After you do that and still experience the issue (e.g. you see no actual
>>>> errors/issues during the export/import phases), I recommend removing
>>>> the "cache" devices which are "independent" on each system from the pool
>>>> entirely.  Quoting you (for readers, since I snipped it from my previous
>>>> reply):
>>>>
>>>>>>> Note that the ZIL (mirrored) resides on the external enclosure. Only
>>>>>>> the L2ARC is both local and external - da1, da2, da13s2, da14s2
>>>>
>>>> I interpret this to mean the primary and backup nodes (physical systems)
>>>> have actual disks which are not part of the "external enclosure".  If
>>>> that's the case -- those disks are always going to vary in their
>>>> contents and metadata.  Those are never going to be 100% identical all
>>>> the time (is this not obvious?).  I'm surprised your stuff has worked at
>>>> all using that model, honestly.
>>>>
>>>> ZFS is going to bitch/cry if it cannot verify the integrity of certain
>>>> things, all the way down to the L2ARC.  That's my understanding of it at
>>>> least, meaning there must always be "some" kind of metadata that has to
>>>> be kept/maintained there.
>>>>
>>>> Alternately you could try doing this:
>>>>
>>>> zpool remove jbod cache daX daY ...
>>>> zpool export jbod
>>>>
>>>> Then on the other system:
>>>>
>>>> zpool import jbod
>>>> zpool add jbod cache daX daY ...
>>>>
>>>> Where daX and daY are the disks which are independent to each system
>>>> (not on the "external enclosure").
>>>>
>>>> Finally, it would also be useful/worthwhile if you would provide
>>>> "dmesg" from both systems, and for you to explain the physical wiring
>>>> along with what device (e.g. daX) correlates with what exact thing on
>>>> each system.  (We right now have no knowledge of that, and your terse
>>>> explanations imply we do -- we need to know more.)
>>>>
>>>> --
>>>> | Jeremy Chadwick                                   jdc@koitsu.org |
>>>> | UNIX Systems Administrator               http://jdc.koitsu.org/ |
>>>> | Making life hard for others since 1977.            PGP 4BD6C0CB |
>>>>
>>>
>>
>
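P.S. For the archives, a rough sketch of what the exit-status checking and
logging suggested above could look like in zfs_switch.sh. Untested; the
log file path and the logger(1) tag are arbitrary choices, and the mail
notifications from the original script are left out for brevity:

#!/bin/sh
# Sketch of /etc/zfs_switch.sh with exit-status checking and logging.
# Assumptions: pool name "jbod" as in the original script; log path and
# logger tag are placeholders.

ZFS_POOL="jbod"
LOG="/var/log/failover.log"

log() {
    # devd(8) may swallow stdout/stderr, so log explicitly.
    echo "`date '+%Y-%m-%d %H:%M:%S'` $*" >> ${LOG}
    logger -t zfs_switch "$*"
}

case "$1" in
    active)
        log "CARP MASTER: importing ${ZFS_POOL}"
        sleep 10
        /sbin/zpool import -f ${ZFS_POOL}
        rc=$?
        if [ ${rc} -ne 0 ]; then
            log "ERROR: zpool import ${ZFS_POOL} failed (exit ${rc}); not restarting NFS"
            exit 1
        fi
        log "import of ${ZFS_POOL} succeeded"
        /etc/rc.d/mountd restart || log "WARNING: mountd restart failed"
        /etc/rc.d/nfsd restart   || log "WARNING: nfsd restart failed"
        ;;
    backup)
        log "CARP BACKUP: exporting ${ZFS_POOL}"
        /sbin/zpool export ${ZFS_POOL}
        rc=$?
        if [ ${rc} -ne 0 ]; then
            log "ERROR: zpool export ${ZFS_POOL} failed (exit ${rc}); pool may still be imported here"
            exit 1
        fi
        log "export of ${ZFS_POOL} succeeded"
        /etc/rc.d/mountd restart || log "WARNING: mountd restart failed"
        /etc/rc.d/nfsd restart   || log "WARNING: nfsd restart failed"
        ;;
    *)
        exit 0
        ;;
esac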