From owner-freebsd-fs@FreeBSD.ORG Thu Jun 20 10:30:36 2013
Subject: Re: zpool export/import on failover - The pool metadata is corrupted
From: mxb <mxb@alumni.chalmers.se>
To: Jeremy Chadwick
Cc: "freebsd-fs@freebsd.org"
Date: Thu, 20 Jun 2013 12:30:30 +0200
Message-Id: <09717048-12BE-474B-9B20-F5E72D00152E@alumni.chalmers.se>
In-Reply-To: <61E414CF-FCD3-42BB-9533-A40EA934DB99@alumni.chalmers.se>
References: <016B635E-4EDC-4CDF-AC58-82AC39CBFF56@alumni.chalmers.se>
 <20130606223911.GA45807@icarus.home.lan>
 <20130606233417.GA46506@icarus.home.lan>
 <61E414CF-FCD3-42BB-9533-A40EA934DB99@alumni.chalmers.se>
List-Id: Filesystems

Well, I'm back to square one.

After some uptime and successful import/export from one node to another,
I eventually got 'metadata corruption'.

I had no problem with import/export while, for example, rebooting the
master node (nfs1), but not THIS time. Metadata got corrupted while
rebooting the master node??

Any ideas?

[root@nfs1 ~]# zpool import
   pool: jbod
     id: 7663925948774378610
  state: FAULTED
 status: The pool metadata is corrupted.
 action: The pool cannot be imported due to damaged devices or data.
   see: http://illumos.org/msg/ZFS-8000-72
 config:

        jbod        FAULTED  corrupted data
          raidz3-0  ONLINE
            da3     ONLINE
            da4     ONLINE
            da5     ONLINE
            da6     ONLINE
            da7     ONLINE
            da8     ONLINE
            da9     ONLINE
            da10    ONLINE
            da11    ONLINE
            da12    ONLINE
        cache
          da13s2
          da14s2
        logs
          mirror-1  ONLINE
            da13s1  ONLINE
            da14s1  ONLINE

[root@nfs1 ~]# zpool import jbod
cannot import 'jbod': I/O error
        Destroy and re-create the pool from
        a backup source.
[root@nfs1 ~]#

On 11 jun 2013, at 10:46, mxb wrote:

> 
> Thanks, everyone who replied.
> Removing the local L2ARC cache disks (da1, da2) indeed turned out to be
> the cure for my problem.
> 
> Next is to test add/remove after import/export, as Jeremy suggested.
> 
> //mxb
> 
> On 7 jun 2013, at 01:34, Jeremy Chadwick wrote:
> 
>> On Fri, Jun 07, 2013 at 12:51:14AM +0200, mxb wrote:
>>> 
>>> Sure, the script is not perfect yet and does not handle a lot of things,
>>> but shifting the spotlight from zpool import/export onto the script
>>> itself is not that clever, as it works most of the time.
>>> 
>>> The question is WHY ZFS corrupts metadata when it should not. Sometimes.
>>> I've seen the zpool go stale when manually importing/exporting the pool.
>>> 
>>> 
>>> On 7 jun 2013, at 00:39, Jeremy Chadwick wrote:
>>> 
>>>> On Fri, Jun 07, 2013 at 12:12:39AM +0200, mxb wrote:
>>>>> 
>>>>> When the MASTER goes down, CARP on the second node goes MASTER
>>>>> (devd.conf, and a script for the lifting):
>>>>> 
>>>>> root@nfs2:/root # cat /etc/devd.conf
>>>>> 
>>>>> notify 30 {
>>>>>     match "system"     "IFNET";
>>>>>     match "subsystem"  "carp0";
>>>>>     match "type"       "LINK_UP";
>>>>>     action "/etc/zfs_switch.sh active";
>>>>> };
>>>>> 
>>>>> notify 30 {
>>>>>     match "system"     "IFNET";
>>>>>     match "subsystem"  "carp0";
>>>>>     match "type"       "LINK_DOWN";
>>>>>     action "/etc/zfs_switch.sh backup";
>>>>> };
>>>>> 
>>>>> root@nfs2:/root # cat /etc/zfs_switch.sh
>>>>> #!/bin/sh
>>>>> 
>>>>> DATE=`date +%Y%m%d`
>>>>> HOSTNAME=`hostname`
>>>>> 
>>>>> ZFS_POOL="jbod"
>>>>> 
>>>>> case $1 in
>>>>>     active)
>>>>>         echo "Switching to ACTIVE and importing ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to ACTIVE' root
>>>>>         sleep 10
>>>>>         /sbin/zpool import -f jbod
>>>>>         /etc/rc.d/mountd restart
>>>>>         /etc/rc.d/nfsd restart
>>>>>         ;;
>>>>>     backup)
>>>>>         echo "Switching to BACKUP and exporting ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to BACKUP' root
>>>>>         /sbin/zpool export jbod
>>>>>         /etc/rc.d/mountd restart
>>>>>         /etc/rc.d/nfsd restart
>>>>>         ;;
>>>>>     *)
>>>>>         exit 0
>>>>>         ;;
>>>>> esac
>>>>> 
>>>>> This works most of the time, but sometimes I'm forced to re-create
>>>>> the pool. Those machines are supposed to go into prod.
>>>>> Losing the pool (and the data inside it) stops me from deploying this setup.
>>>> 
>>>> This script looks highly error-prone. Hasty hasty... :-)
>>>> 
>>>> This script assumes that the "zpool" commands (import and export) always
>>>> work/succeed; there is no exit code ($?) checking being used.
>>>> 
>>>> Since this is run from within devd(8): where does stdout/stderr go to
>>>> when running a program/script under devd(8)? Does it effectively go
>>>> to the bit bucket (/dev/null)?
>>>> If so, you'd never know if the import or
>>>> export actually succeeded or not (the export sounds more likely to be
>>>> the problem point).
>>>> 
>>>> I imagine there would be some situations where the export would fail
>>>> (some files on filesystems under pool "jbod" still in use), yet CARP is
>>>> already blindly assuming everything will be fantastic. Surprise.
>>>> 
>>>> I also do not know if devd.conf(5) "action" commands spawn a sub-shell
>>>> (/bin/sh) or not. If they don't, you won't be able to use things like
>>>> 'action "/etc/zfs_switch.sh active >> /var/log/failover.log";'. You
>>>> would then need to implement the equivalent of logging within your
>>>> zfs_switch.sh script.
>>>> 
>>>> You may want to consider the -f flag to zpool import/export
>>>> (particularly export). However there are risks involved -- userland
>>>> applications which have an fd/fh open on a file which is stored on a
>>>> filesystem that has now completely disappeared can sometimes crash
>>>> (segfault) or behave very oddly (100% CPU usage, etc.) depending on how
>>>> they're designed.
>>>> 
>>>> Basically what I'm trying to say is that devd(8) being used as a form of
>>>> HA (high availability) and load balancing is not always possible.
>>>> Real/true HA (especially with SANs) is often done very differently (now
>>>> you know why it's often proprietary. :-) )
>> 
>> Add error checking to your script. That's my first and foremost
>> recommendation. It's not hard to do, really. :-)
>> 
>> After you do that and still experience the issue (e.g. you see no actual
>> errors/issues during the export/import phases), I recommend removing
>> the "cache" devices which are "independent" on each system from the pool
>> entirely. Quoting you (for readers, since I snipped it from my previous
>> reply):
>> 
>>>>> Note that the ZIL (mirrored) resides on the external enclosure.
>>>>> Only the L2ARC
>>>>> is both local and external - da1, da2, da13s2, da14s2
>> 
>> I interpret this to mean the primary and backup nodes (physical systems)
>> have actual disks which are not part of the "external enclosure". If
>> that's the case -- those disks are always going to vary in their
>> contents and metadata. Those are never going to be 100% identical all
>> the time (is this not obvious?). I'm surprised your stuff has worked at
>> all using that model, honestly.
>> 
>> ZFS is going to bitch/cry if it cannot verify the integrity of certain
>> things, all the way down to the L2ARC. That's my understanding of it at
>> least, meaning there must always be "some" kind of metadata that has to
>> be kept/maintained there.
>> 
>> Alternately you could try doing this:
>> 
>> zpool remove jbod cache daX daY ...
>> zpool export jbod
>> 
>> Then on the other system:
>> 
>> zpool import jbod
>> zpool add jbod cache daX daY ...
>> 
>> Where daX and daY are the disks which are independent to each system
>> (not on the "external enclosure").
>> 
>> Finally, it would also be useful/worthwhile if you would provide
>> "dmesg" from both systems, and for you to explain the physical wiring
>> along with what device (e.g. daX) correlates with what exact thing on
>> each system. (We right now have no knowledge of that, and your terse
>> explanations imply we do -- we need to know more.)
>> 
>> -- 
>> | Jeremy Chadwick                                   jdc@koitsu.org |
>> | UNIX Systems Administrator                http://jdc.koitsu.org/ |
>> | Making life hard for others since 1977.             PGP 4BD6C0CB |
>> 
> 
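
Jeremy's two suggestions -- checking every zpool exit code (with logging
inside the script, since devd(8) may discard stdout/stderr) and detaching
the node-local L2ARC disks before export, re-adding them after import --
could be combined into something like the sketch of /etc/zfs_switch.sh
below. This is only an illustration under assumptions from this thread
(pool "jbod", local cache disks da1/da2, log path /var/log/failover.log),
not a tested failover configuration:

```sh
#!/bin/sh
# Sketch of an error-checked /etc/zfs_switch.sh (hypothetical).

POOL="jbod"
LOCAL_CACHE="da1 da2"        # L2ARC disks local to THIS node (assumption)
LOG="/var/log/failover.log"  # devd(8) may eat stdout, so log ourselves

log() {
    echo "`date '+%Y%m%d %H:%M:%S'` $*" >> "$LOG"
}

case "$1" in
active)
    sleep 10
    /sbin/zpool import -f "$POOL" >> "$LOG" 2>&1
    rc=$?
    if [ "$rc" -ne 0 ]; then
        log "zpool import $POOL FAILED (exit $rc), staying passive"
        exit 1
    fi
    # Re-attach this node's own cache disks only after a successful import,
    # so the pool never carries a reference to disks the peer cannot see.
    /sbin/zpool add "$POOL" cache $LOCAL_CACHE >> "$LOG" 2>&1 \
        || log "zpool add cache failed (non-fatal)"
    /etc/rc.d/mountd restart
    /etc/rc.d/nfsd restart
    log "switched to ACTIVE"
    ;;
backup)
    # Drop the node-local cache disks before export for the same reason.
    /sbin/zpool remove "$POOL" $LOCAL_CACHE >> "$LOG" 2>&1
    /sbin/zpool export "$POOL" >> "$LOG" 2>&1
    rc=$?
    if [ "$rc" -ne 0 ]; then
        log "zpool export $POOL FAILED (exit $rc) -- NOT safe to fail over"
        exit 1
    fi
    /etc/rc.d/mountd restart
    /etc/rc.d/nfsd restart
    log "switched to BACKUP"
    ;;
esac
```

The point of the rc=$? checks is exactly what Jeremy describes: a failed
export (files still in use) must abort the switch instead of letting CARP
assume the pool moved cleanly.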