From owner-freebsd-fs@FreeBSD.ORG Tue Jun 25 19:22:49 2013
Subject: Re: zpool export/import on failover - The pool metadata is corrupted
From: mxb
Date: Tue, 25 Jun 2013 21:22:43 +0200
To: Jeremy Chadwick
Cc: "freebsd-fs@freebsd.org"
In-Reply-To: <09717048-12BE-474B-9B20-F5E72D00152E@alumni.chalmers.se>
Message-Id: <5A26ABDE-C7F2-41CC-A3D1-69310AB6BC36@alumni.chalmers.se>

I think I've found the root of this issue.
Looks like "wiring down" the disks the same way on both nodes (as suggested)
fixes this issue.
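
For reference, "wiring down" here means pinning the devices in
/boot/device.hints so the same physical disk always shows up with the same
daX number on both nodes. A rough example only -- the controller name (mps0)
and the bus/target numbers below are made up, adjust them to your own
hardware:

# pin the HBA to scbus0, then pin da3 to a fixed bus/target/lun
hint.scbus.0.at="mps0"
hint.da.3.at="scbus0"
hint.da.3.target="3"
hint.da.3.unit="0"
# ...one block like this per disk, identical on both nodes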

//mxb

On 20 jun 2013, at 12:30, mxb wrote:

> 
> Well,
> 
> I'm back to square one.
> 
> After some uptime and successful import/export from one node to another,
> I eventually got 'metadata corruption'.
> I had no problem with import/export while, for example, rebooting the
> master node (nfs1), but not THIS time.
> Metadata got corrupted while rebooting the master node??
> 
> Any ideas?
> 
> [root@nfs1 ~]# zpool import
>    pool: jbod
>      id: 7663925948774378610
>   state: FAULTED
>  status: The pool metadata is corrupted.
>  action: The pool cannot be imported due to damaged devices or data.
>     see: http://illumos.org/msg/ZFS-8000-72
>  config:
> 
>         jbod          FAULTED  corrupted data
>           raidz3-0    ONLINE
>             da3       ONLINE
>             da4       ONLINE
>             da5       ONLINE
>             da6       ONLINE
>             da7       ONLINE
>             da8       ONLINE
>             da9       ONLINE
>             da10      ONLINE
>             da11      ONLINE
>             da12      ONLINE
>         cache
>           da13s2
>           da14s2
>         logs
>           mirror-1    ONLINE
>             da13s1    ONLINE
>             da14s1    ONLINE
> [root@nfs1 ~]# zpool import jbod
> cannot import 'jbod': I/O error
>         Destroy and re-create the pool from
>         a backup source.
> [root@nfs1 ~]#
> 
> On 11 jun 2013, at 10:46, mxb wrote:
> 
>> 
>> Thanks to everyone who replied.
>> Removing the local L2ARC cache disks (da1, da2) indeed turned out to be
>> the cure for my problem.
>> 
>> Next is to test add/remove after import/export, as Jeremy suggested.
>> 
>> //mxb
>> 
>> On 7 jun 2013, at 01:34, Jeremy Chadwick wrote:
>> 
>>> On Fri, Jun 07, 2013 at 12:51:14AM +0200, mxb wrote:
>>>> 
>>>> Sure, the script is not perfect yet and does not handle a lot of
>>>> things, but shifting the spotlight from zpool import/export to the
>>>> script itself is not that clever, as this works most of the time.
>>>> 
>>>> The question is WHY ZFS corrupts metadata when it should not. Sometimes.
>>>> I've seen zpool stall when manually importing/exporting the pool.
>>>> 
>>>> 
>>>> On 7 jun 2013, at 00:39, Jeremy Chadwick wrote:
>>>> 
>>>>> On Fri, Jun 07, 2013 at 12:12:39AM +0200, mxb wrote:
>>>>>> 
>>>>>> When the MASTER goes down, CARP on the second node becomes MASTER
>>>>>> (devd.conf, and a script for the switchover):
>>>>>> 
>>>>>> root@nfs2:/root # cat /etc/devd.conf
>>>>>> 
>>>>>> notify 30 {
>>>>>>     match "system"    "IFNET";
>>>>>>     match "subsystem" "carp0";
>>>>>>     match "type"      "LINK_UP";
>>>>>>     action "/etc/zfs_switch.sh active";
>>>>>> };
>>>>>> 
>>>>>> notify 30 {
>>>>>>     match "system"    "IFNET";
>>>>>>     match "subsystem" "carp0";
>>>>>>     match "type"      "LINK_DOWN";
>>>>>>     action "/etc/zfs_switch.sh backup";
>>>>>> };
>>>>>> 
>>>>>> root@nfs2:/root # cat /etc/zfs_switch.sh
>>>>>> #!/bin/sh
>>>>>> 
>>>>>> DATE=`date +%Y%m%d`
>>>>>> HOSTNAME=`hostname`
>>>>>> 
>>>>>> ZFS_POOL="jbod"
>>>>>> 
>>>>>> case $1 in
>>>>>>     active)
>>>>>>         echo "Switching to ACTIVE and importing ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to ACTIVE' root
>>>>>>         sleep 10
>>>>>>         /sbin/zpool import -f jbod
>>>>>>         /etc/rc.d/mountd restart
>>>>>>         /etc/rc.d/nfsd restart
>>>>>>         ;;
>>>>>>     backup)
>>>>>>         echo "Switching to BACKUP and exporting ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to BACKUP' root
>>>>>>         /sbin/zpool export jbod
>>>>>>         /etc/rc.d/mountd restart
>>>>>>         /etc/rc.d/nfsd restart
>>>>>>         ;;
>>>>>>     *)
>>>>>>         exit 0
>>>>>>         ;;
>>>>>> esac
>>>>>> 
>>>>>> This works most of the time, but sometimes I'm forced to re-create
>>>>>> the pool. Those machines are supposed to go into production.
>>>>>> Losing the pool (and the data in it) stops me from deploying this
>>>>>> setup.
>>>>> 
>>>>> This script looks highly error-prone. Hasty hasty... :-)
>>>>> 
>>>>> This script assumes that the "zpool" commands (import and export)
>>>>> always work/succeed; there is no exit code ($?) checking being used.
>>>>> 
>>>>> Since this is run from within devd(8): where does stdout/stderr go to
>>>>> when running a program/script under devd(8)? Does it effectively go
>>>>> to the bit bucket (/dev/null)? If so, you'd never know if the import
>>>>> or export actually succeeded or not (the export sounds more likely to
>>>>> be the problem point).
>>>>> 
>>>>> I imagine there would be some situations where the export would fail
>>>>> (some files on filesystems under pool "jbod" still in use), yet CARP
>>>>> is already blindly assuming everything will be fantastic. Surprise.
>>>>> 
>>>>> I also do not know if devd.conf(5) "action" commands spawn a sub-shell
>>>>> (/bin/sh) or not. If they don't, you won't be able to use things like
>>>>> 'action "/etc/zfs_switch.sh active >> /var/log/failover.log";'. You
>>>>> would then need to implement the equivalent of logging within your
>>>>> zfs_switch.sh script.
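>>>>> 
>>>>> For illustration only, a rough (untested) sketch of what exit-status
>>>>> checking plus in-script logging could look like -- pool name and rc.d
>>>>> commands taken from your script; the log file path is just an example:
>>>>> 
>>>>> #!/bin/sh
>>>>> # /etc/zfs_switch.sh -- sketch: log everything and bail out if the
>>>>> # zpool import/export fails instead of blindly assuming success.
>>>>> LOG="/var/log/failover.log"
>>>>> POOL="jbod"
>>>>> 
>>>>> log() {
>>>>>     echo "`date '+%Y-%m-%d %H:%M:%S'` $*" >> $LOG
>>>>> }
>>>>> 
>>>>> case $1 in
>>>>> active)
>>>>>     log "CARP MASTER: importing $POOL"
>>>>>     sleep 10
>>>>>     /sbin/zpool import -f $POOL >> $LOG 2>&1
>>>>>     rc=$?
>>>>>     if [ $rc -ne 0 ]; then
>>>>>         log "import of $POOL FAILED, exit code $rc"
>>>>>         echo "import of $POOL failed, exit code $rc" | \
>>>>>             mail -s "`hostname`: failover import FAILED" root
>>>>>         exit 1
>>>>>     fi
>>>>>     /etc/rc.d/mountd restart >> $LOG 2>&1
>>>>>     /etc/rc.d/nfsd restart >> $LOG 2>&1
>>>>>     ;;
>>>>> backup)
>>>>>     log "CARP BACKUP: exporting $POOL"
>>>>>     /sbin/zpool export $POOL >> $LOG 2>&1
>>>>>     rc=$?
>>>>>     if [ $rc -ne 0 ]; then
>>>>>         log "export of $POOL FAILED, exit code $rc"
>>>>>         echo "export of $POOL failed, exit code $rc" | \
>>>>>             mail -s "`hostname`: failover export FAILED" root
>>>>>         exit 1
>>>>>     fi
>>>>>     /etc/rc.d/mountd restart >> $LOG 2>&1
>>>>>     /etc/rc.d/nfsd restart >> $LOG 2>&1
>>>>>     ;;
>>>>> *)
>>>>>     exit 0
>>>>>     ;;
>>>>> esac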
>>>>> 
>>>>> You may want to consider the -f flag to zpool import/export
>>>>> (particularly export). However, there are risks involved -- userland
>>>>> applications which have an fd/fh open on a file which is stored on a
>>>>> filesystem that has now completely disappeared can sometimes crash
>>>>> (segfault) or behave very oddly (100% CPU usage, etc.) depending on
>>>>> how they're designed.
>>>>> 
>>>>> Basically what I'm trying to say is that devd(8) being used as a form
>>>>> of HA (high availability) and load balancing is not always possible.
>>>>> Real/true HA (especially with SANs) is often done very differently
>>>>> (now you know why it's often proprietary. :-) )
>>> 
>>> Add error checking to your script. That's my first and foremost
>>> recommendation. It's not hard to do, really. :-)
>>> 
>>> After you do that and still experience the issue (e.g. you see no actual
>>> errors/issues during the export/import phases), I recommend removing
>>> the "cache" devices which are "independent" on each system from the pool
>>> entirely. Quoting you (for readers, since I snipped it from my previous
>>> reply):
>>> 
>>>>>> Note that the ZIL (mirrored) resides on the external enclosure. Only
>>>>>> the L2ARC is both local and external - da1, da2, da13s2, da14s2
>>> 
>>> I interpret this to mean the primary and backup nodes (physical systems)
>>> have actual disks which are not part of the "external enclosure". If
>>> that's the case -- those disks are always going to vary in their
>>> contents and metadata. Those are never going to be 100% identical all
>>> the time (is this not obvious?). I'm surprised your stuff has worked at
>>> all using that model, honestly.
>>> 
>>> ZFS is going to bitch/cry if it cannot verify the integrity of certain
>>> things, all the way down to the L2ARC. That's my understanding of it at
>>> least, meaning there must always be "some" kind of metadata that has to
>>> be kept/maintained there.
>>> 
>>> Alternately you could try doing this:
>>> 
>>> zpool remove jbod cache daX daY ...
>>> zpool export jbod
>>> 
>>> Then on the other system:
>>> 
>>> zpool import jbod
>>> zpool add jbod cache daX daY ...
>>> 
>>> Where daX and daY are the disks which are independent to each system
>>> (not on the "external enclosure").
>>> 
>>> Finally, it would also be useful/worthwhile if you would provide
>>> "dmesg" from both systems and for you to explain the physical wiring
>>> along with what device (e.g. daX) correlates with what exact thing on
>>> each system. (We right now have no knowledge of that, and your terse
>>> explanations imply we do -- we need to know more.)
>>> 
>>> -- 
>>> | Jeremy Chadwick                                   jdc@koitsu.org |
>>> | UNIX Systems Administrator               http://jdc.koitsu.org/ |
>>> | Making life hard for others since 1977.            PGP 4BD6C0CB |
>>> 
>> 
> 