From owner-freebsd-fs@FreeBSD.ORG Thu Jun 27 09:36:02 2013
From: mxb
To: Jeremy Chadwick
Cc: "freebsd-fs@freebsd.org"
Date: Thu, 27 Jun 2013 10:34:22 +0200
Subject: Re: zpool export/import on failover - The pool metadata is corrupted
Message-Id: <47B6A89F-6444-485A-88DD-69A9A93D9B3F@alumni.chalmers.se>
In-Reply-To: <5A26ABDE-C7F2-41CC-A3D1-69310AB6BC36@alumni.chalmers.se>
References: <016B635E-4EDC-4CDF-AC58-82AC39CBFF56@alumni.chalmers.se>
 <20130606223911.GA45807@icarus.home.lan>
 <20130606233417.GA46506@icarus.home.lan>
 <61E414CF-FCD3-42BB-9533-A40EA934DB99@alumni.chalmers.se>
 <09717048-12BE-474B-9B20-F5E72D00152E@alumni.chalmers.se>
 <5A26ABDE-C7F2-41CC-A3D1-69310AB6BC36@alumni.chalmers.se>

A note for the archives.

So far I have not experienced any problems with both local (per head unit)
and external (in the disk enclosure) caches while importing and exporting
my pool. The disks I use on both nodes are identical: same manufacturer,
size and model.
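For reference, the "wiring down" mentioned further down is done in
/boot/device.hints, pinning each da(4) unit number to a fixed
controller/bus/target so that both heads always see the same device names.
Only a rough sketch -- the controller name (mps0) and the bus/target numbers
below are placeholders, the real values have to be taken from each box's
dmesg:

hint.scbus.0.at="mps0"      # tie scbus0 to the HBA (placeholder driver/unit)
hint.da.32.at="scbus0"      # always attach this disk as da32 ...
hint.da.32.target="5"       # ... at target 5 on that bus (placeholder)
hint.da.32.unit="0"
hint.da.33.at="scbus0"
hint.da.33.target="6"       # placeholder target
hint.da.33.unit="0"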
da1,da2   - local
da32,da33 - external

Export/import is done WITHOUT removing/adding the local disks.

root@nfs1:/root # zpool status
  pool: jbod
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Wed Jun 26 13:14:55 2013
config:

        NAME          STATE     READ WRITE CKSUM
        jbod          ONLINE       0     0     0
          raidz3-0    ONLINE       0     0     0
            da10      ONLINE       0     0     0
            da11      ONLINE       0     0     0
            da12      ONLINE       0     0     0
            da13      ONLINE       0     0     0
            da14      ONLINE       0     0     0
            da15      ONLINE       0     0     0
            da16      ONLINE       0     0     0
            da17      ONLINE       0     0     0
            da18      ONLINE       0     0     0
            da19      ONLINE       0     0     0
        logs
          mirror-1    ONLINE       0     0     0
            da32s1    ONLINE       0     0     0
            da33s1    ONLINE       0     0     0
        cache
          da32s2      ONLINE       0     0     0
          da33s2      ONLINE       0     0     0
          da1         ONLINE       0     0     0
          da2         ONLINE       0     0     0

On 25 jun 2013, at 21:22, mxb wrote:

> I think I've found the root of this issue.
> Looks like "wiring down" disks the same way on both nodes (as suggested)
> fixes this issue.
>
> //mxb
>
> On 20 jun 2013, at 12:30, mxb wrote:
>
>> Well,
>>
>> I'm back to square one.
>>
>> After some uptime and successful import/export from one node to another,
>> I eventually got 'metadata corruption'.
>> I had no problem with import/export while e.g. rebooting the master node
>> (nfs1), but not THIS time.
>> Metadata got corrupted while rebooting the master node??
>>
>> Any ideas?
>>
>> [root@nfs1 ~]# zpool import
>>    pool: jbod
>>      id: 7663925948774378610
>>   state: FAULTED
>>  status: The pool metadata is corrupted.
>>  action: The pool cannot be imported due to damaged devices or data.
>>    see: http://illumos.org/msg/ZFS-8000-72
>>  config:
>>
>>         jbod          FAULTED  corrupted data
>>           raidz3-0    ONLINE
>>             da3       ONLINE
>>             da4       ONLINE
>>             da5       ONLINE
>>             da6       ONLINE
>>             da7       ONLINE
>>             da8       ONLINE
>>             da9       ONLINE
>>             da10      ONLINE
>>             da11      ONLINE
>>             da12      ONLINE
>>         cache
>>           da13s2
>>           da14s2
>>         logs
>>           mirror-1    ONLINE
>>             da13s1    ONLINE
>>             da14s1    ONLINE
>> [root@nfs1 ~]# zpool import jbod
>> cannot import 'jbod': I/O error
>>         Destroy and re-create the pool from
>>         a backup source.
>> [root@nfs1 ~]#
>>
>> On 11 jun 2013, at 10:46, mxb wrote:
>>
>>> Thanks to everyone who replied.
>>> Removing the local L2ARC cache disks (da1,da2) did indeed turn out to be
>>> a cure for my problem.
>>>
>>> Next is to test add/remove after import/export, as Jeremy suggested.
>>>
>>> //mxb
>>>
>>> On 7 jun 2013, at 01:34, Jeremy Chadwick wrote:
>>>
>>>> On Fri, Jun 07, 2013 at 12:51:14AM +0200, mxb wrote:
>>>>>
>>>>> Sure, the script is not perfect yet and does not handle a lot of
>>>>> things, but shifting the spotlight from zpool import/export to the
>>>>> script itself is not that clever, as this works most of the time.
>>>>>
>>>>> The question is WHY ZFS corrupts metadata when it should not. Sometimes.
>>>>> I've seen a stale zpool when manually importing/exporting the pool.
>>>>>
>>>>> On 7 jun 2013, at 00:39, Jeremy Chadwick wrote:
>>>>>
>>>>>> On Fri, Jun 07, 2013 at 12:12:39AM +0200, mxb wrote:
>>>>>>>
>>>>>>> Then the MASTER goes down and CARP on the second node becomes MASTER
>>>>>>> (devd.conf, and a script that does the lifting):
>>>>>>>
>>>>>>> root@nfs2:/root # cat /etc/devd.conf
>>>>>>>
>>>>>>> notify 30 {
>>>>>>>     match "system"    "IFNET";
>>>>>>>     match "subsystem" "carp0";
>>>>>>>     match "type"      "LINK_UP";
>>>>>>>     action "/etc/zfs_switch.sh active";
>>>>>>> };
>>>>>>>
>>>>>>> notify 30 {
>>>>>>>     match "system"    "IFNET";
>>>>>>>     match "subsystem" "carp0";
>>>>>>>     match "type"      "LINK_DOWN";
>>>>>>>     action "/etc/zfs_switch.sh backup";
>>>>>>> };
>>>>>>>
>>>>>>> root@nfs2:/root # cat /etc/zfs_switch.sh
>>>>>>> #!/bin/sh
>>>>>>>
>>>>>>> DATE=`date +%Y%m%d`
>>>>>>> HOSTNAME=`hostname`
>>>>>>>
>>>>>>> ZFS_POOL="jbod"
>>>>>>>
>>>>>>> case $1 in
>>>>>>>     active)
>>>>>>>         echo "Switching to ACTIVE and importing ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to ACTIVE' root
>>>>>>>         sleep 10
>>>>>>>         /sbin/zpool import -f jbod
>>>>>>>         /etc/rc.d/mountd restart
>>>>>>>         /etc/rc.d/nfsd restart
>>>>>>>         ;;
>>>>>>>     backup)
>>>>>>>         echo "Switching to BACKUP and exporting ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to BACKUP' root
>>>>>>>         /sbin/zpool export jbod
>>>>>>>         /etc/rc.d/mountd restart
>>>>>>>         /etc/rc.d/nfsd restart
>>>>>>>         ;;
>>>>>>>     *)
>>>>>>>         exit 0
>>>>>>>         ;;
>>>>>>> esac
>>>>>>>
>>>>>>> This works most of the time, but sometimes I'm forced to re-create
>>>>>>> the pool. Those machines are supposed to go into production.
>>>>>>> Losing the pool (and the data inside it) stops me from deploying this
>>>>>>> setup.
>>>>>>
>>>>>> This script looks highly error-prone.  Hasty hasty... :-)
>>>>>>
>>>>>> This script assumes that the "zpool" commands (import and export) always
>>>>>> work/succeed; there is no exit code ($?) checking being used.
>>>>>>
>>>>>> Since this is run from within devd(8): where does stdout/stderr go to
>>>>>> when running a program/script under devd(8)?  Does it effectively go
>>>>>> to the bit bucket (/dev/null)?  If so, you'd never know if the import or
>>>>>> export actually succeeded or not (the export sounds more likely to be
>>>>>> the problem point).
>>>>>>
>>>>>> I imagine there would be some situations where the export would fail
>>>>>> (some files on filesystems under pool "jbod" still in use), yet CARP is
>>>>>> already blindly assuming everything will be fantastic.  Surprise.
>>>>>>
>>>>>> I also do not know if devd.conf(5) "action" commands spawn a sub-shell
>>>>>> (/bin/sh) or not.  If they don't, you won't be able to use things like
>>>>>> 'action "/etc/zfs_switch.sh active >> /var/log/failover.log";'.  You
>>>>>> would then need to implement the equivalent of logging within your
>>>>>> zfs_switch.sh script.
>>>>>>
>>>>>> You may want to consider the -f flag to zpool import/export
>>>>>> (particularly export).  However, there are risks involved -- userland
>>>>>> applications which have an fd/fh open on a file which is stored on a
>>>>>> filesystem that has now completely disappeared can sometimes crash
>>>>>> (segfault) or behave very oddly (100% CPU usage, etc.) depending on how
>>>>>> they're designed.
>>>>>>
>>>>>> Basically what I'm trying to say is that devd(8) being used as a form of
>>>>>> HA (high availability) and load balancing is not always possible.
>>>>>> Real/true HA (especially with SANs) is often done very differently
>>>>>> (now you know why it's often proprietary. :-) )
>>>>
>>>> Add error checking to your script.  That's my first and foremost
>>>> recommendation.  It's not hard to do, really. :-)
>>>>
>>>> After you do that and still experience the issue (e.g. you see no actual
>>>> errors/issues during the export/import phases), I recommend removing
>>>> the "cache" devices which are "independent" on each system from the pool
>>>> entirely.  Quoting you (for readers, since I snipped it from my previous
>>>> reply):
>>>>
>>>>>>> Note that the ZIL (mirrored) resides on the external enclosure. Only
>>>>>>> the L2ARC is both local and external - da1, da2, da13s2, da14s2
>>>>
>>>> I interpret this to mean the primary and backup nodes (physical systems)
>>>> have actual disks which are not part of the "external enclosure".  If
>>>> that's the case -- those disks are always going to vary in their
>>>> contents and metadata.  Those are never going to be 100% identical all
>>>> the time (is this not obvious?).  I'm surprised your stuff has worked at
>>>> all using that model, honestly.
>>>>
>>>> ZFS is going to bitch/cry if it cannot verify the integrity of certain
>>>> things, all the way down to the L2ARC.  That's my understanding of it at
>>>> least, meaning there must always be "some" kind of metadata that has to
>>>> be kept/maintained there.
>>>>
>>>> Alternately you could try doing this:
>>>>
>>>> zpool remove jbod cache daX daY ...
>>>> zpool export jbod
>>>>
>>>> Then on the other system:
>>>>
>>>> zpool import jbod
>>>> zpool add jbod cache daX daY ...
>>>>
>>>> Where daX and daY are the disks which are independent to each system
>>>> (not on the "external enclosure").
>>>>
>>>> Finally, it would also be useful/worthwhile if you would provide
>>>> "dmesg" from both systems, and for you to explain the physical wiring
>>>> along with what device (e.g. daX) correlates with what exact thing on
>>>> each system.  (We right now have no knowledge of that, and your terse
>>>> explanations imply we do -- we need to know more.)
>>>>
>>>> --
>>>> | Jeremy Chadwick                                   jdc@koitsu.org |
>>>> | UNIX Systems Administrator               http://jdc.koitsu.org/ |
>>>> | Making life hard for others since 1977.            PGP 4BD6C0CB |
>>>>
>>>
>>
>
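P.S. For the archives, a rough sketch of what the exit-status checking and
logging suggested above could look like in zfs_switch.sh. Untested; the
log file path and the logger(1) tag are arbitrary choices, and the mail
notifications from the original script are left out for brevity:

#!/bin/sh
# Sketch of /etc/zfs_switch.sh with exit-status checking and logging.
# Assumptions: pool name "jbod" as in the original script; log path and
# logger tag are placeholders.

ZFS_POOL="jbod"
LOG="/var/log/failover.log"

log() {
    # devd(8) may swallow stdout/stderr, so log explicitly.
    echo "`date '+%Y-%m-%d %H:%M:%S'` $*" >> ${LOG}
    logger -t zfs_switch "$*"
}

case "$1" in
    active)
        log "CARP MASTER: importing ${ZFS_POOL}"
        sleep 10
        /sbin/zpool import -f ${ZFS_POOL}
        rc=$?
        if [ ${rc} -ne 0 ]; then
            log "ERROR: zpool import ${ZFS_POOL} failed (exit ${rc}); not restarting NFS"
            exit 1
        fi
        log "import of ${ZFS_POOL} succeeded"
        /etc/rc.d/mountd restart || log "WARNING: mountd restart failed"
        /etc/rc.d/nfsd restart   || log "WARNING: nfsd restart failed"
        ;;
    backup)
        log "CARP BACKUP: exporting ${ZFS_POOL}"
        /sbin/zpool export ${ZFS_POOL}
        rc=$?
        if [ ${rc} -ne 0 ]; then
            log "ERROR: zpool export ${ZFS_POOL} failed (exit ${rc}); pool may still be imported here"
            exit 1
        fi
        log "export of ${ZFS_POOL} succeeded"
        /etc/rc.d/mountd restart || log "WARNING: mountd restart failed"
        /etc/rc.d/nfsd restart   || log "WARNING: nfsd restart failed"
        ;;
    *)
        exit 0
        ;;
esac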