FreeBSD Mail Archives

Date:      Tue, 25 Jun 2013 21:22:43 +0200
From:      mxb <mxb@alumni.chalmers.se>
To:        Jeremy Chadwick <jdc@koitsu.org>
Cc:        "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject:   Re: zpool export/import on failover - The pool metadata is corrupted
Message-ID:  <5A26ABDE-C7F2-41CC-A3D1-69310AB6BC36@alumni.chalmers.se>
In-Reply-To: <09717048-12BE-474B-9B20-F5E72D00152E@alumni.chalmers.se>
References:  <D7F099CB-855F-43F8-ACB5-094B93201B4B@alumni.chalmers.se> <CAKYr3zyPLpLau8xsv3fCkYrpJVzS0tXkyMn4E2aLz29EMBF9cA@mail.gmail.com> <016B635E-4EDC-4CDF-AC58-82AC39CBFF56@alumni.chalmers.se> <20130606223911.GA45807@icarus.home.lan> <C3FC39B3-D09F-4E73-9476-3BFC8B817278@alumni.chalmers.se> <20130606233417.GA46506@icarus.home.lan> <61E414CF-FCD3-42BB-9533-A40EA934DB99@alumni.chalmers.se> <09717048-12BE-474B-9B20-F5E72D00152E@alumni.chalmers.se>


I think I've found the root of this issue.
It looks like "wiring down" the disks the same way on both nodes (as suggested)
fixes it.

//mxb

On 20 jun 2013, at 12:30, mxb <mxb@alumni.chalmers.se> wrote:

>
> Well,
>
> I'm back to square one.
>
> After some uptime and successful import/export from one node to another, I eventually got 'metadata corruption'.
> I had no problem with import/export while, for example, rebooting the master node (nfs1), but not THIS time.
> Metadata got corrupted while rebooting the master node??
>
> Any ideas?
>
> [root@nfs1 ~]# zpool import
>   pool: jbod
>     id: 7663925948774378610
>  state: FAULTED
> status: The pool metadata is corrupted.
> action: The pool cannot be imported due to damaged devices or data.
>   see: http://illumos.org/msg/ZFS-8000-72
> config:
>
> 	jbod        FAULTED  corrupted data
> 	  raidz3-0  ONLINE
> 	    da3     ONLINE
> 	    da4     ONLINE
> 	    da5     ONLINE
> 	    da6     ONLINE
> 	    da7     ONLINE
> 	    da8     ONLINE
> 	    da9     ONLINE
> 	    da10    ONLINE
> 	    da11    ONLINE
> 	    da12    ONLINE
> 	cache
> 	  da13s2
> 	  da14s2
> 	logs
> 	  mirror-1  ONLINE
> 	    da13s1  ONLINE
> 	    da14s1  ONLINE
> [root@nfs1 ~]# zpool import jbod
> cannot import 'jbod': I/O error
> 	Destroy and re-create the pool from
> 	a backup source.
> [root@nfs1 ~]#
>
> On 11 jun 2013, at 10:46, mxb <mxb@alumni.chalmers.se> wrote:
>
>>
>> Thanks to everyone who replied.
>> Removing the local L2ARC cache disks (da1, da2) indeed proved to be a cure for my problem.
>>
>> Next is to test with add/remove after import/export, as Jeremy suggested.
>>
>> //mxb
>>
>> On 7 jun 2013, at 01:34, Jeremy Chadwick <jdc@koitsu.org> wrote:
>>
>>> On Fri, Jun 07, 2013 at 12:51:14AM +0200, mxb wrote:
>>>>
>>>> Sure, the script is not perfect yet and does not handle many things, but shifting the focus from zpool import/export to the script itself is not that
>>>> clever, as this works most of the time.
>>>>
>>>> The question is WHY ZFS corrupts metadata when it should not. Sometimes.
>>>> I've seen zpool stall when manually importing/exporting the pool.
>>>>
>>>>
>>>> On 7 jun 2013, at 00:39, Jeremy Chadwick <jdc@koitsu.org> wrote:
>>>>
>>>>> On Fri, Jun 07, 2013 at 12:12:39AM +0200, mxb wrote:
>>>>>>
>>>>>> When MASTER goes down, CARP on the second node goes MASTER (devd.conf, and script for lifting):
>>>>>>
>>>>>> root@nfs2:/root # cat /etc/devd.conf
>>>>>>=20
>>>>>>=20
>>>>>> notify 30 {
>>>>>> 	match "system"		"IFNET";
>>>>>> 	match "subsystem"	"carp0";
>>>>>> 	match "type"		"LINK_UP";
>>>>>> 	action "/etc/zfs_switch.sh active";
>>>>>> };
>>>>>>
>>>>>> notify 30 {
>>>>>> 	match "system"		"IFNET";
>>>>>> 	match "subsystem"	"carp0";
>>>>>> 	match "type"		"LINK_DOWN";
>>>>>> 	action "/etc/zfs_switch.sh backup";
>>>>>> };
>>>>>>
>>>>>> root@nfs2:/root # cat /etc/zfs_switch.sh
>>>>>> #!/bin/sh
>>>>>>
>>>>>> DATE=`date +%Y%m%d`
>>>>>> HOSTNAME=`hostname`
>>>>>>
>>>>>> ZFS_POOL="jbod"
>>>>>>
>>>>>>
>>>>>> case $1 in
>>>>>> 	active)
>>>>>> 		echo "Switching to ACTIVE and importing ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to ACTIVE' root
>>>>>> 		sleep 10
>>>>>> 		/sbin/zpool import -f jbod
>>>>>> 		/etc/rc.d/mountd restart
>>>>>> 		/etc/rc.d/nfsd restart
>>>>>> 		;;
>>>>>> 	backup)
>>>>>> 		echo "Switching to BACKUP and exporting ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to BACKUP' root
>>>>>> 		/sbin/zpool export jbod
>>>>>> 		/etc/rc.d/mountd restart
>>>>>> 		/etc/rc.d/nfsd restart
>>>>>> 		;;
>>>>>> 	*)
>>>>>> 		exit 0
>>>>>> 		;;
>>>>>> esac
>>>>>>
>>>>>> This works most of the time, but sometimes I'm forced to re-create the pool. These machines are supposed to go into production.
>>>>>> Losing the pool (and the data inside it) stops me from deploying this setup.
>>>>>
>>>>> This script looks highly error-prone.  Hasty hasty...  :-)
>>>>>
>>>>> This script assumes that the "zpool" commands (import and export) always
>>>>> work/succeed; there is no exit code ($?) checking being used.
>>>>>
>>>>> Since this is run from within devd(8): where does stdout/stderr go to
>>>>> when running a program/script under devd(8)?  Does it effectively go
>>>>> to the bit bucket (/dev/null)?  If so, you'd never know if the import or
>>>>> export actually succeeded or not (the export sounds more likely to be
>>>>> the problem point).
>>>>>
>>>>> I imagine there would be some situations where the export would fail
>>>>> (some files on filesystems under pool "jbod" still in use), yet CARP is
>>>>> already blindly assuming everything will be fantastic.  Surprise.
>>>>>
>>>>> I also do not know if devd.conf(5) "action" commands spawn a sub-shell
>>>>> (/bin/sh) or not.  If they don't, you won't be able to use things like
>>>>> 'action "/etc/zfs_switch.sh active >> /var/log/failover.log";'.  You
>>>>> would then need to implement the equivalent of logging within your
>>>>> zfs_switch.sh script.
>>>>>
>>>>> You may want to consider the -f flag to zpool import/export
>>>>> (particularly export).  However there are risks involved -- userland
>>>>> applications which have an fd/fh open on a file which is stored on a
>>>>> filesystem that has now completely disappeared can sometimes crash
>>>>> (segfault) or behave very oddly (100% CPU usage, etc.) depending on how
>>>>> they're designed.
>>>>>
>>>>> Basically what I'm trying to say is that devd(8) being used as a form of
>>>>> HA (high availability) and load balancing is not always possible.
>>>>> Real/true HA (especially with SANs) is often done very differently (now
>>>>> you know why it's often proprietary.  :-) )
>>>
>>> Add error checking to your script.  That's my first and foremost
>>> recommendation.  It's not hard to do, really.  :-)
>>>
>>> After you do that and still experience the issue (e.g. you see no actual
>>> errors/issues during the export/import phases), I recommend removing
>>> the "cache" devices which are "independent" on each system from the pool
>>> entirely.  Quoting you (for readers, since I snipped it from my previous
>>> reply):
>>>
>>>>>> Note, that ZIL(mirrored) resides on external enclosure. Only L2ARC
>>>>>> is both local and external - da1,da2, da13s2, da14s2
>>>
>>> I interpret this to mean the primary and backup nodes (physical systems)
>>> have actual disks which are not part of the "external enclosure".  If
>>> that's the case -- those disks are always going to vary in their
>>> contents and metadata.  Those are never going to be 100% identical all
>>> the time (is this not obvious?).  I'm surprised your stuff has worked at
>>> all using that model, honestly.
>>>
>>> ZFS is going to bitch/cry if it cannot verify the integrity of certain
>>> things, all the way down to the L2ARC.  That's my understanding of it at
>>> least, meaning there must always be "some" kind of metadata that has to
>>> be kept/maintained there.
>>>
>>> Alternately you could try doing this:
>>>
>>> zpool remove jbod daX daY ...
>>> zpool export jbod
>>>
>>> Then on the other system:
>>>
>>> zpool import jbod
>>> zpool add jbod cache daX daY ...
>>>
>>> Where daX and daY are the disks which are independent to each system
>>> (not on the "external enclosure").
>>>
>>> Finally, it would also be useful/worthwhile if you would provide
>>> "dmesg" from both systems and for you to explain the physical wiring
>>> along with what device (e.g. daX) correlates with what exact thing on
>>> each system.  (We right now have no knowledge of that, and your terse
>>> explanations imply we do -- we need to know more)
>>>
>>> --
>>> | Jeremy Chadwick                                   jdc@koitsu.org |
>>> | UNIX Systems Administrator                http://jdc.koitsu.org/ |
>>> | Making life hard for others since 1977.             PGP 4BD6C0CB |
>>>
>>
>
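For reference, the error checking and in-script logging recommended in the quoted replies could look roughly like the sketch below. It is not the script that was actually deployed. LOG, ZPOOL, run_logged and switch_role are hypothetical names introduced here; the pool name, the forced import and the rc.d restarts are taken from the quoted zfs_switch.sh. Logging is done inside the script because the thread leaves open where devd(8) sends stdout/stderr.

```shell
#!/bin/sh
# Sketch only. LOG, ZPOOL, run_logged and switch_role are hypothetical names;
# the pool name and the rc.d restarts come from the quoted zfs_switch.sh.

ZFS_POOL="jbod"
ZPOOL="${ZPOOL:-/sbin/zpool}"
LOG="${LOG:-/var/log/failover.log}"

# Run a command, append its stdout and stderr to $LOG, and return its exit
# status so the caller can stop at the first failure.
run_logged() {
	echo "$(date '+%Y-%m-%d %H:%M:%S') run: $*" >> "$LOG"
	"$@" >> "$LOG" 2>&1
	rc=$?
	if [ "$rc" -ne 0 ]; then
		echo "$(date '+%Y-%m-%d %H:%M:%S') FAILED (exit $rc): $*" >> "$LOG"
	fi
	return "$rc"
}

# Same steps as the original script, but every step is checked and the
# function returns non-zero as soon as one of them fails.
switch_role() {
	case "$1" in
	active)
		# -f is kept from the original script; a forced import is risky
		# if the other node still has the pool imported.
		run_logged "$ZPOOL" import -f "$ZFS_POOL" || return 1
		;;
	backup)
		run_logged "$ZPOOL" export "$ZFS_POOL" || return 1
		;;
	*)
		return 0
		;;
	esac
	run_logged /etc/rc.d/mountd restart || return 1
	run_logged /etc/rc.d/nfsd restart || return 1
}
```

The script invoked from devd.conf would then end with something like: switch_role "$1" || exit 1. A failed export or import shows up in the log instead of being silently ignored.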

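The remove/export and import/add sequence suggested in the quoted reply can be wrapped the same way. Again this is only a sketch under assumptions: ZPOOL, LOCAL_CACHE, release_pool and take_pool are hypothetical names, and da1/da2 are the node-local L2ARC disks mentioned earlier in the thread, which may be numbered differently on the other node. Note that zpool remove takes the device names directly, while zpool add needs the "cache" keyword.

```shell
#!/bin/sh
# Sketch only. ZPOOL, LOCAL_CACHE, release_pool and take_pool are hypothetical
# names; da1 and da2 are the node-local L2ARC disks named in the thread.

ZFS_POOL="jbod"
ZPOOL="${ZPOOL:-/sbin/zpool}"
LOCAL_CACHE="${LOCAL_CACHE:-da1 da2}"

# On the node giving up the pool: drop the node-local cache devices first,
# then export. Stop if the remove fails so the pool is not exported with
# local devices still attached.
release_pool() {
	# $LOCAL_CACHE is left unquoted on purpose so each disk is its own argument.
	"$ZPOOL" remove "$ZFS_POOL" $LOCAL_CACHE || return 1
	"$ZPOOL" export "$ZFS_POOL"
}

# On the node taking over: import first, then attach this node's own
# cache devices.
take_pool() {
	"$ZPOOL" import "$ZFS_POOL" || return 1
	"$ZPOOL" add "$ZFS_POOL" cache $LOCAL_CACHE
}
```

Setting ZPOOL=echo before calling either function prints the commands instead of running them, which is a cheap way to check the sequence before trying it on a real pool.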
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?5A26ABDE-C7F2-41CC-A3D1-69310AB6BC36>
