Date:      Wed, 26 Sep 2012 11:42:07 +0300
From:      Mikolaj Golub <trociny@FreeBSD.org>
To:        "Jose A. Lombera" <jose@lajni.com>
Cc:        freebsd-current@freebsd.org
Subject:   Re: zpool can't bring online disk2 ----I screwed up
Message-ID:  <20120926084031.GA6578@gmail.com>
In-Reply-To: <013101cd9a18$8106ede0$8314c9a0$@lajni.com>
References:  <013101cd9a18$8106ede0$8314c9a0$@lajni.com>

On Sun, Sep 23, 2012 at 10:50:28PM -0700, Jose A. Lombera wrote:

> This is the error I got when I run the failover script.
> 
>  
> 
> Sep 24 06:43:39 san1 hastd[3404]: [disk3] (primary) Provider /dev/mfid3 is not part of resource disk3.
> 
> Sep 24 06:43:39 san1 hastd[3343]: [disk3] (primary) Worker process exited ungracefully (pid=3404, exitcode=66).
> 
> Sep 24 06:43:39 san1 hastd[3413]: [disk6] (primary) Provider /dev/mfid6 is not part of resource disk6.
> 
> Sep 24 06:43:39 san1 hastd[3343]: [disk6] (primary) Worker process exited ungracefully (pid=3413, exitcode=66).
> 
> Sep 24 06:43:39 san1 hastd[3425]: [disk10] (primary) Unable to open /dev/mfid10: No such file or directory.
> 
> Sep 24 06:43:39 san1 hastd[3407]: [disk4] (primary) Provider /dev/mfid4 is not part of resource disk4.

It looks like your disk numbering has changed; your other email
confirms this. You should then update hast.conf accordingly.
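
As a rough first check, something like the following could list the
"local" provider paths named in hast.conf and flag any that do not
exist on this host. This is only a sketch: the function names are made
up for illustration, it assumes each "local" line carries a full path,
and it also lists the paths from the other node's "on" sections. It
would catch a missing device such as /dev/mfid10, but it cannot tell
whether an existing device now belongs to a different resource.

```sh
#!/bin/sh
# Hypothetical helpers, not part of HAST.

# Print each distinct "local" provider path found in a hast.conf-style file.
list_local_providers() {
    awk '$1 == "local" { print $2 }' "$1" | sort -u
}

# Report whether each listed provider path exists on this host.
check_providers() {
    conf=$1
    if [ ! -r "$conf" ]; then
        echo "cannot read $conf" >&2
        return 1
    fi
    for dev in $(list_local_providers "$conf"); do
        if [ -e "$dev" ]; then
            printf '%-8s %s\n' ok "$dev"
        else
            printf '%-8s %s\n' missing "$dev"
        fi
    done
}
```

After sourcing these functions, it would be run as
"check_providers /etc/hast.conf" on each host.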

> Sep 24 06:43:40 san1 hastd[3351]: [disk2] (primary) Resource unique ID mismatch (primary=2635341666474957411, secondary=5944493181984227803).
> 
> Sep 24 06:43:45 san1 hastd[3348]: [disk1] (primary) Split-brain condition!
> 
> Sep 24 06:43:50 san1 hastd[3351]: [disk2] (primary) Resource unique ID mismatch (primary=2635341666474957411, secondary=5944493181984227803).
> 
> Sep 24 06:43:55 san1 hastd[3348]: [disk1] (primary) Split-brain condition!

Split-brain can only be fixed manually, by deciding which host contains
the current data and recreating the HAST resources (disk1 and disk2 in
this case) on the other host.

The simplest way to recover from your situation is the following.

Suppose host A is the host where the disk was changed and things got
messed up, and host B is the "good" host.

1) Disable automatic failover, if you have any.
2) On host A, set all HAST resources to init.
3) On host B, set all HAST resources to primary.
4) On host B, import the pool and check that it works and that your
   data is there.
5) On host A, recreate the HAST resources (hastctl create disk1...).
6) On host A, change the role to secondary for all HAST
   resources. Synchronization should start.
7) Wait until synchronization is complete, checking hastctl status on
   host B (the primary).
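
The steps above could be sketched as commands roughly like this. The
script only prints the plan, each line prefixed with the host it
belongs on; it does not run anything. The pool name "tank" and the
two-entry resource list are placeholders (use your own pool name and
the full resource list from your hast.conf), and the plan() and
recovery_plan() helpers are made up for illustration. Step 1 is
site-specific, so it appears only as a comment.

```sh
#!/bin/sh
# Placeholders: adjust to your setup.
RESOURCES="disk1 disk2"   # extend to every resource in your hast.conf
POOL="tank"               # hypothetical pool name

# Print one command of the plan, tagged with the host to run it on.
plan() {
    host=$1
    shift
    echo "[$host] $*"
}

recovery_plan() {
    # 1) Disable automatic failover first (depends on your failover setup).
    plan A hastctl role init all          # 2) host A: all resources to init
    plan B hastctl role primary all       # 3) host B: all resources to primary
    plan B zpool import "$POOL"           # 4) host B: import the pool ...
    plan B zpool status "$POOL"           #    ... and check it is healthy
    for r in $RESOURCES; do
        plan A hastctl create "$r"        # 5) host A: recreate each resource
    done
    plan A hastctl role secondary all     # 6) host A: become secondary
    plan B hastctl status all             # 7) host B: watch the sync progress
}

recovery_plan
```

Review the printed commands before typing any of them on the
corresponding host.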

After this you can switch the pool back to host A if you want and
re-enable automatic failover.

Actually, you can switch to host A without waiting for the
synchronization to complete. It will work, but read requests will go
to the remote host B until the synchronization is complete, so I would
not do this unless there is a good reason.

It might be possible to recover faster, without recreating and
resyncing all devices, depending on how things got messed up: fix the
disk numbering in hast.conf and recreate and resync only the resources
in split-brain state. But that would require more manual work, careful
investigation of the logs, and a good understanding of what you are
doing.

-- 
Mikolaj Golub


