Date: Wed, 26 Sep 2012 11:42:07 +0300
From: Mikolaj Golub <trociny@FreeBSD.org>
To: "Jose A. Lombera" <jose@lajni.com>
Cc: freebsd-current@freebsd.org
Subject: Re: zpool can't bring online disk2 ----I screwed up
Message-ID: <20120926084031.GA6578@gmail.com>
In-Reply-To: <013101cd9a18$8106ede0$8314c9a0$@lajni.com>
References: <013101cd9a18$8106ede0$8314c9a0$@lajni.com>

On Sun, Sep 23, 2012 at 10:50:28PM -0700, Jose A. Lombera wrote:

> This is the error I got when I run the failover script.
>
> Sep 24 06:43:39 san1 hastd[3404]: [disk3] (primary) Provider /dev/mfid3 is not part of resource disk3.
> Sep 24 06:43:39 san1 hastd[3343]: [disk3] (primary) Worker process exited ungracefully (pid=3404, exitcode=66).
> Sep 24 06:43:39 san1 hastd[3413]: [disk6] (primary) Provider /dev/mfid6 is not part of resource disk6.
> Sep 24 06:43:39 san1 hastd[3343]: [disk6] (primary) Worker process exited ungracefully (pid=3413, exitcode=66).
> Sep 24 06:43:39 san1 hastd[3425]: [disk10] (primary) Unable to open /dev/mfid10: No such file or directory.
> Sep 24 06:43:39 san1 hastd[3407]: [disk4] (primary) Provider /dev/mfid4 is not part of resource disk4.

It looks like your disk numbering has changed, and your other email
confirms this. You should update hast.conf accordingly.

> Sep 24 06:43:40 san1 hastd[3351]: [disk2] (primary) Resource unique ID mismatch (primary=2635341666474957411, secondary=5944493181984227803).
> Sep 24 06:43:45 san1 hastd[3348]: [disk1] (primary) Split-brain condition!
> Sep 24 06:43:50 san1 hastd[3351]: [disk2] (primary) Resource unique ID mismatch (primary=2635341666474957411, secondary=5944493181984227803).
> Sep 24 06:43:55 san1 hastd[3348]: [disk1] (primary) Split-brain condition!

Split-brain can only be fixed manually, by deciding which host contains
the up-to-date data and recreating the HAST resources (disk1 and disk2
in this case) on the other host.

The simplest way to recover from your situation looks like the
following. Suppose host A is the host where the disk was changed and
things got messed up, and host B is the "good" host.

1) Disable automatic failover, if you have any.

2) On host A set all HAST resources to init.

3) On host B set all HAST resources to primary.

4) On host B import the pool and check that it works OK there and that
   you have your data.

5) On host A recreate the HAST resources (hastctl create disk1...)

6) On host A change the role to secondary for all HAST resources. A
   synchronization process should start.

7) Wait until the synchronization is complete, checking hastctl status
   on host B (the primary).

After this you can switch the pool back to host A if you want, and
re-enable automatic failover.

Actually, you can switch to host A without waiting for the
synchronization to complete. It will work, but read requests will go to
the remote host B until the synchronization is complete, so I would not
do this unless there is a good reason.

Depending on how badly things got messed up, it might be possible to
recover faster, without recreating and resyncing all devices, by fixing
the disk numbering in hast.conf and recreating and resyncing only the
resources in split-brain state. But that would require more manual work,
careful investigation of the logs, and a good understanding of what you
are doing.

-- 
Mikolaj Golub
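
The recovery steps above can be sketched as a small dry-run shell script. This is only an illustration under stated assumptions, not a tested procedure: the resource list, the pool name "tank", and the helper function names are hypothetical, and it presumes hast.conf has already been corrected on host A. By default it only prints the commands; set RUN=1 to actually execute them.

```sh
#!/bin/sh
# Dry-run sketch of the recovery steps; prints commands unless RUN=1.
# Assumptions (not from the original mail): the resource names listed in
# RESOURCES, the pool name "tank", and the function names below.
RESOURCES="${RESOURCES:-disk1 disk2}"
POOL="${POOL:-tank}"

# Print the command, or execute it when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Step 2, on host A: put every resource into the init role.
host_a_init() {
    for r in $RESOURCES; do run hastctl role init "$r"; done
}

# Steps 3-4, on host B: become primary, then import and inspect the pool.
# (The import may need extra options if the pool was not exported cleanly.)
host_b_primary() {
    for r in $RESOURCES; do run hastctl role primary "$r"; done
    run zpool import "$POOL"
    run zpool status "$POOL"
}

# Steps 5-6, on host A: recreate the resources, then become secondary.
host_a_resync() {
    for r in $RESOURCES; do run hastctl create "$r"; done
    for r in $RESOURCES; do run hastctl role secondary "$r"; done
}

# Step 7, on host B: check synchronization progress.
host_b_status() {
    for r in $RESOURCES; do run hastctl status "$r"; done
}
```

With the default settings, calling host_a_init prints "hastctl role init disk1" and "hastctl role init disk2", so the command sequence can be reviewed before anything is run on a real host.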
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120926084031.GA6578>