Sender: Mikolaj Golub <to.my.trociny@gmail.com>
Date: Wed, 26 Sep 2012 11:42:07 +0300
From: Mikolaj Golub <trociny@FreeBSD.org>
To: "Jose A. Lombera" <jose@lajni.com>
Message-ID: <20120926084031.GA6578@gmail.com>
References: <013101cd9a18$8106ede0$8314c9a0$@lajni.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <013101cd9a18$8106ede0$8314c9a0$@lajni.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: freebsd-current@freebsd.org
Subject: Re: zpool can't bring online disk2 ----I screwed up

On Sun, Sep 23, 2012 at 10:50:28PM -0700, Jose A. Lombera wrote:

> This is the error I got when I run the failover script.
> 
>  
> 
> Sep 24 06:43:39 san1 hastd[3404]: [disk3] (primary) Provider /dev/mfid3 is not part of resource disk3.
> 
> Sep 24 06:43:39 san1 hastd[3343]: [disk3] (primary) Worker process exited ungracefully (pid=3404, exitcode=66).
> 
> Sep 24 06:43:39 san1 hastd[3413]: [disk6] (primary) Provider /dev/mfid6 is not part of resource disk6.
> 
> Sep 24 06:43:39 san1 hastd[3343]: [disk6] (primary) Worker process exited ungracefully (pid=3413, exitcode=66).
> 
> Sep 24 06:43:39 san1 hastd[3425]: [disk10] (primary) Unable to open /dev/mfid10: No such file or directory.
> 
> Sep 24 06:43:39 san1 hastd[3407]: [disk4] (primary) Provider /dev/mfid4 is not part of resource disk4.

It looks like your disk numbering has changed, and your other email
confirms this. You should update the provider paths in hast.conf
accordingly.
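
As a quick sanity check before editing hast.conf, something like the
following could list every "local" provider path in the file and report
whether the device node exists. This is only a sketch: the function name
check_hast_providers is made up for illustration, and it does not
distinguish between the "on" sections of the two hosts, so run it where
both hosts use the same device names or read the output with that in
mind. It would catch the "No such file or directory" case (mfid10), but
not the "is not part of resource" case, where the device exists but
carries another resource's metadata.

```shell
#!/bin/sh
# check_hast_providers FILE
# Print each distinct "local" provider path found in a hast.conf-style
# file, prefixed with "ok" if the path exists or "missing" if it does not.
check_hast_providers() {
    awk '$1 == "local" { print $2 }' "$1" | sort -u |
    while read -r dev; do
        if [ -e "$dev" ]; then
            echo "ok      $dev"
        else
            echo "missing $dev"
        fi
    done
}
```

For example: check_hast_providers /etc/hast.conf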

> Sep 24 06:43:40 san1 hastd[3351]: [disk2] (primary) Resource unique ID mismatch (primary=2635341666474957411, secondary=5944493181984227803).
> 
> Sep 24 06:43:45 san1 hastd[3348]: [disk1] (primary) Split-brain condition!
> 
> Sep 24 06:43:50 san1 hastd[3351]: [disk2] (primary) Resource unique ID mismatch (primary=2635341666474957411, secondary=5944493181984227803).
> 
> Sep 24 06:43:55 san1 hastd[3348]: [disk1] (primary) Split-brain condition!

Split-brain can only be fixed manually: decide which host holds the
up-to-date data and recreate the HAST resources (disk1 and disk2 in
this case) on the other host.

The simplest way to recover from your situation is the following.

Suppose host A is the host where the disk was changed and things got
messed up, and host B is the "good" host.

1) Disable automatic failover, if you have any.
2) On host A, set all HAST resources to init.
3) On host B, set all HAST resources to primary.
4) On host B, import the pool and check that it works there and that
   you have your data.
5) On host A, recreate the HAST resources (hastctl create disk1...).
6) On host A, change the role to secondary for all HAST
   resources. A synchronization process should start.
7) Wait until the synchronization is complete, checking hastctl status
   on host B (the primary).
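
The steps above could be sketched as shell functions, roughly like
this. Treat it as an outline, not a tested script: the pool name "tank"
and the function names are placeholders I made up, RESOURCES must list
every resource you want to recreate on host A, and step 1 is left out
because it depends on how your failover is set up. By default it only
prints the commands (DRY_RUN=1), so you can review them first.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}
POOL=${POOL:-tank}             # placeholder pool name, use your own
RESOURCES=${RESOURCES:-disk1}  # list every resource to recreate on host A

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 2, run on host A: drop all resources to the init role.
host_a_init() {
    run hastctl role init all
}

# Steps 3 and 4, run on host B: become primary, import and check the pool.
# (zpool import may need -f if the pool was not exported cleanly.)
host_b_takeover() {
    run hastctl role primary all
    run zpool import "$POOL"
    run zpool status "$POOL"
}

# Steps 5 and 6, run on host A: recreate the metadata, then become
# secondary so that synchronization from host B starts.
host_a_rejoin() {
    # RESOURCES is intentionally unquoted so it splits into names.
    run hastctl create $RESOURCES
    run hastctl role secondary all
}

# Step 7, run on host B: check synchronization progress.
host_b_check_sync() {
    run hastctl status
}
```

Source the file on the relevant host and call one function at a time,
for example host_a_init on host A; set DRY_RUN=0 only once the printed
commands look right for your setup.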

After this you can switch the pool back to host A if you want and
re-enable automatic failover.

Actually, you can switch to host A without waiting for the
synchronization to complete. It will work, but read requests will go
to the remote host B until the synchronization is complete, so I would
not do this unless there is a good reason for it.

It might be possible to recover faster, without recreating and
resyncing all devices, depending on how badly things got messed up:
fix the disk numbering in hast.conf and recreate and resync only the
resources in split-brain state. But this would require more manual
work, careful investigation of the logs, and a good understanding of
what you are doing.

-- 
Mikolaj Golub