From: Mikolaj Golub
Date: Wed, 26 Sep 2012 11:42:07 +0300
To: "Jose A. Lombera"
Cc: freebsd-current@freebsd.org
Subject: Re: zpool can't bring online disk2 ----I screwed up
Message-ID: <20120926084031.GA6578@gmail.com>
In-Reply-To: <013101cd9a18$8106ede0$8314c9a0$@lajni.com>

On Sun, Sep 23, 2012 at 10:50:28PM -0700, Jose A. Lombera wrote:

> This is the error I got when I run the failover script.
>
> Sep 24 06:43:39 san1 hastd[3404]: [disk3] (primary) Provider /dev/mfid3 is not part of resource disk3.
> Sep 24 06:43:39 san1 hastd[3343]: [disk3] (primary) Worker process exited ungracefully (pid=3404, exitcode=66).
> Sep 24 06:43:39 san1 hastd[3413]: [disk6] (primary) Provider /dev/mfid6 is not part of resource disk6.
> Sep 24 06:43:39 san1 hastd[3343]: [disk6] (primary) Worker process exited ungracefully (pid=3413, exitcode=66).
> Sep 24 06:43:39 san1 hastd[3425]: [disk10] (primary) Unable to open /dev/mfid10: No such file or directory.
> Sep 24 06:43:39 san1 hastd[3407]: [disk4] (primary) Provider /dev/mfid4 is not part of resource disk4.

It looks like your disk numbering has changed; your other email
confirms this. You should update the provider paths in hast.conf
accordingly.

> Sep 24 06:43:40 san1 hastd[3351]: [disk2] (primary) Resource unique ID mismatch (primary=2635341666474957411, secondary=5944493181984227803).
> Sep 24 06:43:45 san1 hastd[3348]: [disk1] (primary) Split-brain condition!
> Sep 24 06:43:50 san1 hastd[3351]: [disk2] (primary) Resource unique ID mismatch (primary=2635341666474957411, secondary=5944493181984227803).
> Sep 24 06:43:55 san1 hastd[3348]: [disk1] (primary) Split-brain condition!

Split-brain can only be fixed manually: decide which host holds the
current data, then recreate the HAST resources (disk1 and disk2 in
this case) on the other host.

The simplest way to recover from your situation looks like this.
Suppose host A is the host where the disk was changed and things got
messed up, and host B is the "good" host.

1) Disable automatic failover, if you have any.

2) On host A, set all HAST resources to init.

3) On host B, set all HAST resources to primary.

4) On host B, import the pool and check that it works and that your
   data is there.

5) On host A, recreate the HAST resources (hastctl create disk1...).

6) On host A, change the role to secondary for all HAST resources.
   Synchronization should start.

7) Wait until synchronization is complete, checking hastctl status on
   host B (the primary).

After this you can switch the pool back to host A if you want, and
re-enable automatic failover.

Actually, you can switch to host A without waiting for synchronization
to complete. It will work, but read requests will go to the remote
host B until synchronization finishes, so I would not do this unless
there is a good reason.

It might be possible to recover faster, without recreating and
resyncing all devices, depending on how badly things got messed up:
fix the disk numbering in hast.conf, then recreate and resync only the
resources in split-brain state. But that would require more manual
work, careful investigation of the logs, and a good understanding of
what you are doing.

-- 
Mikolaj Golub
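
For reference, steps 2 to 7 above can be sketched as a dry run that only prints the commands, grouped by the host they would be run on. This is a rough illustration, not a tested recovery script: the resource list (RESOURCES) and the pool name (POOL) are hypothetical placeholders that do not come from this thread, and any extra options your setup needs are left out.

```shell
#!/bin/sh
# Dry run: prints the commands for steps 2-7; it does not execute them.
# RESOURCES and POOL are hypothetical placeholders, not values from this thread.
RESOURCES="${RESOURCES:-disk1 disk2}"
POOL="${POOL:-mypool}"

recovery_plan() {
    echo "# on host A: step 2, drop every resource to init"
    echo "hastctl role init all"

    echo "# on host B: steps 3-4, become primary, import and check the pool"
    echo "hastctl role primary all"
    echo "zpool import $POOL"
    echo "zpool status $POOL"

    echo "# on host A: steps 5-6, recreate the resources, then become secondary"
    for r in $RESOURCES; do
        echo "hastctl create $r"
    done
    echo "hastctl role secondary all"

    echo "# on host B: step 7, check synchronization progress"
    echo "hastctl status all"
}

recovery_plan
```

Printing the plan instead of running it keeps the host A and host B commands visibly separate, so each group can be reviewed and then pasted on the right machine by hand.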