Date:      Wed, 20 Feb 2013 14:54:54 -0600
From:      Chad M Stewart <cms@balius.com>
To:        freebsd-questions@freebsd.org
Subject:   HAST - detect failure and restore avoiding an outage?
Message-ID:  <E3C8C9A2-712E-4925-995A-0471CCD3515B@balius.com>


I built a 2-node cluster to test HAST.  Each node is an older HP server with 6 SCSI disks.  Each disk is configured as a single-disk RAID 0 volume in the RAID controller, since I wanted the equivalent of a JBOD presented to FreeBSD 9.1 x86.  I allocated one disk for the OS and the other 5 disks to HAST.
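
For reference, the HAST side of things is a plain five-resource /etc/hast.conf along these lines (a minimal sketch; the da1-da5 device paths and the addresses are stand-ins for illustration, not a paste of the real file):

# /etc/hast.conf -- sketch only; device and address values are assumed.
resource disk1 {
	on node1 {
		local /dev/da1
		remote 192.168.1.2
	}
	on node2 {
		local /dev/da1
		remote 192.168.1.1
	}
}
# disk2 through disk5 are declared the same way, backed by da2 through da5.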

node2# zpool status
  pool: scsi-san
 state: ONLINE
  scan: scrub repaired 0 in 0h27m with 0 errors on Tue Feb 19 17:38:55 2013
config:

	NAME            STATE     READ WRITE CKSUM
	scsi-san        ONLINE       0     0     0
	  raidz1-0      ONLINE       0     0     0
	    hast/disk1  ONLINE       0     0     0
	    hast/disk2  ONLINE       0     0     0
	    hast/disk3  ONLINE       0     0     0
	    hast/disk4  ONLINE       0     0     0
	    hast/disk5  ONLINE       0     0     0


  pool: zroot
 state: ONLINE
  scan: none requested
config:

	NAME         STATE     READ WRITE CKSUM
	zroot        ONLINE       0     0     0
	  gpt/disk0  ONLINE       0     0     0



Yesterday I physically pulled disk2 from node1 to simulate a failure.  ZFS didn't see anything wrong, which is expected.  hastd did see the problem, also expected.  However, 'hastctl status' didn't show anything unusual or indicate any problem that I could see on either node; the only evidence was hastd reporting errors in the logs, otherwise everything looked fine.  Is there a way to detect a failed disk from hastd other than the logs?  camcontrol showed the disk had failed, and obviously I'll be monitoring with it as well.
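
For now the best I've come up with is a periodic check that greps syslog for hastd complaints and confirms the backing disks are still visible to CAM; a rough sketch below (the log path, the grep pattern, and the da1-da5 names are assumptions for illustration):

#!/bin/sh
# Rough check: complain if hastd has logged errors or a backing disk vanished.
# /var/log/messages, the da1-da5 names, and the grep pattern are assumptions.
if grep -qi 'hastd.*error' /var/log/messages; then
	echo "hastd reported errors, see /var/log/messages" | mail -s "HAST warning" root
fi
for D in da1 da2 da3 da4 da5; do
	camcontrol devlist | grep -qw "$D" || \
	    echo "disk $D missing from camcontrol devlist" | mail -s "disk failure" root
done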


For recovery I installed a new disk in the same slot.  To protect data reliability, the safest recovery procedure I can think of is the following:

1 - node1 - stop the apps
2 - node1 - export pool
3 - node1 - hastctl create disk2
4 - node1 - for D in 1 2 3 4 5; do hastctl role secondary disk$D; done
5 - node2 - for D in 1 2 3 4 5; do hastctl role primary disk$D; done
6 - node2 - import pool
7 - node2 - start the apps
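
Spelled out as commands, that procedure would look roughly like the sketch below (the pool name is real, the app rc script name is just a placeholder):

# On node1 (current primary) -- sketch of steps 1-4, not a tested script.
service myapp stop                 # "myapp" is a placeholder for the apps
zpool export scsi-san
hastctl create disk2               # re-initialize HAST metadata on the new disk
for D in 1 2 3 4 5; do hastctl role secondary disk$D; done

# On node2 (becomes primary) -- steps 5-7.
for D in 1 2 3 4 5; do hastctl role primary disk$D; done
zpool import scsi-san
service myapp start                # placeholder again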

At step 5, hastd will start resynchronizing node2:disk2 -> node1:disk2.  I've been trying to think of a way to re-establish the mirror without having to restart/move the pool _and_ without posing additional risk of data loss.
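
To know when that resync has finished, I'd probably just poll the dirty counter that hastctl reports on the new primary, something like this sketch (it assumes the counter is printed as "dirty: 0 (0B)" once the resource is clean):

# Sketch: wait until HAST reports no dirty extents left on disk2.
while hastctl status disk2 | grep 'dirty:' | grep -qv '0 (0B)'; do
	sleep 60
done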

To avoid an application outage I suppose the following would work:

1 - insert new disk in node1
2 - hastctl role init disk2
3 - hastctl create disk2
4 - hastctl role primary disk2

At that point ZFS would have seen a disk failure and started resilvering the pool.  There is no application outage, but until the resilver completes only 4 disks contain the data (assuming the pool is being written to, not holding static content).  With the previous steps there is an application outage, but a fully healthy pool is maintained at all times.
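
If I went that route I'd at least want to wait for the resilver to finish before treating the pool as safe again, e.g. this sketch:

# Sketch: poll until the resilver of scsi-san completes, then confirm health.
while zpool status scsi-san | grep -q 'resilver in progress'; do
	sleep 60
done
zpool status -x scsi-san           # should report the pool is healthy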

Is there another scenario I haven't thought of in which both data health and no application outage could be achieved?


Regards,
Chad



