Date:      Tue, 24 Apr 2012 12:55:06 -0500
From:      Dustin Wenz <dustinwenz@xtechllc.com>
To:        freebsd-stable@freebsd.org
Subject:   Can MPS discard a misbehaving disk?
Message-ID:  <17DD4C39-6905-4A5B-AE86-87F149CBD5BC@xtechllc.com>

I am having trouble with MPS becoming unresponsive in certain disk
failure conditions. So far, I've experienced this with 3TB Hitachi disks
(0S03208) and 3TB Seagate Barracuda disks (ST3000DM001, firmware CC9D)
while using the MPS driver with an LSI SAS2116 controller on FreeBSD
8.2-STABLE.

In these particular instances, the disks are part of a zpool of mirrors.
When a disk fails, I generally see a message like "kernel:
(da5:mps0:0:5:0): SCSI command timeout on device handle 0x0017 SMID
148", followed by an indefinite number of "mps0: (0:5:0) terminated ioc
804b scsi 0 state c xfer 65536" messages.
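
For anyone who wants to pick the affected device out of these messages
automatically, here is a minimal Python sketch. The pattern is inferred
only from the single timeout line quoted above, so it may not cover other
mps messages, and the names TIMEOUT_RE and parse_timeout are made up for
illustration; this is not part of mps or any FreeBSD tool.

```python
import re

# Pattern for the timeout line quoted above. The layout
# (device:controller:bus:target:lun) is an assumption based on that one
# example and may not match every message the driver emits.
TIMEOUT_RE = re.compile(
    r"\((?P<dev>\w+):(?P<ctrl>\w+):(?P<bus>\d+):(?P<target>\d+):(?P<lun>\d+)\): "
    r"SCSI command timeout on device handle (?P<handle>0x[0-9a-fA-F]+) "
    r"SMID (?P<smid>\d+)"
)


def parse_timeout(line):
    """Return the fields of a timeout line as a dict, or None if no match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    # Convert the numeric fields so they can be compared or counted.
    for key in ("bus", "target", "lun", "smid"):
        fields[key] = int(fields[key])
    fields["handle"] = int(fields["handle"], 16)
    return fields


sample = ("kernel: (da5:mps0:0:5:0): SCSI command timeout on "
          "device handle 0x0017 SMID 148")
print(parse_timeout(sample))
```

Something like this could feed a watchdog script that flags a disk after
repeated timeouts, though it obviously does nothing about the hang itself.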

What I want to happen in this case is for the disk to simply go offline
in the zpool, so that the pool continues functioning. However, the pool
status still shows the disk as online. Any attempt to disable the disk
(such as with zpool offline, remove, or detach) will hang and never
complete, as will attempting a rescan with camcontrol. Of course, any
attempt to access data in the pool will hang as well.

Rebooting the system in this state is also bad; when the disk is first
discovered, it will begin a cycle of mps SCSI errors during startup that
never seems to stop. The only way to recover, at least that I know of, is
to physically remove the disk from the chassis. Once I do that, the
system continues running perfectly.

Basically my question is this: How can I get MPS to ignore a failed disk
and never attempt to access it again? I don't care if it does so
automatically, or if I need to perform some administrative operation
to drop the device reference. I've seen a number of people on the list
having problems that appear similar to this, but those seem to have more
to do with firmware or compatibility issues. In my case, these disks are
definitely dead... they no longer work in any other systems, and often
make sad clicking noises.

I suppose this is also something that ZFS could do, independent of the
driver. If a device is unresponsive, shouldn't ZFS take it offline on
its own?

	- .Dustin



