Date: Mon, 15 Jul 2013 19:18:33 +1000
From: Jason Birch <jbirch@jbirch.net>
To: "freebsd-questions@freebsd.org" <freebsd-questions@freebsd.org>
Subject: Re: Re-add lost device entries without a reboot; troubleshoot RAID card
Message-ID: <CAA=KUhvfPPTD=yq3JSguVED=hdUYi2grynhSwpfHN7Q0xfEyiQ@mail.gmail.com>
In-Reply-To: <CAA=KUhtLWe1h2z_sS8vCPRWf-nvv=asVBUFrpPyVuVuyksTPfA@mail.gmail.com>
References: <CAA=KUhtLWe1h2z_sS8vCPRWf-nvv=asVBUFrpPyVuVuyksTPfA@mail.gmail.com>

On Sun, Jul 14, 2013 at 9:38 PM, Jason Birch <jbirch@jbirch.net> wrote:

> I have several hard drives running through an M1015 flashed to think it's
> an LSI 9211-8i IT. I've been running them successfully for the last three
> months through mps(4) as part of a raidz pool, but had the pool drop to a
> degraded state when /dev/da0 (and associated gpt device) disappeared after
> some apparent errors.
>
> After a reboot, I noticed that the disk that disappeared - da0 - was
> successfully probed and resilvered back in to the existing pool. I ran a
> short SMART self test and everything was fine. I ran a long SMART self test
> and the drive disappeared again towards the end of the scan (I didn't get a
> chance to view the results)
>
> I'd like to know if there's a way to suggest to 're-probe' connections to
> see if there are any devices that can be reconnected. It's clear that the
> drive is still around and at least partially responsive - is there a way I
> can online this disk, as just a device in its own right, such that I can
> finish running the SMART diagnostics?
>
> I've read some old mentions of mps not being the most stable thing under
> load, but the mentions are over a year old. The initial failure happened
> right at the time the daily periodic was running (Which includes a check
> for negative permissions on the zfs partition) and the second failure was
> during a SMART long test, so I guess there's potential for "load" there.
> How might I go about diagnosing whether this is just the drive or possibly
> the card itself? I suppose the obvious "Move it off the raid card" is
> probably a good first start...
>
> $ uname -a
> FreeBSD blackfyre 9.1-RELEASE-p4 FreeBSD 9.1-RELEASE-p4 #0: Mon Jun 17
> 11:42:37 UTC 2013 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC
> amd64
>
> dmesg output when things started going south the first time:
>
> Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): WRITE(10). CDB: 2a 0
> 5a ca e4 98 0 0 8 0 length 4096 SMID 563 terminated ioc 804b scsi 0 state c
> xfer 0
> Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): WRITE(10). CDB: 2a 0
> 23 55 ec 58 0 0 8 0 length 4096 SMID 557 terminated ioc 804b scsi 0 state c
> xfer 0
> Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): WRITE(10). CDB: 2a 0
> 5a d7 a7 f8 0 0 8 0 length 4096 SMID 889 terminated ioc 804b scsi 0 state c
> xfer 0
> Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): WRITE(10). CDB: 2a 0
> 23 55 ec 60 0 0 8 0 length 4096 SMID 61 terminated ioc 804b scsi 0 state c
> xfer 0
> Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): WRITE(10). CDB: 2a 0
> 23 55 ec 60 0 0 8 0
> Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): CAM status: SCSI
> Status Error
> Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): SCSI status: Check
> Condition
> Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): SCSI sense: UNIT
> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): Retrying command (per
> sense data)
> Jul 11 03:07:25 blackfyre kernel: (da0:mps0:0:0:0): WRITE(10). CDB: 2a 0
> 23 55 ec a0 0 0 8 0
> Jul 11 03:07:25 blackfyre kernel: (da0:mps0:0:0:0): CAM status: SCSI
> Status Error
> Jul 11 03:07:25 blackfyre kernel: (da0:mps0:0:0:0): SCSI status: Check
> Condition
> Jul 11 03:07:25 blackfyre kernel: (da0:mps0:0:0:0): SCSI sense: UNIT
> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jul 11 03:07:25 blackfyre kernel: (da0:mps0:0:0:0): Retrying command (per
> sense data)
>
> Device picked up again on restart:
>
> Jul 14 15:04:15 blackfyre kernel: da0 at mps0 bus 0 scbus0 target 0 lun 0
> Jul 14 15:04:15 blackfyre kernel: da0: <ATA ST3000DM001-9YN1 CC4H> Fixed
> Direct Access SCSI-6 device
> Jul 14 15:04:15 blackfyre kernel: da0: 600.000MB/s transfers
> Jul 14 15:04:15 blackfyre kernel: da0: Command Queueing enabled
> Jul 14 15:04:15 blackfyre kernel: da0: 2861588MB (5860533168 512 byte
> sectors: 255H 63S/T 364801C)
>
> Device going south a second time:
>
> Jul 14 18:36:56 blackfyre kernel: (da0:mps0:0:0:0): READ(10). CDB: 28 0 d
> 7d 76 10 0 0 38 0
> Jul 14 18:36:56 blackfyre kernel: (da0:mps0:0:0:0): CAM status: SCSI
> Status Error
> Jul 14 18:36:56 blackfyre kernel: (da0:mps0:0:0:0): SCSI status: Check
> Condition
> Jul 14 18:36:56 blackfyre kernel: (da0:mps0:0:0:0): SCSI sense: ABORTED
> COMMAND asc:47,3 (Information unit iuCRC error detected)
> Jul 14 18:36:56 blackfyre kernel: (da0:mps0:0:0:0): Retrying command (per
> sense data)
> Jul 14 18:37:02 blackfyre kernel: (da0:mps0:0:0:0): READ(10). CDB: 28 0 d
> 94 b8 20 0 0 38 0
> Jul 14 18:37:02 blackfyre kernel: (da0:mps0:0:0:0): CAM status: SCSI
> Status Error
> Jul 14 18:37:02 blackfyre kernel: (da0:mps0:0:0:0): SCSI status: Check
> Condition
> Jul 14 18:37:02 blackfyre kernel: (da0:mps0:0:0:0): SCSI sense: ABORTED
> COMMAND asc:47,3 (Information unit iuCRC error detected)
> Jul 14 18:37:02 blackfyre kernel: (da0:mps0:0:0:0): Retrying command (per
> sense data)
>
>
> Culminating in the device being removed from /dev/:
>
> Jul 14 18:39:17 blackfyre kernel: (noperiph:mps0:0:0:0): SMID 3 finished
> recovery after aborting TaskMID 667
> Jul 14 18:39:17 blackfyre kernel: mps0: mpssas_free_tm releasing simq
> Jul 14 18:39:22 blackfyre kernel: (da0:mps0:0:0:0): TEST UNIT READY. CDB:
> 0 0 0 0 0 0 length 0 SMID 969 terminated ioc 804b scsi 0 state c xfer 0
> Jul 14 18:39:25 blackfyre kernel: (da0:mps0:0:0:0): TEST UNIT READY. CDB:
> 0 0 0 0 0 0 length 0 SMID 774 terminated ioc 804b scsi 0 state c xfer 0
> Jul 14 18:39:29 blackfyre kernel: (da0:mps0:0:0:0): TEST UNIT READY. CDB:
> 0 0 0 0 0 0 length 0 SMID 880 terminated ioc 804b scsi 0 state c xfer 0
> Jul 14 18:39:33 blackfyre kernel: (da0:mps0:0:0:0): TEST UNIT READY. CDB:
> 0 0 0 0 0 0 length 0 SMID 722 terminated ioc 804b scsi 0 state c xfer 0
> Jul 14 18:39:33 blackfyre kernel: (da0:mps0:0:0:0): TEST UNIT READY. CDB:
> 0 0 0 0 0 0 length 0 SMID 244 terminated ioc 804b scsi 0 state c xfer 0
> Jul 14 18:39:37 blackfyre kernel: (da0:mps0:0:0:0): TEST UNIT READY. CDB:
> 0 0 0 0 0 0 length 0 SMID 911 terminated ioc 804b scsi 0 state c xfer 0
> Jul 14 18:40:02 blackfyre kernel: mps0: mpssas_alloc_tm freezing simq
> Jul 14 18:40:02 blackfyre kernel: mps0: mpssas_remove_complete on handle
> 0x0009, IOCStatus= 0x0
> Jul 14 18:40:02 blackfyre kernel: mps0: mpssas_free_tm releasing simq
> Jul 14 18:40:02 blackfyre kernel: (da0:(pass0:mps0:0:0:mps0:0:0): lost
> device - 3 outstanding, 2 refs
> Jul 14 18:40:02 blackfyre kernel: 0:0): passdevgonecb: devfs entry is gone
> Jul 14 18:40:02 blackfyre kernel: (da0:mps0:0:0:0): oustanding 2
> Jul 14 18:40:02 blackfyre kernel: (da0:mps0:0:0:0): oustanding 1
> Jul 14 18:40:02 blackfyre kernel: (da0:mps0:0:0:0): oustanding 0
> Jul 14 18:40:02 blackfyre kernel: (da0:mps0:0:0:0): removing device entry
> Jul 14 18:40:34 blackfyre kernel: mpssas_get_sata_identify: error reading
> SATA PASSTHRU; iocstatus = 0x804b
> Jul 14 18:40:34 blackfyre last message repeated 4 times
> Jul 14 18:40:34 blackfyre kernel: _mapping_get_dev_info: failed to compute
> the hashed SAS Address for SATA device with handle 0x0009
> Jul 14 18:40:34 blackfyre kernel: mpssas_get_sata_identify: error reading
> SATA PASSTHRU; iocstatus = 0x804b
> Jul 14 18:40:34 blackfyre last message repeated 4 times
> Jul 14 18:41:10 blackfyre kernel: mps0: mpssas_alloc_tm freezing simq
> Jul 14 18:41:12 blackfyre kernel: (probe0:mps0:0:6:0): INQUIRY. CDB: 12 0
> 0 0 24 0 length 36 SMID 75 terminated ioc 804b scsi 0 state c xfer 0
> Jul 14 18:41:12 blackfyre kernel: mps0: IOCStatus = 0x4b while resetting
> device 0x9
> Jul 14 18:41:12 blackfyre kernel: mps0: mpssas_free_tm releasing simq
>

I should note that `camcontrol rescan 0` (Or `camcontrol rescan all`) won't find da0.
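
For anyone wanting to see which sectors the failing commands in the logs above were touching, the CDB bytes that CAM prints can be decoded by hand. The following is only a rough sketch: it assumes the standard SCSI 10-byte READ(10)/WRITE(10) layout (opcode in byte 0, big-endian LBA in bytes 2-5, big-endian transfer length in bytes 7-8), and the function name `decode_rw10` is made up for illustration, not part of any FreeBSD tool.

```python
import struct

# Opcodes for the two 10-byte commands that appear in the logs.
RW10_OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


def decode_rw10(cdb_text):
    """Decode a READ(10)/WRITE(10) CDB given as space-separated hex bytes.

    Returns (command name, starting LBA, transfer length in blocks).
    Hypothetical helper; assumes the standard 10-byte CDB layout.
    """
    cdb = bytes(int(tok, 16) for tok in cdb_text.split())
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB, got %d bytes" % len(cdb))
    if cdb[0] not in RW10_OPCODES:
        raise ValueError("not a READ(10)/WRITE(10) opcode: 0x%02x" % cdb[0])
    (lba,) = struct.unpack(">I", cdb[2:6])     # bytes 2-5: logical block address
    (blocks,) = struct.unpack(">H", cdb[7:9])  # bytes 7-8: transfer length
    return RW10_OPCODES[cdb[0]], lba, blocks


if __name__ == "__main__":
    # First failing write and first failing read from the logs.
    for text in ("2a 0 5a ca e4 98 0 0 8 0", "28 0 d 7d 76 10 0 0 38 0"):
        name, lba, blocks = decode_rw10(text)
        print("%s lba=%d blocks=%d" % (name, lba, blocks))
```

With 512-byte sectors, as da0 reports, the 8-block transfer length of the first write works out to the 4096 bytes shown as "length 4096" in the same log line.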