From: Jason Birch
To: "freebsd-questions@freebsd.org"
Date: Sun, 14 Jul 2013 21:38:07 +1000
Subject: Re-add lost device entries without a reboot; troubleshoot RAID card

I have several hard drives running through an M1015 flashed to think it's an LSI 9211-8i IT. I've been running them successfully for the last three months through mps(4) as part of a raidz pool, but the pool dropped to a degraded state when /dev/da0 (and its associated gpt device) disappeared after some apparent errors.

After a reboot, I noticed that the disk that disappeared - da0 - was successfully probed and resilvered back into the existing pool. I ran a short SMART self test and everything was fine. I ran a long SMART self test and the drive disappeared again towards the end of the scan (I didn't get a chance to view the results).

I'd like to know if there's a way to ask the system to 're-probe' its connections to see if there are any devices that can be reconnected. It's clear that the drive is still around and at least partially responsive - is there a way I can online this disk, as just a device in its own right, so that I can finish running the SMART diagnostics?

I've read some old mentions of mps not being the most stable thing under load, but the mentions are over a year old. The initial failure happened right at the time the daily periodic was running (which includes a check for negative permissions on the ZFS partition) and the second failure was during a SMART long test, so I guess there's potential for "load" there.

How might I go about diagnosing whether this is just the drive or possibly the card itself? I suppose the obvious "move it off the RAID card" is probably a good first step...

$ uname -a
FreeBSD blackfyre 9.1-RELEASE-p4 FreeBSD 9.1-RELEASE-p4 #0: Mon Jun 17 11:42:37 UTC 2013 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64

dmesg output when things started going south the first time:

Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): WRITE(10). CDB: 2a 0 5a ca e4 98 0 0 8 0 length 4096 SMID 563 terminated ioc 804b scsi 0 state c xfer 0
Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): WRITE(10). CDB: 2a 0 23 55 ec 58 0 0 8 0 length 4096 SMID 557 terminated ioc 804b scsi 0 state c xfer 0
Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): WRITE(10). CDB: 2a 0 5a d7 a7 f8 0 0 8 0 length 4096 SMID 889 terminated ioc 804b scsi 0 state c xfer 0
Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): WRITE(10). CDB: 2a 0 23 55 ec 60 0 0 8 0 length 4096 SMID 61 terminated ioc 804b scsi 0 state c xfer 0
Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): WRITE(10). CDB: 2a 0 23 55 ec 60 0 0 8 0
Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): CAM status: SCSI Status Error
Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): SCSI status: Check Condition
Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): Retrying command (per sense data)
Jul 11 03:07:25 blackfyre kernel: (da0:mps0:0:0:0): WRITE(10). CDB: 2a 0 23 55 ec a0 0 0 8 0
Jul 11 03:07:25 blackfyre kernel: (da0:mps0:0:0:0): CAM status: SCSI Status Error
Jul 11 03:07:25 blackfyre kernel: (da0:mps0:0:0:0): SCSI status: Check Condition
Jul 11 03:07:25 blackfyre kernel: (da0:mps0:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jul 11 03:07:25 blackfyre kernel: (da0:mps0:0:0:0): Retrying command (per sense data)

Device picked up again on restart:

Jul 14 15:04:15 blackfyre kernel: da0 at mps0 bus 0 scbus0 target 0 lun 0
Jul 14 15:04:15 blackfyre kernel: da0: Fixed Direct Access SCSI-6 device
Jul 14 15:04:15 blackfyre kernel: da0: 600.000MB/s transfers
Jul 14 15:04:15 blackfyre kernel: da0: Command Queueing enabled
Jul 14 15:04:15 blackfyre kernel: da0: 2861588MB (5860533168 512 byte sectors: 255H 63S/T 364801C)

Device going south a second time:

Jul 14 18:36:56 blackfyre kernel: (da0:mps0:0:0:0): READ(10). CDB: 28 0 d 7d 76 10 0 0 38 0
Jul 14 18:36:56 blackfyre kernel: (da0:mps0:0:0:0): CAM status: SCSI Status Error
Jul 14 18:36:56 blackfyre kernel: (da0:mps0:0:0:0): SCSI status: Check Condition
Jul 14 18:36:56 blackfyre kernel: (da0:mps0:0:0:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
Jul 14 18:36:56 blackfyre kernel: (da0:mps0:0:0:0): Retrying command (per sense data)
Jul 14 18:37:02 blackfyre kernel: (da0:mps0:0:0:0): READ(10). CDB: 28 0 d 94 b8 20 0 0 38 0
Jul 14 18:37:02 blackfyre kernel: (da0:mps0:0:0:0): CAM status: SCSI Status Error
Jul 14 18:37:02 blackfyre kernel: (da0:mps0:0:0:0): SCSI status: Check Condition
Jul 14 18:37:02 blackfyre kernel: (da0:mps0:0:0:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
Jul 14 18:37:02 blackfyre kernel: (da0:mps0:0:0:0): Retrying command (per sense data)

Culminating in the device being removed from /dev/:

Jul 14 18:39:17 blackfyre kernel: (noperiph:mps0:0:0:0): SMID 3 finished recovery after aborting TaskMID 667
Jul 14 18:39:17 blackfyre kernel: mps0: mpssas_free_tm releasing simq
Jul 14 18:39:22 blackfyre kernel: (da0:mps0:0:0:0): TEST UNIT READY. CDB: 0 0 0 0 0 0 length 0 SMID 969 terminated ioc 804b scsi 0 state c xfer 0
Jul 14 18:39:25 blackfyre kernel: (da0:mps0:0:0:0): TEST UNIT READY. CDB: 0 0 0 0 0 0 length 0 SMID 774 terminated ioc 804b scsi 0 state c xfer 0
Jul 14 18:39:29 blackfyre kernel: (da0:mps0:0:0:0): TEST UNIT READY. CDB: 0 0 0 0 0 0 length 0 SMID 880 terminated ioc 804b scsi 0 state c xfer 0
Jul 14 18:39:33 blackfyre kernel: (da0:mps0:0:0:0): TEST UNIT READY. CDB: 0 0 0 0 0 0 length 0 SMID 722 terminated ioc 804b scsi 0 state c xfer 0
Jul 14 18:39:33 blackfyre kernel: (da0:mps0:0:0:0): TEST UNIT READY. CDB: 0 0 0 0 0 0 length 0 SMID 244 terminated ioc 804b scsi 0 state c xfer 0
Jul 14 18:39:37 blackfyre kernel: (da0:mps0:0:0:0): TEST UNIT READY. CDB: 0 0 0 0 0 0 length 0 SMID 911 terminated ioc 804b scsi 0 state c xfer 0
Jul 14 18:40:02 blackfyre kernel: mps0: mpssas_alloc_tm freezing simq
Jul 14 18:40:02 blackfyre kernel: mps0: mpssas_remove_complete on handle 0x0009, IOCStatus= 0x0
Jul 14 18:40:02 blackfyre kernel: mps0: mpssas_free_tm releasing simq
Jul 14 18:40:02 blackfyre kernel: (da0:(pass0:mps0:0:0:mps0:0:0): lost device - 3 outstanding, 2 refs
Jul 14 18:40:02 blackfyre kernel: 0:0): passdevgonecb: devfs entry is gone
Jul 14 18:40:02 blackfyre kernel: (da0:mps0:0:0:0): oustanding 2
Jul 14 18:40:02 blackfyre kernel: (da0:mps0:0:0:0): oustanding 1
Jul 14 18:40:02 blackfyre kernel: (da0:mps0:0:0:0): oustanding 0
Jul 14 18:40:02 blackfyre kernel: (da0:mps0:0:0:0): removing device entry
Jul 14 18:40:34 blackfyre kernel: mpssas_get_sata_identify: error reading SATA PASSTHRU; iocstatus = 0x804b
Jul 14 18:40:34 blackfyre last message repeated 4 times
Jul 14 18:40:34 blackfyre kernel: _mapping_get_dev_info: failed to compute the hashed SAS Address for SATA device with handle 0x0009
Jul 14 18:40:34 blackfyre kernel: mpssas_get_sata_identify: error reading SATA PASSTHRU; iocstatus = 0x804b
Jul 14 18:40:34 blackfyre last message repeated 4 times
Jul 14 18:41:10 blackfyre kernel: mps0: mpssas_alloc_tm freezing simq
Jul 14 18:41:12 blackfyre kernel: (probe0:mps0:0:6:0): INQUIRY. CDB: 12 0 0 0 24 0 length 36 SMID 75 terminated ioc 804b scsi 0 state c xfer 0
Jul 14 18:41:12 blackfyre kernel: mps0: IOCStatus = 0x4b while resetting device 0x9
Jul 14 18:41:12 blackfyre kernel: mps0: mpssas_free_tm releasing simq
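
In case it helps anyone reading the CDB dumps in the logs above, here is a minimal sketch for turning the READ(10)/WRITE(10) bytes into a block address and transfer size. It assumes the standard SCSI 10-byte CDB layout (opcode, flags, 4-byte big-endian LBA, group number, 2-byte big-endian transfer length, control) and the 512-byte sectors reported in the dmesg probe output. The function name decode_cdb10 is made up for illustration and is not part of any FreeBSD tool.

```python
SECTOR_SIZE = 512  # assumption: taken from "512 byte sectors" in the probe output

# Only the two 10-byte opcodes that appear in the logs are handled.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


def decode_cdb10(cdb_text, sector_size=SECTOR_SIZE):
    """Decode a 10-byte CDB printed as space-separated hex bytes."""
    cdb = [int(tok, 16) for tok in cdb_text.split()]
    if len(cdb) != 10:
        raise ValueError("expected 10 CDB bytes, got %d" % len(cdb))
    if cdb[0] not in OPCODES:
        raise ValueError("unsupported opcode 0x%02x" % cdb[0])
    lba = int.from_bytes(bytes(cdb[2:6]), "big")     # bytes 2..5: start block
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")  # bytes 7..8: block count
    return {
        "op": OPCODES[cdb[0]],
        "lba": lba,
        "blocks": blocks,
        "bytes": blocks * sector_size,
    }


if __name__ == "__main__":
    # CDBs copied from the first WRITE(10) and first READ(10) log entries.
    for text in ("2a 0 5a ca e4 98 0 0 8 0", "28 0 d 7d 76 10 0 0 38 0"):
        print(text, "->", decode_cdb10(text))
```

For the first failed write, the decoded transfer is 8 blocks, which at 512 bytes per sector agrees with the "length 4096" that mps printed on the same log line.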