From owner-freebsd-questions@FreeBSD.ORG  Mon Jul 15 09:18:40 2013
Return-Path: <owner-freebsd-questions@FreeBSD.ORG>
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id BD8873FE
 for <freebsd-questions@freebsd.org>; Mon, 15 Jul 2013 09:18:40 +0000 (UTC)
 (envelope-from jbirch@jbirch.net)
Received: from mail-ob0-f174.google.com (mail-ob0-f174.google.com
 [209.85.214.174]) by mx1.freebsd.org (Postfix) with ESMTP id 8BC40A8E
 for <freebsd-questions@freebsd.org>; Mon, 15 Jul 2013 09:18:40 +0000 (UTC)
Received: by mail-ob0-f174.google.com with SMTP id wd20so13605747obb.19
 for <freebsd-questions@freebsd.org>; Mon, 15 Jul 2013 02:18:34 -0700 (PDT)
MIME-Version: 1.0
X-Received: by 10.182.33.103 with SMTP id q7mr43046210obi.77.1373879914029;
 Mon, 15 Jul 2013 02:18:34 -0700 (PDT)
Received: by 10.182.144.226 with HTTP; Mon, 15 Jul 2013 02:18:33 -0700 (PDT)
X-Originating-IP: [101.175.139.248]
In-Reply-To: <CAA=KUhtLWe1h2z_sS8vCPRWf-nvv=asVBUFrpPyVuVuyksTPfA@mail.gmail.com>
References: <CAA=KUhtLWe1h2z_sS8vCPRWf-nvv=asVBUFrpPyVuVuyksTPfA@mail.gmail.com>
Date: Mon, 15 Jul 2013 19:18:33 +1000
Message-ID: <CAA=KUhvfPPTD=yq3JSguVED=hdUYi2grynhSwpfHN7Q0xfEyiQ@mail.gmail.com>
Subject: Re: Re-add lost device entries without a reboot;
 troubleshoot RAID card
From: Jason Birch <jbirch@jbirch.net>
To: "freebsd-questions@freebsd.org" <freebsd-questions@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: User questions <freebsd-questions.freebsd.org>
X-List-Received-Date: Mon, 15 Jul 2013 09:18:40 -0000

On Sun, Jul 14, 2013 at 9:38 PM, Jason Birch <jbirch@jbirch.net> wrote:

> I have several hard drives running through an M1015 flashed to think it's
> an LSI 9211-8i IT. I've been running them successfully for the last three
> months through mps(4) as part of a raidz pool, but had the pool drop to a
> degraded state when /dev/da0 (and associated gpt device) disappeared after
> some apparent errors.
>
> After a reboot, I noticed that the disk that disappeared - da0 - was
> successfully probed and resilvered back into the existing pool. I ran a
> short SMART self test and everything was fine. I ran a long SMART self test
> and the drive disappeared again towards the end of the scan (I didn't get a
> chance to view the results).
>
> I'd like to know if there's a way to 're-probe' the connections to see if
> there are any devices that can be reconnected. It's clear that the drive is
> still around and at least partially responsive - is there a way I can
> online this disk, as just a device in its own right, so that I can finish
> running the SMART diagnostics?
>
> I've read some old mentions of mps not being the most stable thing under
> load, but the mentions are over a year old. The initial failure happened
> right at the time the daily periodic was running (which includes a check
> for negative permissions on the ZFS partition) and the second failure was
> during a SMART long test, so I guess there's potential for "load" there.
> How might I go about diagnosing whether this is just the drive or possibly
> the card itself? I suppose the obvious "move it off the RAID card" is
> probably a good first step...
>
> $ uname -a
> FreeBSD blackfyre 9.1-RELEASE-p4 FreeBSD 9.1-RELEASE-p4 #0: Mon Jun 17
> 11:42:37 UTC 2013     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC
>  amd64
>
> dmesg output when things started going south the first time:
>
> Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): WRITE(10). CDB: 2a 0
> 5a ca e4 98 0 0 8 0 length 4096 SMID 563 terminated ioc 804b scsi 0 state c
> xfer 0
> Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): WRITE(10). CDB: 2a 0
> 23 55 ec 58 0 0 8 0 length 4096 SMID 557 terminated ioc 804b scsi 0 state c
> xfer 0
> Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): WRITE(10). CDB: 2a 0
> 5a d7 a7 f8 0 0 8 0 length 4096 SMID 889 terminated ioc 804b scsi 0 state c
> xfer 0
> Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): WRITE(10). CDB: 2a 0
> 23 55 ec 60 0 0 8 0 length 4096 SMID 61 terminated ioc 804b scsi 0 state c
> xfer 0
> Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): WRITE(10). CDB: 2a 0
> 23 55 ec 60 0 0 8 0
> Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): CAM status: SCSI
> Status Error
> Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): SCSI status: Check
> Condition
> Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): SCSI sense: UNIT
> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jul 11 03:07:20 blackfyre kernel: (da0:mps0:0:0:0): Retrying command (per
> sense data)
> Jul 11 03:07:25 blackfyre kernel: (da0:mps0:0:0:0): WRITE(10). CDB: 2a 0
> 23 55 ec a0 0 0 8 0
> Jul 11 03:07:25 blackfyre kernel: (da0:mps0:0:0:0): CAM status: SCSI
> Status Error
> Jul 11 03:07:25 blackfyre kernel: (da0:mps0:0:0:0): SCSI status: Check
> Condition
> Jul 11 03:07:25 blackfyre kernel: (da0:mps0:0:0:0): SCSI sense: UNIT
> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jul 11 03:07:25 blackfyre kernel: (da0:mps0:0:0:0): Retrying command (per
> sense data)
>
> Device picked up again on restart:
>
> Jul 14 15:04:15 blackfyre kernel: da0 at mps0 bus 0 scbus0 target 0 lun 0
> Jul 14 15:04:15 blackfyre kernel: da0: <ATA ST3000DM001-9YN1 CC4H> Fixed
> Direct Access SCSI-6 device
> Jul 14 15:04:15 blackfyre kernel: da0: 600.000MB/s transfers
> Jul 14 15:04:15 blackfyre kernel: da0: Command Queueing enabled
> Jul 14 15:04:15 blackfyre kernel: da0: 2861588MB (5860533168 512 byte
> sectors: 255H 63S/T 364801C)
>
> Device going south a second time:
>
> Jul 14 18:36:56 blackfyre kernel: (da0:mps0:0:0:0): READ(10). CDB: 28 0 d
> 7d 76 10 0 0 38 0
> Jul 14 18:36:56 blackfyre kernel: (da0:mps0:0:0:0): CAM status: SCSI
> Status Error
> Jul 14 18:36:56 blackfyre kernel: (da0:mps0:0:0:0): SCSI status: Check
> Condition
> Jul 14 18:36:56 blackfyre kernel: (da0:mps0:0:0:0): SCSI sense: ABORTED
> COMMAND asc:47,3 (Information unit iuCRC error detected)
> Jul 14 18:36:56 blackfyre kernel: (da0:mps0:0:0:0): Retrying command (per
> sense data)
> Jul 14 18:37:02 blackfyre kernel: (da0:mps0:0:0:0): READ(10). CDB: 28 0 d
> 94 b8 20 0 0 38 0
> Jul 14 18:37:02 blackfyre kernel: (da0:mps0:0:0:0): CAM status: SCSI
> Status Error
> Jul 14 18:37:02 blackfyre kernel: (da0:mps0:0:0:0): SCSI status: Check
> Condition
> Jul 14 18:37:02 blackfyre kernel: (da0:mps0:0:0:0): SCSI sense: ABORTED
> COMMAND asc:47,3 (Information unit iuCRC error detected)
> Jul 14 18:37:02 blackfyre kernel: (da0:mps0:0:0:0): Retrying command (per
> sense data)
>
>
> Culminating in the device being removed from /dev/:
>
> Jul 14 18:39:17 blackfyre kernel: (noperiph:mps0:0:0:0): SMID 3 finished
> recovery after aborting TaskMID 667
> Jul 14 18:39:17 blackfyre kernel: mps0: mpssas_free_tm releasing simq
> Jul 14 18:39:22 blackfyre kernel: (da0:mps0:0:0:0): TEST UNIT READY. CDB:
> 0 0 0 0 0 0 length 0 SMID 969 terminated ioc 804b scsi 0 state c xfer 0
> Jul 14 18:39:25 blackfyre kernel: (da0:mps0:0:0:0): TEST UNIT READY. CDB:
> 0 0 0 0 0 0 length 0 SMID 774 terminated ioc 804b scsi 0 state c xfer 0
> Jul 14 18:39:29 blackfyre kernel: (da0:mps0:0:0:0): TEST UNIT READY. CDB:
> 0 0 0 0 0 0 length 0 SMID 880 terminated ioc 804b scsi 0 state c xfer 0
> Jul 14 18:39:33 blackfyre kernel: (da0:mps0:0:0:0): TEST UNIT READY. CDB:
> 0 0 0 0 0 0 length 0 SMID 722 terminated ioc 804b scsi 0 state c xfer 0
> Jul 14 18:39:33 blackfyre kernel: (da0:mps0:0:0:0): TEST UNIT READY. CDB:
> 0 0 0 0 0 0 length 0 SMID 244 terminated ioc 804b scsi 0 state c xfer 0
> Jul 14 18:39:37 blackfyre kernel: (da0:mps0:0:0:0): TEST UNIT READY. CDB:
> 0 0 0 0 0 0 length 0 SMID 911 terminated ioc 804b scsi 0 state c xfer 0
> Jul 14 18:40:02 blackfyre kernel: mps0: mpssas_alloc_tm freezing simq
> Jul 14 18:40:02 blackfyre kernel: mps0: mpssas_remove_complete on handle
> 0x0009, IOCStatus= 0x0
> Jul 14 18:40:02 blackfyre kernel: mps0: mpssas_free_tm releasing simq
> Jul 14 18:40:02 blackfyre kernel: (da0:(pass0:mps0:0:0:mps0:0:0): lost
> device - 3 outstanding, 2 refs
> Jul 14 18:40:02 blackfyre kernel: 0:0): passdevgonecb: devfs entry is gone
> Jul 14 18:40:02 blackfyre kernel: (da0:mps0:0:0:0): oustanding 2
> Jul 14 18:40:02 blackfyre kernel: (da0:mps0:0:0:0): oustanding 1
> Jul 14 18:40:02 blackfyre kernel: (da0:mps0:0:0:0): oustanding 0
> Jul 14 18:40:02 blackfyre kernel: (da0:mps0:0:0:0): removing device entry
> Jul 14 18:40:34 blackfyre kernel: mpssas_get_sata_identify: error reading
> SATA PASSTHRU; iocstatus = 0x804b
> Jul 14 18:40:34 blackfyre last message repeated 4 times
> Jul 14 18:40:34 blackfyre kernel: _mapping_get_dev_info: failed to compute
> the hashed SAS Address for SATA device with handle 0x0009
> Jul 14 18:40:34 blackfyre kernel: mpssas_get_sata_identify: error reading
> SATA PASSTHRU; iocstatus = 0x804b
> Jul 14 18:40:34 blackfyre last message repeated 4 times
> Jul 14 18:41:10 blackfyre kernel: mps0: mpssas_alloc_tm freezing simq
> Jul 14 18:41:12 blackfyre kernel: (probe0:mps0:0:6:0): INQUIRY. CDB: 12 0
> 0 0 24 0 length 36 SMID 75 terminated ioc 804b scsi 0 state c xfer 0
> Jul 14 18:41:12 blackfyre kernel: mps0: IOCStatus = 0x4b while resetting
> device 0x9
> Jul 14 18:41:12 blackfyre kernel: mps0: mpssas_free_tm releasing simq
>
>
I should note that `camcontrol rescan 0` (or `camcontrol rescan all`) won't
find da0.
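
For anyone reading the logs above, the CDB bytes can be decoded to see which
blocks each failed command was touching. Below is a minimal sketch in Python,
not something taken from the mps driver. The function name `decode_rw10_cdb`
is made up for illustration, and it assumes the usual 10-byte READ(10)/WRITE(10)
layout (opcode, flags, 4-byte big-endian LBA, group number, 2-byte big-endian
transfer length, control) and the 512-byte sectors reported in the dmesg probe
output.

```python
SECTOR_SIZE = 512  # assumption: taken from the "512 byte sectors" probe line

# Opcodes for the two commands that appear in the quoted logs.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


def decode_rw10_cdb(cdb_text):
    """Decode a READ(10)/WRITE(10) CDB printed as space-separated hex bytes.

    Returns (command name, starting LBA, transfer length in blocks).
    """
    cdb = [int(tok, 16) for tok in cdb_text.split()]
    if len(cdb) != 10:
        raise ValueError("expected 10 CDB bytes, got %d" % len(cdb))
    if cdb[0] not in OPCODES:
        raise ValueError("not a READ(10)/WRITE(10) opcode: 0x%02x" % cdb[0])
    lba = int.from_bytes(bytes(cdb[2:6]), "big")     # bytes 2-5: starting LBA
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")  # bytes 7-8: block count
    return OPCODES[cdb[0]], lba, blocks


if __name__ == "__main__":
    # CDBs copied from the quoted log lines, rejoined across the mail wrapping.
    for text in ("2a 0 5a ca e4 98 0 0 8 0", "28 0 d 7d 76 10 0 0 38 0"):
        name, lba, blocks = decode_rw10_cdb(text)
        print(name, "lba", lba, "blocks", blocks, "bytes", blocks * SECTOR_SIZE)
```

For the first WRITE(10) in the log this gives 8 blocks, i.e. 4096 bytes, which
agrees with the "length 4096" the driver printed on the same line.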