From owner-freebsd-scsi@FreeBSD.ORG  Tue Nov  1 19:18:07 2011
Return-Path: <owner-freebsd-scsi@FreeBSD.ORG>
Delivered-To: freebsd-scsi@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C5846106566C
	for <freebsd-scsi@freebsd.org>; Tue,  1 Nov 2011 19:18:07 +0000 (UTC)
	(envelope-from peter.maloney@brockmann-consult.de)
Received: from mo-p05-ob6.rzone.de (mo-p05-ob6.rzone.de
	[IPv6:2a01:238:20a:202:53f5::1])
	by mx1.freebsd.org (Postfix) with ESMTP id C91908FC18
	for <freebsd-scsi@freebsd.org>; Tue,  1 Nov 2011 19:18:06 +0000 (UTC)
X-RZG-AUTH: :LWIKdA2leu0bPbLmhzXgqn0MTG6qiKEwQRWfNxSw4HzYIwjsnvdDt2oX8drk23mpKMZH7NA=
X-RZG-CLASS-ID: mo05
Received: from [192.168.179.42]
	(hmbg-5f7606d1.pool.mediaWays.net [95.118.6.209])
	by post.strato.de (mrclete mo57) (RZmta 26.10 AUTH)
	with (DHE-RSA-AES128-SHA encrypted) ESMTPA id w01ed2nA1IP5pQ
	for <freebsd-scsi@freebsd.org>; Tue, 1 Nov 2011 20:17:45 +0100 (MET)
Message-ID: <4EAEF431.7090108@brockmann-consult.de>
Date: Mon, 31 Oct 2011 20:17:05 +0100
From: Peter Maloney <peter.maloney@brockmann-consult.de>
User-Agent: Mozilla/5.0 (Windows NT 5.1;
	rv:7.0.1) Gecko/20110929 Thunderbird/7.0.1
MIME-Version: 1.0
To: freebsd-scsi@freebsd.org
References: <CAAAm0r2-pXLEZVoG7g_dkym6MzLJXggjOQh3a8t5QO90vPJvfw@mail.gmail.com>
In-Reply-To: <CAAAm0r2-pXLEZVoG7g_dkym6MzLJXggjOQh3a8t5QO90vPJvfw@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Content-Filtered-By: Mailman/MimeDel 2.1.5
Subject: Re: mps/LSI SAS2008 controller crashes when smartctl is run with
 upped disk tags
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
	<mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
	<mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Nov 2011 19:18:07 -0000

Dear Jason,

Am 01.11.2011 19:13, schrieb Jason Wolfe:
> Hello,
>
> I have an issue with the mps driver on 8.2 where running 'smartctl -a'
> rarely causes the controller to freak out when disk tags are > 2.  I've
> confirmed settings the tags to 1 resolves this crash, so that surely is a
> clue in the right direction..  I'm using Seagate 1TB SAS drives -
> ST91000640SS, and these are SuperMicro X8DTT-H chasis.  This happens across
> over a thousand servers, so it surely not flaky hardware.  It could
> obviously be some interoperability with these model drives and the mps
> controller, but unfortunately I don't have any other drives deployed on
> these cards to test that theory out :/
I get a simlar problem on a system with an LSI 9211-8i with 20 SATA
disks attached (2 SSDs and 18 spnning disks). My system doesn't hang,
panic, or reset though. I just lose access to one disk, which is then
considered FAULTED in my zpool status (with the ZFS file system). If I
physically remove the FAULTED disk and run "gpart recover da0", I get a
panic. Otherwise, the system keeps running in a degraded state.  When I
reboot and resilver, some data is found damaged and repaired, not just
refreshed with the latest state. The server has 1 HBA and 2 backplanes,
and I have the 2 mirrored root disks on different backplanes. Maybe that
is why mine runs degraded and yours hang.

This happened twice so far (in around a month or two), and both times it
was one of the mirrored root disks (SSDs) that faulted.

My tags are set to 255. I will try reproducing it as you said, and then
if it fails, rebooting and trying again setting tags to 2 as you suggested.

And *thank you very much for this information*. This is the last
outstanding issue with this server. I hope this workaround helps.

# camcontrol tags /dev/da0
(pass0:mps0:0:7:0): device openings: 255

>
> Luckily remote syslogging is enabled, so while nothing is kept locally, we
> see these messages similar to these transmitted before the server hangs,
> requiring a power cycle:
>
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID
> 510
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID
> 713
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID
> 942
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID
> 356
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID
> 492
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID
> 976
> (da11:mps0:0:12:0): SCSI command timeout on device handle 0x0015 SMID
> 339
> (da11:mps0:0:12:0): SCSI command timeout on device handle 0x0015 SMID
> 746
> (da5:mps0:0:6:0): SCSI command timeout on device handle 0x000f SMID 74
> (da6:mps0:0:7:0): SCSI command timeout on device handle 0x0010 SMID
> 613
> (da2:mps0:0:3:0): SCSI command timeout on device handle 0x000c SMID 16
> (da10:mps0:0:11:0): SCSI command timeout on device handle 0x0014 SMID
> 305
> (da1:mps0:0:2:0): SCSI command timeout on device handle 0x000b SMID 74
> (da6:mps0:0:7:0): SCSI command timeout on device handle 0x0010 SMID
> 594
>
> In some cases that would be followed by this, which would usually be the
> last transmission, though we don't see this in all cases.  It may just be
> the system isn't always alive long enough to transmit:
>
> kernel: mps0: IOC Fault 0x40006003, Resetting
>
>
> I'm able to reproduce fairly easily within a minute or two by heavily
> loading the disks up by whatever means, and running smartctl -a in a loop:
>
> #!/bin/sh -x
>
> disks=`sysctl -n kern.disks|xargs -n1|grep ^da`
>
> for disk in $disks; do
> camcontrol tags $disk -N 4
> done
>
> for z in `yes|head -100`; do
> for disk in $disks; do
> smartctl -s on -a /dev/$disk
> done
> done
>
> mps0: <LSI SAS2008> port 0xe000-0xe0ff mem
> 0xfbd3c000-0xfbd3ffff,0xfbd40000-0xfbd7ffff irq 26 at device 0.0 on pci4
> mps0: Firmware: 07.00.00.00
> mps0: IOCCapabilities:
> 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
> mps0: [ITHREAD]
> da0 at mps0 bus 0 scbus0 target 1 lun 0
> da1 at mps0 bus 0 scbus0 target 2 lun 0
> da2 at mps0 bus 0 scbus0 target 3 lun 0
> da3 at mps0 bus 0 scbus0 target 4 lun 0
> da4 at mps0 bus 0 scbus0 target 5 lun 0
> da5 at mps0 bus 0 scbus0 target 6 lun 0
> da6 at mps0 bus 0 scbus0 target 7 lun 0
> da7 at mps0 bus 0 scbus0 target 8 lun 0
> da8 at mps0 bus 0 scbus0 target 9 lun 0
> da9 at mps0 bus 0 scbus0 target 10 lun 0
> da10 at mps0 bus 0 scbus0 target 11 lun 0
> da11 at mps0 bus 0 scbus0 target 12 lun 0
> ses0 at mps0 bus 0 scbus0 target 13 lun 0
>
> mps0@pci0:4:0:0: class=0x010700 card=0x040015d9 chip=0x00721000 rev=0x02
> hdr=0x00
> vendor = 'LSI Logic (Was: Symbios Logic, NCR)'
> class = mass storage
> subclass = SAS
>
> <SEAGATE ST91000640SS 0001> at scbus0 target 1 lun 0 (pass0,da0)
> <SEAGATE ST91000640SS 0001> at scbus0 target 2 lun 0 (pass1,da1)
> <SEAGATE ST91000640SS 0001> at scbus0 target 3 lun 0 (pass2,da2)
> <SEAGATE ST91000640SS 0001> at scbus0 target 4 lun 0 (pass3,da3)
> <SEAGATE ST91000640SS 0001> at scbus0 target 5 lun 0 (pass4,da4)
> <SEAGATE ST91000640SS 0001> at scbus0 target 6 lun 0 (pass5,da5)
> <SEAGATE ST91000640SS 0001> at scbus0 target 7 lun 0 (pass6,da6)
> <SEAGATE ST91000640SS 0001> at scbus0 target 8 lun 0 (pass7,da7)
> <SEAGATE ST91000640SS 0001> at scbus0 target 9 lun 0 (pass8,da8)
> <SEAGATE ST91000640SS 0001> at scbus0 target 10 lun 0 (pass9,da9)
> <SEAGATE ST91000640SS 0001> at scbus0 target 11 lun 0 (pass10,da10)
> <SEAGATE ST91000640SS 0001> at scbus0 target 12 lun 0 (pass11,da11)
> <LSI CORP SAS2X28 0717> at scbus0 target 13 lun 0 (ses0,pass12)
>
> Thank you sirs,
>
> Jason Wolfe
> _______________________________________________
> freebsd-scsi@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"


And my logs to compare:

(note my root, swap, zfs cache, and zfs log is on the disk that fails).
)

root@bcnas1:/var/log# swapinfo
Device          1K-blocks     Used    Avail Capacity
/dev/gpt/swap0     524288     5840   518448     1%
/dev/gpt/swap1     524288     5640   518648     1%
Total             1048576    11480  1037096     1%

When it starts happening, it looks like this:
Oct 29 00:02:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 220
Oct 29 00:02:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 87
Oct 29 00:02:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 795
Oct 29 00:02:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 423
Oct 29 00:02:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 338
Oct 29 02:19:12 bcnas1 kernel: :9:0): SCSI command timeout on device
handle 0x0016 SMID 170
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 637
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 335
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 798
Oct 29 02:19:12 bcnas1 kernel: mps0: (0:9:0) terminated ioc 804b scsi 0
state c xfer 0
Oct 29 02:19:12 bcnas1 last message repeated 14 times
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 991 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 4
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 4 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 227
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 227 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 652
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 652 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 125
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 125 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 101
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 1017 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 100
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 1004 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 487
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 487 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 279
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 279 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 929
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 929 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 346
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 346 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 817
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 817 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 170
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 170 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 637
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 637 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 335
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 335 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 798
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 798 complete
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 757
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 833
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 804
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 464
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 144
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 912
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 753
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 422
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 241

And then just before I rebooted it, basically looked the same, with the
different messages mixed together:

ct 31 07:52:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 807
ct 31 07:52:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 807 complete
ct 31 07:53:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 1006
ct 31 07:53:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 111
ct 31 07:53:20 bcnas1 kernel: mps0: (0:9:0) terminated ioc 804b scsi 0
state c xfer 0
ct 31 07:53:20 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 1006 complete
ct 31 07:53:20 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 111
ct 31 07:53:20 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 111 complete
ct 31 07:54:20 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 669
ct 31 07:54:20 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 912
ct 31 07:54:28 bcnas1 kernel: mps0: (0:9:0) terminated ioc 804b scsi 0
state c xfer 0
ct 31 07:54:28 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 669 complete
ct 31 07:54:28 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 912
ct 31 07:54:28 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 912 complete
ct 31 07:55:29 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 804
ct 31 07:55:29 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 1001
ct 31 07:55:36 bcnas1 kernel: mps0: (0:9:0) terminated ioc 804b scsi 0
state c xfer 0
ct 31 07:55:36 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 804 complete
ct 31 07:55:36 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 1001
ct 31 07:55:36 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 1001 complete
ct 31 07:56:36 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 389
ct 31 07:56:36 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 885
ct 31 07:56:44 bcnas1 kernel: mps0: (0:9:0) terminated ioc 804b scsi 0
state c xfer 0
ct 31 07:56:44 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 389 complete
ct 31 07:56:44 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 885
ct 31 07:56:44 bcnas1 kernel: swap_pager: I/O error - pageout failed;
blkno 131393,size 65536, error 5
ct 31 07:56:44 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 885 complete
ct 31 07:57:45 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 442
ct 31 07:57:48 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 442 complete
ct 31 07:58:49 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 413
ct 31 07:58:52 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 413 complete
ct 31 07:59:53 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 90
ct 31 07:59:56 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 90 complete
ct 31 08:00:56 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 504
ct 31 08:01:00 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 504 complete
ct 31 08:02:01 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 861
ct 31 08:02:04 bcnas1 kernel: mps0: swap_pager: I/O error - pageout
failed; blkno 131409,size 49152, error 5mpssas_abort_complete: abort
request on handle 0x16
SMID 861 complete