From owner-freebsd-scsi@FreeBSD.ORG  Tue Nov  1 18:42:03 2011
Return-Path: <owner-freebsd-scsi@FreeBSD.ORG>
Delivered-To: freebsd-scsi@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 7879D106564A
	for <freebsd-scsi@freebsd.org>; Tue,  1 Nov 2011 18:42:03 +0000 (UTC)
	(envelope-from nitroboost@gmail.com)
Received: from mail-dy0-f54.google.com (mail-dy0-f54.google.com
	[209.85.220.54])
	by mx1.freebsd.org (Postfix) with ESMTP id EE6068FC16
	for <freebsd-scsi@freebsd.org>; Tue,  1 Nov 2011 18:42:02 +0000 (UTC)
Received: by dye36 with SMTP id 36so397915dye.13
	for <freebsd-scsi@freebsd.org>; Tue, 01 Nov 2011 11:42:01 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=mime-version:date:message-id:subject:from:to:content-type;
	bh=3trq6LFWsnHWqs3tuc6y7zF9rP+zyikr5hs04y/L/rs=;
	b=Fugmkxctt2nkyllbtSUgysM7z1Fu7JrL3e6rypghXRVVmM+Hs5UuUuFuYODecwqyzM
	GRetBFz+3GoAL3pRibYtB2RN1dbc+fsEZSOIJCxgJmL0HdN8j4OJgyJ+U8g9GvF+qvxW
	s2Y1hkd8bYvz2r17AHIWWO0Y6y7XNIWxp6r+M=
MIME-Version: 1.0
Received: by 10.182.115.40 with SMTP id jl8mr157403obb.8.1320171197190; Tue,
	01 Nov 2011 11:13:17 -0700 (PDT)
Received: by 10.182.35.193 with HTTP; Tue, 1 Nov 2011 11:13:17 -0700 (PDT)
Date: Tue, 1 Nov 2011 11:13:17 -0700
Message-ID: <CAAAm0r2-pXLEZVoG7g_dkym6MzLJXggjOQh3a8t5QO90vPJvfw@mail.gmail.com>
From: Jason Wolfe <nitroboost@gmail.com>
To: freebsd-scsi@freebsd.org
Content-Type: text/plain; charset=ISO-8859-1
X-Content-Filtered-By: Mailman/MimeDel 2.1.5
Subject: mps/LSI SAS2008 controller crashes when smartctl is run with upped
 disk tags
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
	<mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
	<mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Nov 2011 18:42:03 -0000

Hello,

I have an issue with the mps driver on FreeBSD 8.2 where running 'smartctl -a'
occasionally causes the controller to freak out when disk tags are > 2.  I've
confirmed that setting the tags to 1 resolves this crash, so that is surely a
clue in the right direction.  I'm using Seagate 1TB SAS drives
(ST91000640SS) in SuperMicro X8DTT-H chassis.  This happens across
over a thousand servers, so it is surely not flaky hardware.  It could
obviously be some interoperability issue between this drive model and the mps
controller, but unfortunately I don't have any other drives deployed on
these cards to test that theory :/
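
For anyone who wants the tags-to-1 workaround in the meantime, here is a
minimal sketch of how it could be applied to every da disk.  The function
names da_disks and set_tags and the RUN variable are made up for
illustration; only the camcontrol and sysctl invocations come from this
report.  By default it only prints the commands it would run.

```sh
#!/bin/sh
# Pick the da(4) disks out of a kern.disks-style, space-separated list.
# (da_disks is a hypothetical helper name.)
da_disks() {
    # $1 is left unquoted on purpose so the list splits into one name per line
    printf '%s\n' $1 | grep '^da[0-9]'
}

# Print the camcontrol command that limits each disk to N tags,
# or actually run it when RUN=1 is set in the environment.
# (set_tags and RUN are hypothetical names.)
set_tags() {
    tags=$1; shift
    for disk in "$@"; do
        if [ "${RUN:-0}" = 1 ]; then
            camcontrol tags "$disk" -N "$tags"
        else
            echo "camcontrol tags $disk -N $tags"
        fi
    done
}

# On an affected host, something like:
#   RUN=1 set_tags 1 $(da_disks "$(sysctl -n kern.disks)")
```

Without RUN=1 this is a dry run, which makes it easy to check the disk list
before touching a production box.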

Luckily remote syslogging is enabled, so while nothing is kept locally, we
see messages similar to these transmitted before the server hangs,
requiring a power cycle:

(da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 510
(da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 713
(da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 942
(da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 356
(da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 492
(da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 976
(da11:mps0:0:12:0): SCSI command timeout on device handle 0x0015 SMID 339
(da11:mps0:0:12:0): SCSI command timeout on device handle 0x0015 SMID 746
(da5:mps0:0:6:0): SCSI command timeout on device handle 0x000f SMID 74
(da6:mps0:0:7:0): SCSI command timeout on device handle 0x0010 SMID 613
(da2:mps0:0:3:0): SCSI command timeout on device handle 0x000c SMID 16
(da10:mps0:0:11:0): SCSI command timeout on device handle 0x0014 SMID 305
(da1:mps0:0:2:0): SCSI command timeout on device handle 0x000b SMID 74
(da6:mps0:0:7:0): SCSI command timeout on device handle 0x0010 SMID 594

In some cases those are followed by the message below, which is usually the
last transmission.  We don't see it in every case, but that may just be
because the system doesn't always stay alive long enough to transmit it:

kernel: mps0: IOC Fault 0x40006003, Resetting
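
To see how the timeouts are spread across disks, a rough sketch like the
following can tally them per device from the remote syslog.  The function
name summarize_timeouts is made up for illustration, and it assumes the log
text arrives on stdin with the device tuple and the "SCSI command timeout"
text on the same line.

```sh
#!/bin/sh
# Count mps "SCSI command timeout" messages per device.
# Reads log text on stdin and prints "<device> <count>", sorted by name.
# (summarize_timeouts is a hypothetical helper name.)
summarize_timeouts() {
    awk '
        /SCSI command timeout on device handle/ {
            # Find the "(daN:" prefix wherever it sits on the line
            if (match($0, /\(da[0-9]+:/)) {
                # Strip the leading "(" and trailing ":" to get "daN"
                dev = substr($0, RSTART + 1, RLENGTH - 2)
                count[dev]++
            }
        }
        END { for (d in count) print d, count[d] }
    ' | sort
}

# Example, with a hypothetical log path:
#   summarize_timeouts < /var/log/remote/host.log
```

Run against the excerpt above, this attributes six of the fourteen timeouts
to da0, with the rest spread over six other disks.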


I'm able to reproduce this fairly easily, within a minute or two, by loading
the disks heavily by whatever means and running smartctl -a in a loop:

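#!/bin/sh -x

# All da(4) disks known to the kernel, one name per line
disks=`sysctl -n kern.disks|xargs -n1|grep ^da`

# Set the number of tagged openings on every disk to 4
for disk in $disks; do
	camcontrol tags $disk -N 4
done

# Enable SMART and dump all SMART data from every disk, 100 passes
for z in `yes|head -100`; do
	for disk in $disks; do
		smartctl -s on -a /dev/$disk
	done
done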

mps0: <LSI SAS2008> port 0xe000-0xe0ff mem 0xfbd3c000-0xfbd3ffff,0xfbd40000-0xfbd7ffff irq 26 at device 0.0 on pci4
mps0: Firmware: 07.00.00.00
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
mps0: [ITHREAD]
da0 at mps0 bus 0 scbus0 target 1 lun 0
da1 at mps0 bus 0 scbus0 target 2 lun 0
da2 at mps0 bus 0 scbus0 target 3 lun 0
da3 at mps0 bus 0 scbus0 target 4 lun 0
da4 at mps0 bus 0 scbus0 target 5 lun 0
da5 at mps0 bus 0 scbus0 target 6 lun 0
da6 at mps0 bus 0 scbus0 target 7 lun 0
da7 at mps0 bus 0 scbus0 target 8 lun 0
da8 at mps0 bus 0 scbus0 target 9 lun 0
da9 at mps0 bus 0 scbus0 target 10 lun 0
da10 at mps0 bus 0 scbus0 target 11 lun 0
da11 at mps0 bus 0 scbus0 target 12 lun 0
ses0 at mps0 bus 0 scbus0 target 13 lun 0

mps0@pci0:4:0:0: class=0x010700 card=0x040015d9 chip=0x00721000 rev=0x02 hdr=0x00
    vendor = 'LSI Logic (Was: Symbios Logic, NCR)'
    class = mass storage
    subclass = SAS

<SEAGATE ST91000640SS 0001> at scbus0 target 1 lun 0 (pass0,da0)
<SEAGATE ST91000640SS 0001> at scbus0 target 2 lun 0 (pass1,da1)
<SEAGATE ST91000640SS 0001> at scbus0 target 3 lun 0 (pass2,da2)
<SEAGATE ST91000640SS 0001> at scbus0 target 4 lun 0 (pass3,da3)
<SEAGATE ST91000640SS 0001> at scbus0 target 5 lun 0 (pass4,da4)
<SEAGATE ST91000640SS 0001> at scbus0 target 6 lun 0 (pass5,da5)
<SEAGATE ST91000640SS 0001> at scbus0 target 7 lun 0 (pass6,da6)
<SEAGATE ST91000640SS 0001> at scbus0 target 8 lun 0 (pass7,da7)
<SEAGATE ST91000640SS 0001> at scbus0 target 9 lun 0 (pass8,da8)
<SEAGATE ST91000640SS 0001> at scbus0 target 10 lun 0 (pass9,da9)
<SEAGATE ST91000640SS 0001> at scbus0 target 11 lun 0 (pass10,da10)
<SEAGATE ST91000640SS 0001> at scbus0 target 12 lun 0 (pass11,da11)
<LSI CORP SAS2X28 0717> at scbus0 target 13 lun 0 (ses0,pass12)

Thank you sirs,

Jason Wolfe