From owner-freebsd-scsi@freebsd.org  Wed Mar  2 07:23:17 2016
Content-Type: text/plain; charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 9.2 \(3112\))
Subject: Re: mpr(4) SAS3008 Repeated Crashing
From: Borja Marcos <borjam@sarenet.es>
In-Reply-To: <56D612FA.6090909@multiplay.co.uk>
Date: Wed, 2 Mar 2016 08:23:07 +0100
Cc: freebsd-scsi@freebsd.org
Content-Transfer-Encoding: quoted-printable
Message-Id: <A8859ECA-0B58-42A8-AA49-DF6AA3D52CC6@sarenet.es>
References: <56D5FDB8.8040402@freebsd.org> <56D612FA.6090909@multiplay.co.uk>
To: Steven Hartland <killing@multiplay.co.uk>
X-Mailer: Apple Mail (2.3112)


> On 01 Mar 2016, at 23:08, Steven Hartland <killing@multiplay.co.uk> wrote:
>
> Initial ideas would be bad signalling.
>
> If you have the option to drop the speeds down and that helps, then
> that is almost certainly the case.
>
> The original mfi driver was very bad at recovering from issues like
> this too. I spent over a month fixing and patching it to get it working
> reliably when there were hardware-related issues. In my case it turned
> out to be a dodgy CPU causing memory corruption, but you'll get similar
> behaviour from badly designed installs, particularly with expanders in
> play for devices with high (6-12Gbps) link speeds.

I’ve suffered similar problems, although not as severe, on one of my
storage servers. It’s an IBM X Series with an LSI 3008 HBA connected to
the backplane, using SATA SSDs. But mine are almost certainly hardware
problems. An identical system is working without issues.

The symptom: under high I/O activity (for example, running Bonnie++),
some commands abort with the disks returning a unit attention
(power on/reset), asc 29,0.

This definitely looks like a hardware problem to me. It might be the
backplane (it doesn’t affect the same disk every time, it’s completely
random) or maybe a power supply problem making the disks reset?
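
The "completely random" observation can be checked against the logs by
tallying reset events per disk. A minimal Python sketch, assuming kernel
log lines in the same format as the excerpt at the end of this message;
the function name count_resets and the SAMPLE list are made up for
illustration, and the two sample lines are copied from that excerpt:

```python
import re
from collections import Counter

# Matches the device name in kernel lines such as
# "(da14:mpr0:0:40:0): SCSI sense: UNIT ATTENTION asc:29,0 (...)".
# asc:29,0 is "Power on, reset, or bus device reset occurred".
RESET_RE = re.compile(
    r"\((da\d+):mpr\d+:[^)]*\): SCSI sense: UNIT ATTENTION asc:29,0"
)


def count_resets(lines):
    """Return a Counter mapping device name -> number of asc 29,0 events."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two lines taken from the log excerpt; only the second one is a reset.
SAMPLE = [
    "Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): "
    "CAM status: Command timeout",
    "Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): "
    "SCSI sense: UNIT ATTENTION asc:29,0 "
    "(Power on, reset, or bus device reset occurred)",
]

for device, events in count_resets(SAMPLE).most_common():
    print(device, events)
```

Fed with a full log (for example, the lines of /var/log/messages), a
roughly even spread across da devices would point at something shared,
such as the backplane or the power supply, rather than at one bad disk.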

And it hasn’t caused serious data corruption. (It’s decommissioned for
now, of course!) Now and then ZFS will complain of a checksum failure,
but a scrub fixes it.

Now I’m fighting with IBM (now Lenovo) because all the components were
sourced from them and it’s up to them to debug it. Maybe I’ll hook an
oscilloscope to the power rails to check for suspicious transients or
something like that, though. So far their response has been absolutely
unacceptable. They ask for the “RAID vendor”, and they seem unable to
understand that someone might want to run these things with an OS other
than Windows, and without creating RAID volumes with the built-in
controller. Sigh.

Maybe I could bribe someone to pose as “RAID vendor” ;)


Feb 12 07:43:59 clientes-ssd8 kernel: (noperiph:mpr0:0:4294967295:0): SMID 33 Aborting command 0xfffffe0000c7baf0
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): WRITE(10). CDB: 2a 00 39 a1 fe f0 00 00 20 00 length 16384 SMID 989 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 20 00 00 18 00 length 12288 SMID 953 terminated ioc 804b scsi 0 state c xfe(da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 40 00 00 20 00
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): CAM status: Command timeout
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 00 00 00 20 00 length 16384 SMID 571 terminated ioc 804b scsi 0 state c xfe(da14:r 0
Feb 12 07:43:59 clientes-ssd8 kernel: mpr0:0:	(da14:mpr0:0:40:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 638 te40:rminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): WRITE(10). CDB: 2a 00 39 a1 fe f0 00 00 20 00 length 16384 SMID 818 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 40 00 00 20 00 length 16384 SMID 952 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 20 00 00 18 00 length 12288 SMID 922 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 00 00 00 20 00 length 16384 SMID 823 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): WRITE(10). CDB: 2a 00 39 a1 fe f0 00 00 20 00
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): CAM status: SCSI Status Error
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): SCSI status: Check Condition
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): Retrying command (per sense data)
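
For anyone reading the numbers in the log, here is a rough Python sketch
of how the log_info word and the READ(10)/WRITE(10) CDBs break down. The
function names are hypothetical. The log_info bit layout is an assumption
inferred from how 0x31120440 lines up with the fields the driver prints,
and the originator names other than PL are likewise assumed. The
512-byte block size is also an assumption, chosen because it matches the
"length" values in the log:

```python
def decode_log_info(value):
    """Split a 32-bit log_info word into its fields.

    Assumed layout: bits 31-28 bus type, 27-24 originator,
    23-16 code, 15-0 sub_code.
    """
    # Only PL is confirmed by the log above; IOP and IR are assumptions.
    originators = {0: "IOP", 1: "PL", 2: "IR"}
    originator = (value >> 24) & 0xF
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": originators.get(originator, "unknown"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


def decode_rw10(cdb_hex, block_size=512):
    """Decode a READ(10)/WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] not in (0x28, 0x2A):
        raise ValueError("not a READ(10)/WRITE(10) CDB")
    # Bytes 2-5 hold the big-endian LBA, bytes 7-8 the block count.
    lba = int.from_bytes(cdb[2:6], "big")
    blocks = int.from_bytes(cdb[7:9], "big")
    return {
        "op": "READ" if cdb[0] == 0x28 else "WRITE",
        "lba": lba,
        "blocks": blocks,
        "bytes": blocks * block_size,  # block_size is an assumption
    }


info = decode_log_info(0x31120440)
print(info["originator"], hex(info["code"]), hex(info["sub_code"]))

write = decode_rw10("2a 00 39 a1 fe f0 00 00 20 00")
print(write["op"], hex(write["lba"]), write["blocks"], write["bytes"])
```

With the values from the log, 0x31120440 splits into originator PL, code
0x12 and sub_code 0x0440, the same fields the driver prints. The
WRITE(10) CDB decodes to 32 blocks at LBA 0x39a1fef0, which is 16384
bytes at 512 bytes per block and agrees with "length 16384".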


Borja.