Date:      Wed, 2 Mar 2016 08:23:07 +0100
From:      Borja Marcos <borjam@sarenet.es>
To:        Steven Hartland <killing@multiplay.co.uk>
Cc:        freebsd-scsi@freebsd.org
Subject:   Re: mpr(4) SAS3008 Repeated Crashing
Message-ID:  <A8859ECA-0B58-42A8-AA49-DF6AA3D52CC6@sarenet.es>
In-Reply-To: <56D612FA.6090909@multiplay.co.uk>
References:  <56D5FDB8.8040402@freebsd.org> <56D612FA.6090909@multiplay.co.uk>


> On 01 Mar 2016, at 23:08, Steven Hartland <killing@multiplay.co.uk> wrote:
>
> Initial ideas would be bad signalling.
>
> If you have the option to drop the speeds down and that helps then
> that is almost certainly the case.
>
> The original mfi driver was very bad at recovering from issues like
> this too, I spent over a month fixing and patching it to get it working
> reliably when there were hardware related issues. In my case it turned
> out to be a dodgy CPU causing memory corruption but you'll get similar
> behaviour from badly designed installs, particularly with expanders in
> play for high speed devices (6-12Gbps link speed).

I’ve suffered similar problems, although not as severe, on one of my
storage servers. It’s an IBM X Series with an LSI 3008 HBA connected to
the backplane, using SATA SSDs. But mine are almost certainly hardware
problems. An identical system is working without issues.

The symptom: under high I/O activity (for example, running Bonnie++),
some commands abort and the disks return a unit attention
(power on/reset), asc 29,0.

It definitely looks like a hardware problem to me. It might be the
backplane (it doesn’t affect the same disk every time, it’s completely
random) or maybe a power supply problem making the disks reset?
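
One way to back up the "completely random" observation is to tally the
unit attentions per disk. The following is only a rough sketch: it
assumes the kernel messages sit one per line in a syslog file, in the
same format as the excerpt further down, and `count_resets` is a
made-up helper name, not part of any FreeBSD tool.

```python
import re
from collections import Counter

# Matches the CAM sense line as it appears in the log excerpt, e.g.
# "(da14:mpr0:0:40:0): SCSI sense: UNIT ATTENTION asc:29,0 (...)"
SENSE_RE = re.compile(
    r"\((da\d+):mpr\d+:\d+:\d+:\d+\): SCSI sense: UNIT ATTENTION asc:29,0"
)


def count_resets(lines):
    """Return a Counter mapping disk name to its number of asc 29,0 events."""
    counts = Counter()
    for line in lines:
        match = SENSE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Example input taken from the log excerpt in this message.
sample = [
    "Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): "
    "SCSI sense: UNIT ATTENTION asc:29,0 "
    "(Power on, reset, or bus device reset occurred)",
    "Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): "
    "Retrying command (per sense data)",
]
for disk, hits in count_resets(sample).most_common():
    print(disk, hits)
```

Feeding it the lines of the real log file instead of `sample` gives one
count per disk. A roughly even spread would point at something shared
(backplane, power), while a single dominant disk would point at that
disk or its slot.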

And it hasn’t caused serious data corruption. (It’s decommissioned for
now, of course!) Now and then ZFS will complain of a checksum failure,
but a scrub fixes it.

Now I’m fighting with IBM (now Lenovo) because all the components were
sourced from them and it’s their call to debug it. Maybe I’ll hook an
oscilloscope to the power rails to check for suspicious transients or
something like that, though. So far their response has been absolutely
unacceptable. They ask for the “RAID vendor”, and they seem unable to
understand that someone might want to run these things with an OS other
than Windows, and without creating RAID volumes with the built-in
controller. Sigh.

Maybe I could bribe someone to pose as “RAID vendor” ;)



Feb 12 07:43:59 clientes-ssd8 kernel: (noperiph:mpr0:0:4294967295:0): SMID 33 Aborting command 0xfffffe0000c7baf0
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): WRITE(10). CDB: 2a 00 39 a1 fe f0 00 00 20 00 length 16384 SMID 989 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 20 00 00 18 00 length 12288 SMID 953 terminated ioc 804b scsi 0 state c xfe(da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 40 00 00 20 00
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): CAM status: Command timeout
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 00 00 00 20 00 length 16384 SMID 571 terminated ioc 804b scsi 0 state c xfe(da14:r 0
Feb 12 07:43:59 clientes-ssd8 kernel: mpr0:0:	(da14:mpr0:0:40:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 638 te40:rminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): WRITE(10). CDB: 2a 00 39 a1 fe f0 00 00 20 00 length 16384 SMID 818 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 40 00 00 20 00 length 16384 SMID 952 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 20 00 00 18 00 length 12288 SMID 922 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 00 00 00 20 00 length 16384 SMID 823 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): WRITE(10). CDB: 2a 00 39 a1 fe f0 00 00 20 00
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): CAM status: SCSI Status Error
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): SCSI status: Check Condition
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): Retrying command (per sense data)
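
For anyone reading the log by hand, here is a small sketch of how the
numbers break down. The `log_info` split follows the fields the driver
prints (originator, code, sub_code). Treating the top nibble as a bus
type and mapping originator values 0/1/2 to IOP/PL/IR is my
understanding of the LSI layout, so take it as an assumption. The CDB
decoder handles only the 10-byte READ/WRITE form, and the 512-byte
sector size is likewise assumed (it is consistent with the "length"
values in the log). Function names are made up for illustration.

```python
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}  # assumed mapping, see above


def decode_log_info(word):
    """Split a 32-bit log_info word into its four fields."""
    return {
        "bus_type": (word >> 28) & 0xF,    # top nibble (assumed meaning)
        "originator": (word >> 24) & 0xF,  # 1 is printed as "PL" in the log
        "code": (word >> 16) & 0xFF,
        "sub_code": word & 0xFFFF,
    }


def decode_cdb10(cdb_hex):
    """Decode a READ(10)/WRITE(10) CDB given as space-separated hex bytes.

    Returns (operation, starting LBA, number of blocks).
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] not in (0x28, 0x2A):
        raise ValueError("not a READ(10)/WRITE(10) CDB")
    operation = "READ(10)" if cdb[0] == 0x28 else "WRITE(10)"
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return operation, lba, blocks


SECTOR_SIZE = 512  # assumption; matches the lengths reported in the log

info = decode_log_info(0x31120440)
print(ORIGINATORS.get(info["originator"]), hex(info["code"]), hex(info["sub_code"]))

operation, lba, blocks = decode_cdb10("28 00 31 40 ea 20 00 00 18 00")
print(operation, hex(lba), blocks, blocks * SECTOR_SIZE)
```

The second call decodes one of the aborted reads: 0x18 blocks of 512
bytes is 12288 bytes, which agrees with the "length 12288" the driver
reports for that command.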






Borja.





