Date: Wed, 2 Mar 2016 08:23:07 +0100
From: Borja Marcos <borjam@sarenet.es>
To: Steven Hartland <killing@multiplay.co.uk>
Cc: freebsd-scsi@freebsd.org
Subject: Re: mpr(4) SAS3008 Repeated Crashing
Message-ID: <A8859ECA-0B58-42A8-AA49-DF6AA3D52CC6@sarenet.es>
In-Reply-To: <56D612FA.6090909@multiplay.co.uk>
References: <56D5FDB8.8040402@freebsd.org> <56D612FA.6090909@multiplay.co.uk>

> On 01 Mar 2016, at 23:08, Steven Hartland <killing@multiplay.co.uk> wrote:
>
> Initial ideas would be bad signalling.
>
> If you have the option to drop the speeds down and that helps, then that is almost certainly the case.
>
> The original mfi driver was very bad at recovering from issues like this too. I spent over a month fixing and patching it to get it working reliably when there were hardware-related issues. In my case it turned out to be a dodgy CPU causing memory corruption, but you'll get similar behaviour from badly designed installs, particularly with expanders in play for high-speed devices (6-12Gbps link speed).

I’ve suffered similar problems, although not as severe, on one of my storage servers. It’s an IBM X Series with an LSI 3008 HBA connected to the backplane, using SATA SSDs. But mine are almost certainly hardware problems. An identical system is working without issues.

The symptom: with high I/O activity, for example when running Bonnie++, some commands abort with the disks returning a unit attention (power on/reset), asc:29,0.

This definitely looks like a hardware problem to me. It might be the backplane (it doesn’t affect the same disk every time, it’s completely random) or maybe a power supply problem making the disks reset.

And it hasn’t caused serious data corruption. (It’s decommissioned for now, of course!) Now and then ZFS will complain of a checksum failure, but a scrub fixes it.

Now I’m fighting with IBM (now Lenovo) because all the components were sourced from them and it’s their call to debug it. Maybe I’ll hook an oscilloscope to the power rails to check for suspicious transients or something like that, though.

So far their response has been absolutely unacceptable.
They ask for the “RAID vendor”, and they seem unable to understand that someone might want to run these things with an OS other than Windows, and without creating RAID volumes with the built-in controller.

Sigh. Maybe I could bribe someone to pose as “RAID vendor” ;)

Feb 12 07:43:59 clientes-ssd8 kernel: (noperiph:mpr0:0:4294967295:0): SMID 33 Aborting command 0xfffffe0000c7baf0
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): WRITE(10). CDB: 2a 00 39 a1 fe f0 00 00 20 00 length 16384 SMID 989 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 20 00 00 18 00 length 12288 SMID 953 terminated ioc 804b scsi 0 state c xfe(da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 40 00 00 20 00
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): CAM status: Command timeout
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 00 00 00 20 00 length 16384 SMID 571 terminated ioc 804b scsi 0 state c xfe(da14:r 0
Feb 12 07:43:59 clientes-ssd8 kernel: mpr0:0: (da14:mpr0:0:40:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 638 te40:rminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): WRITE(10). CDB: 2a 00 39 a1 fe f0 00 00 20 00 length 16384 SMID 818 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 40 00 00 20 00 length 16384 SMID 952 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 20 00 00 18 00 length 12288 SMID 922 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 00 00 00 20 00 length 16384 SMID 823 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): WRITE(10). CDB: 2a 00 39 a1 fe f0 00 00 20 00
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): CAM status: SCSI Status Error
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): SCSI status: Check Condition
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): Retrying command (per sense data)

Borja.
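
For anyone reading the log excerpt in this message, the following is a rough sketch of how the numeric fields can be picked apart. It is not taken from the mpr(4) driver. The bit layout of log_info is inferred from the decomposition the driver itself printed (0x31120440 gives originator PL, code 0x12, sub_code 0x0440), the IOP and IR entries in the originator table are assumptions, and the 512-byte sector size is assumed because it matches the logged transfer lengths. The function names are made up for illustration.

```python
# Assumed layout, inferred from the driver's own printed decomposition:
# bits 31-28 bus type, 27-24 originator, 23-16 code, 15-0 sub_code.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}  # only PL is confirmed by the log


def decode_log_info(value):
    """Split a 32-bit log_info word into its assumed fields."""
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


def decode_rw10_cdb(cdb_hex):
    """Decode a READ(10)/WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] not in (0x28, 0x2A):
        raise ValueError("not a READ(10)/WRITE(10) CDB")
    return {
        "op": "READ(10)" if cdb[0] == 0x28 else "WRITE(10)",
        "lba": int.from_bytes(cdb[2:6], "big"),     # bytes 2-5: logical block address
        "blocks": int.from_bytes(cdb[7:9], "big"),  # bytes 7-8: transfer length in blocks
    }


if __name__ == "__main__":
    print(decode_log_info(0x31120440))
    read_cmd = decode_rw10_cdb("28 00 31 40 ea 20 00 00 18 00")
    # Assuming 512-byte sectors, this should match the "length" field in the log.
    print(read_cmd, read_cmd["blocks"] * 512)
```

Decoded this way, the aborted READ(10) and WRITE(10) commands all target nearby LBAs on da14, and the block counts multiplied by 512 agree with the lengths the driver logged.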