From: Borja Marcos
To: Steven Hartland
Cc: freebsd-scsi@freebsd.org
Subject: Re: mpr(4) SAS3008 Repeated Crashing
Date: Wed, 2 Mar 2016 08:23:07 +0100
In-Reply-To: <56D612FA.6090909@multiplay.co.uk>
References: <56D5FDB8.8040402@freebsd.org> <56D612FA.6090909@multiplay.co.uk>

> On 01 Mar 2016, at 23:08, Steven Hartland wrote:
>
> Initial ideas would be bad signalling.
>
> If you have the option to drop the speeds down and that helps then
> that is almost certainly the case.
>
> The original mfi driver was very bad at recovering from issues like
> this too, I spent over a month fixing and patching it to get it working
> reliably when there were hardware related issues.
> In my case it turned out to be a dodgy CPU causing memory corruption
> but you'll get similar behaviour from badly designed installs,
> particularly with expanders in play for high speed devices (6-12Gbps)
> link speed.

I’ve suffered similar problems, although not as severe, on one of my
storage servers. It’s an IBM X Series with an LSI 3008 HBA connected to
the backplane, using SATA SSDs. But mine are almost certainly hardware
problems. An identical system is working without issues.

The symptom: with high I/O activity, for example, running Bonnie++, some
commands abort with the disks returning a unit attention (power
on/reset), asc 29,0.

This definitely looks like a hardware problem to me. Might be the
backplane (it doesn’t affect the same disk every time, it’s completely
random) or maybe a power supply problem making the disks reset?

And it hasn’t caused serious data corruption. (It’s decommissioned for
now, of course!) Now and then ZFS will complain of a checksum failure,
but a scrub fixes it.

Now I’m fighting with IBM (now Lenovo) because all the components were
sourced from them and it’s their call to debug it. Maybe I’ll hook an
oscilloscope to the power rails to check for suspicious transients or
something like that, though. So far their response has been absolutely
unacceptable. They ask for the “RAID vendor”, and they seem unable to
understand that someone might want to run these things with an OS
different than Windows, and without creating RAID volumes with the
built in controller.

Sigh. Maybe I could bribe someone to pose as “RAID vendor” ;)

Feb 12 07:43:59 clientes-ssd8 kernel: (noperiph:mpr0:0:4294967295:0): SMID 33 Aborting command 0xfffffe0000c7baf0
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): WRITE(10). CDB: 2a 00 39 a1 fe f0 00 00 20 00 length 16384 SMID 989 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 20 00 00 18 00 length 12288 SMID 953 terminated ioc 804b scsi 0 state c xfe(da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 40 00 00 20 00
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): CAM status: Command timeout
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 00 00 00 20 00 length 16384 SMID 571 terminated ioc 804b scsi 0 state c xfe(da14:r 0
Feb 12 07:43:59 clientes-ssd8 kernel: mpr0:0: (da14:mpr0:0:40:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 638 te40:rminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): WRITE(10). CDB: 2a 00 39 a1 fe f0 00 00 20 00 length 16384 SMID 818 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 40 00 00 20 00 length 16384 SMID 952 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 20 00 00 18 00 length 12288 SMID 922 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 00 00 00 20 00 length 16384 SMID 823 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): WRITE(10). CDB: 2a 00 39 a1 fe f0 00 00 20 00
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): CAM status: SCSI Status Error
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): SCSI status: Check Condition
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): Retrying command (per sense data)

Borja.
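
P.S. For anyone reading the log_info lines above, here is a minimal Python sketch of how the 32-bit word 0x31120440 appears to split into the fields the driver prints. The bit layout (bus type in the top 4 bits, then a 4-bit originator, an 8-bit code and a 16-bit sub-code) and the originator names are my assumption about the log-info format, not something confirmed in this thread. decode_log_info and LogInfo are hypothetical names used only for illustration.

```python
from typing import NamedTuple

# Assumed mapping of the 4-bit originator field to the names the driver
# prints; only "PL" is actually seen in the log above.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


class LogInfo(NamedTuple):
    bus_type: int
    originator: str
    code: int
    sub_code: int


def decode_log_info(word: int) -> LogInfo:
    """Split a 32-bit log_info word into its (assumed) bit fields."""
    bus_type = (word >> 28) & 0xF      # bits 31-28
    originator = (word >> 24) & 0xF    # bits 27-24
    code = (word >> 16) & 0xFF         # bits 23-16
    sub_code = word & 0xFFFF           # bits 15-0
    name = ORIGINATORS.get(originator, f"unknown({originator})")
    return LogInfo(bus_type, name, code, sub_code)


info = decode_log_info(0x31120440)
print(
    f"bus_type(0x{info.bus_type:x}), originator({info.originator}), "
    f"code(0x{info.code:02x}), sub_code(0x{info.sub_code:04x})"
)
```

With this layout the originator, code and sub_code come out as PL, 0x12 and 0x0440, which matches what the kernel printed, so the assumed field boundaries are at least consistent with this one message.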