Date:      Wed, 2 Mar 2016 11:43:23 -0700
From:      Scott Long <scott4long@yahoo.com>
To:        Borja Marcos <borjam@sarenet.es>
Cc:        Steven Hartland <killing@multiplay.co.uk>, freebsd-scsi@freebsd.org
Subject:   Re: mpr(4) SAS3008 Repeated Crashing
Message-ID:  <E74F5225-1EA8-4B60-ADDC-7B13E1003184@yahoo.com>
In-Reply-To: <A8859ECA-0B58-42A8-AA49-DF6AA3D52CC6@sarenet.es>
References:  <56D5FDB8.8040402@freebsd.org> <56D612FA.6090909@multiplay.co.uk> <A8859ECA-0B58-42A8-AA49-DF6AA3D52CC6@sarenet.es>


> On Mar 2, 2016, at 12:23 AM, Borja Marcos <borjam@sarenet.es> wrote:
>
>> On 01 Mar 2016, at 23:08, Steven Hartland <killing@multiplay.co.uk> wrote:
>>
>> Initial ideas would be bad signalling.
>>
>> If you have the option to drop the speeds down and that helps, then
>> that is almost certainly the case.
>>
>> The original mfi driver was very bad at recovering from issues like
>> this too. I spent over a month fixing and patching it to get it working
>> reliably when there were hardware-related issues. In my case it turned
>> out to be a dodgy CPU causing memory corruption, but you'll get similar
>> behaviour from badly designed installs, particularly with expanders in
>> play for high-speed devices (6-12Gbps link speed).
>
> I’ve suffered similar problems, although not as severe, on one of my
> storage servers. It’s an IBM X Series with an LSI 3008 HBA
> connected to the backplane, using SATA SSDs. But mine are almost
> certainly hardware problems. An identical system is working
> without issues.
>
> The symptom: with high I/O activity, for example, running Bonnie++,
> some commands abort with the disks returning a
> unit attention (power on/reset) asc 0,29.
>

In your case, the UA is actually a secondary effect.  What’s happening
is that a command is timing out, so the driver is resetting the disk.
That causes the disk to report a UA with an ASC/ASCQ of 29/0 on the
next command it gets after it comes back up.  It’s not fatal, and I’m
not sure if it should actually cause a retry, but that’s an
investigation for a different time.  It does produce a lot of noise on
the console/log, though.
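
For illustration only, the sense data discussed above can be decoded with
a small lookup. This is a minimal sketch and not code from the mpr(4) or
CAM sources: the table covers only a handful of standard SCSI sense keys
and the single ASC/ASCQ pair mentioned here, and the function names
(describe_sense, is_reset_unit_attention) are hypothetical.

```python
# Partial, illustrative tables. The values are standard SCSI codes,
# but this is not a complete list and not the CAM implementation.
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x2: "NOT READY",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}

ASC_ASCQ = {
    (0x29, 0x00): "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED",
}


def describe_sense(key, asc, ascq):
    """Return a one-line description of a sense key / ASC / ASCQ triple."""
    key_name = SENSE_KEYS.get(key, "UNKNOWN KEY 0x%x" % key)
    text = ASC_ASCQ.get((asc, ascq), "unknown")
    return "%s asc:%x,%x (%s)" % (key_name, asc, ascq, text)


def is_reset_unit_attention(key, asc):
    """True when the sense data only reports that the device was reset."""
    return key == 0x6 and asc == 0x29


if __name__ == "__main__":
    print(describe_sense(0x6, 0x29, 0x00))
```

A check like is_reset_unit_attention() is one way a log filter could
separate the follow-on UA noise from the original command timeout.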

One thing I noticed in your log is that one of the commands was a
passthrough ATA command of 0x06 with a feature of 0x01, which is DSM
TRIM.  It’s not clear if this command was at fault (I need to add better
logging for this case), but it’s highly suspect.  It was only being
asked to trim one sector, but given how unpredictable TRIM responses
are from the drive, I don’t know if this matters.  What it might point
to, though, is that either the timeout for the command was too short,
the drive doesn’t support DSM TRIM that well, or the LSI adapter
doesn’t support it well (since it’s not an NCQ command, the LSI
firmware would have to remember to flush out the pending NCQ reads and
writes first before doing the DSM command).  The default timeout is 60
seconds, which should be enough unless you changed it deliberately.  If
this is a reproducible case, would you be willing to retry with a
different delete method, i.e. fiddle with the
kern.cam.da.X.delete_method sysctl?
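
As a rough sketch of the suggested experiment, the helper below builds
the sysctl(8) invocation for a given da unit. The helper name is
hypothetical, and the list of method names is an assumption based on
da(4); it may differ between FreeBSD releases, so check the man page on
the system in question before relying on it.

```python
# Assumed set of delete methods, taken from da(4); verify locally.
ASSUMED_DELETE_METHODS = (
    "NONE", "DISABLE", "ATA_TRIM", "UNMAP", "WS16", "WS10", "ZERO",
)


def delete_method_sysctl(unit, method):
    """Build the argv that sets kern.cam.da.<unit>.delete_method."""
    if unit < 0:
        raise ValueError("unit must be non-negative")
    if method not in ASSUMED_DELETE_METHODS:
        raise ValueError("unrecognised delete method: %r" % (method,))
    return ["sysctl", "kern.cam.da.%d.delete_method=%s" % (unit, method)]


if __name__ == "__main__":
    # Prints the command instead of running it; setting the sysctl
    # requires root on the FreeBSD host.
    print(" ".join(delete_method_sysctl(0, "DISABLE")))
```

On the affected machine the returned list could be passed to
subprocess.run() as root; disabling deletes for the duration of a
Bonnie++ run is one way to see whether the timeouts follow the TRIM
command.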

In any case, I doubt that the problem is with cabling.  Active
backplanes have been known to cause problems with LSI controllers and
SATA disks, but the problem reported in your log doesn’t match the
typical pattern for that.

Scott



