Date: Wed, 2 Mar 2016 11:43:23 -0700
From: Scott Long <scott4long@yahoo.com>
To: Borja Marcos <borjam@sarenet.es>
Cc: Steven Hartland <killing@multiplay.co.uk>, freebsd-scsi@freebsd.org
Subject: Re: mpr(4) SAS3008 Repeated Crashing
Message-ID: <E74F5225-1EA8-4B60-ADDC-7B13E1003184@yahoo.com>
In-Reply-To: <A8859ECA-0B58-42A8-AA49-DF6AA3D52CC6@sarenet.es>
References: <56D5FDB8.8040402@freebsd.org> <56D612FA.6090909@multiplay.co.uk> <A8859ECA-0B58-42A8-AA49-DF6AA3D52CC6@sarenet.es>
> On Mar 2, 2016, at 12:23 AM, Borja Marcos <borjam@sarenet.es> wrote:
>
>
>> On 01 Mar 2016, at 23:08, Steven Hartland <killing@multiplay.co.uk> wrote:
>>
>> Initial ideas would be bad signalling.
>>
>> If you have the option to drop the speeds down and that helps then that is almost certainly the case.
>>
>> The original mfi driver was very bad at recovering from issues like this too, I spent over a month fixing and patching it to get it working reliably when there were hardware related issues. In my case it turned out to be a dodgy CPU causing memory corruption but you'll get similar behaviour from badly designed installs, particularly with expanders in play for high speed devices (6-12Gbps) link speed.
>
> I’ve suffered similar problems, although not as severe, on one of my storage servers. It’s an IBM X Series with a LSI 3008 HBA
> connected to the backplane, using SATA SSDs. But mine are almost certainly hardware problems. An identical system is working
> without issues.
>
> The symptom: with high I/O activity, for example, running Bonnie++, some commands abort with the disks returning a
> unit attention (power on/reset) asc 0,29.
>

In your case, the UA is actually a secondary effect. What’s happening is that a command is timing out, so the driver is resetting the disk. That causes the disk to report a UA with an ASC of 29/0 on the next command it gets after it comes back up. It’s not fatal and I’m not sure if it should actually cause a retry, but that’s an investigation for a different time. It does produce a lot of noise on the console/log, though.

One thing I noticed in your log is that one of the commands was a passthrough ATA command of 0x06 and feature of 0x01, which is DSM TRIM. It’s not clear if this command was at fault; I need to add better logging for this case, but it’s highly suspect. It was only being asked to trim one sector, but given how unpredictable TRIM responses are from the drive, I don’t know if this matters. What it might point to, though, is that either the timeout for the command was too short, the drive doesn’t support DSM TRIM that well, or the LSI adapter doesn’t support it well (since it’s not an NCQ command, the LSI firmware would have to remember to flush out the pending NCQ reads and writes first before doing the DSM command). The default timeout is 60 seconds, which should be enough unless you changed it deliberately. If this is a reproducible case, would you be willing to re-try with a different delete method, i.e. fiddle with the kern.cam.da.X.delete_method sysctl?

In any case, I doubt that the problem is with cabling. Active backplanes have been known to cause problems with LSI controllers and SATA disks, but the problem reported in your log doesn’t match the typical pattern for that.

Scott
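
For anyone wanting to try the delete_method suggestion above, here is a minimal sketch of how the sysctl name and command lines could be assembled for a given da unit. It only builds the commands and does not run them. The helper names are hypothetical, and the list of method values is an assumption based on what da(4) has accepted in the releases I know of, so check it against your own system before relying on it.

```python
import shlex

# Assumption: delete methods accepted by da(4). Verify on your release,
# for example by reading the current value of the sysctl first.
ASSUMED_DELETE_METHODS = (
    "NONE", "DISABLE", "ATA_TRIM", "UNMAP", "WS16", "WS10", "ZERO",
)


def delete_method_oid(unit):
    """Return the sysctl name for daN, i.e. the X in kern.cam.da.X.delete_method."""
    if unit < 0:
        raise ValueError("da unit number must be non-negative")
    return f"kern.cam.da.{unit}.delete_method"


def sysctl_get_cmd(unit):
    """Argument vector that reads the current delete method for daN."""
    return ["sysctl", delete_method_oid(unit)]


def sysctl_set_cmd(unit, method):
    """Argument vector that switches daN to another delete method."""
    method = method.upper()
    if method not in ASSUMED_DELETE_METHODS:
        raise ValueError(f"unrecognised delete method: {method}")
    return ["sysctl", f"{delete_method_oid(unit)}={method}"]


if __name__ == "__main__":
    # Print the commands for da0 rather than executing them.
    print(shlex.join(sysctl_get_cmd(0)))
    print(shlex.join(sysctl_set_cmd(0, "DISABLE")))
```

Re-running the same Bonnie++ load after switching methods (DISABLE should, as far as I know, take TRIM out of the picture entirely) would show whether the timeouts follow the DSM TRIM command.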