Date: Thu, 3 Mar 2016 08:42:20 +0100
From: Borja Marcos <borjam@sarenet.es>
To: Scott Long <scott4long@yahoo.com>
Cc: Steven Hartland <killing@multiplay.co.uk>, FreeBSD-scsi <freebsd-scsi@freebsd.org>
Subject: Re: mpr(4) SAS3008 Repeated Crashing
Message-ID: <D7E0BCCE-EB44-4EF9-8F17-474C162F7D7C@sarenet.es>
In-Reply-To: <E74F5225-1EA8-4B60-ADDC-7B13E1003184@yahoo.com>
References: <56D5FDB8.8040402@freebsd.org> <56D612FA.6090909@multiplay.co.uk> <A8859ECA-0B58-42A8-AA49-DF6AA3D52CC6@sarenet.es> <E74F5225-1EA8-4B60-ADDC-7B13E1003184@yahoo.com>

> On 02 Mar 2016, at 19:43, Scott Long <scott4long@yahoo.com> wrote:
>
>> I’ve suffered similar problems, although not as severe, on one of my storage servers. It’s an IBM X Series with a LSI 3008 HBA
>> connected to the backplane, using SATA SSDs. But mine are almost certainly hardware problems. An identical system is working
>> without issues.
>>
>> The symptom: with high I/O activity, for example, running Bonnie++, some commands abort with the disks returning a
>> unit attention (power on/reset) asc 0,29.
>>
>
> In your case, the UA is actually a secondary effect. What’s happening is that a command is timing out so the driver is resetting the disk. That causes the disk to report a UA with an ASC of 29/0 on the next command it gets after it comes back up. It’s not fatal and I’m not sure if it should actually cause a retry, but that’s an investigation for a different time. It does produce a lot of noise on the
> console/log, though.

Hmm. Interesting. It does indeed cause problems, although nothing that a ZFS scrub cannot fix.

So it’s the driver that is resetting the disks? I was assuming that the disks were resetting themselves for some reason.

> One thing I noticed in your log is that one of the commands was a passthrough ATA command of 0x06 and feature of 0x01, which is DSM TRIM. It’s not clear if this command was at fault, I need to add better logging for this case, but it’s highly suspect. It was only being asked to trim one sector, but given how unpredictable TRIM responses are from the drive, I don’t know if this matters. What it might point to, though, is that either the timeout for the command was too short, the drive doesn’t support DSM TRIM that well, or the LSI adapter doesn’t support it well (since it’s not an NCQ command, the LSI firmware would have to remember to flush out the pending NCQ reads and writes first before doing the DSM command). The default timeout is 60 seconds, which should be enough unless you changed it deliberately. If this is a reproducible case, would you be willing to re-try with a different delete method, i.e. fiddle with the kern.cam.da.X.delete_method sysctl?

The server is not in production for now, so I can run experiments on it. I am trying with delete_method=DISABLE. Although using these disks without trim would have a performance impact I guess.

What is puzzling is, the “twin” server is working like a charm. Same hardware, same software. We only updated firmwares on the ailing one when we noticed problems, just in case.

Actually we’ve been poking the dealer and they are going to send a new one to test. Given how the twin works, the problem should go away.

Thanks!

Borja.
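
For reference, the delete_method experiment discussed above could be scripted roughly as follows. This is only a sketch: the helper name `delete_method_command`, the four-disk loop, and the unit numbers are hypothetical, and the list of method names is what I believe da(4) accepts, so check it against `sysctl -d kern.cam.da.0.delete_method` on the actual system. The script only builds the command lines and does not run them.

```python
# Delete methods believed to be accepted by FreeBSD's da(4) driver.
# Assumption: verify this list on the target system before relying on it.
DELETE_METHODS = ("NONE", "DISABLE", "ATA_TRIM", "UNMAP", "WS16", "WS10", "ZERO")


def delete_method_command(unit: int, method: str) -> str:
    """Return the sysctl command line that selects `method` for da<unit>."""
    if unit < 0:
        raise ValueError("unit must be non-negative")
    if method not in DELETE_METHODS:
        raise ValueError(f"unknown delete method: {method!r}")
    return f"sysctl kern.cam.da.{unit}.delete_method={method}"


if __name__ == "__main__":
    # Hypothetical pool of four disks, da0 through da3.
    for unit in range(4):
        print(delete_method_command(unit, "DISABLE"))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; on the test server the output can be reviewed and then pasted into a root shell.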