Date: Thu, 3 Mar 2016 09:38:05 +0000
From: Steven Hartland <killing@multiplay.co.uk>
To: Borja Marcos <borjam@sarenet.es>, Scott Long <scott4long@yahoo.com>
Cc: FreeBSD-scsi <freebsd-scsi@freebsd.org>
Subject: Re: mpr(4) SAS3008 Repeated Crashing
Message-ID: <56D805FD.50500@multiplay.co.uk>
In-Reply-To: <D7E0BCCE-EB44-4EF9-8F17-474C162F7D7C@sarenet.es>
References: <56D5FDB8.8040402@freebsd.org> <56D612FA.6090909@multiplay.co.uk> <A8859ECA-0B58-42A8-AA49-DF6AA3D52CC6@sarenet.es> <E74F5225-1EA8-4B60-ADDC-7B13E1003184@yahoo.com> <D7E0BCCE-EB44-4EF9-8F17-474C162F7D7C@sarenet.es>
On 03/03/2016 07:42, Borja Marcos wrote:
>> On 02 Mar 2016, at 19:43, Scott Long <scott4long@yahoo.com> wrote:
>>> I’ve suffered similar problems, although not as severe, on one of my storage servers. It’s an IBM X Series with a LSI 3008 HBA
>>> connected to the backplane, using SATA SSDs. But mine are almost certainly hardware problems. An identical system is working
>>> without issues.
>>>
>>> The symptom: with high I/O activity, for example, running Bonnie++, some commands abort with the disks returning a
>>> unit attention (power on/reset) asc 0,29.
>>>
>> In your case, the UA is actually a secondary effect. What’s happening is that a command is timing out so the driver is resetting the disk. That causes the disk to report a UA with an ASC of 29/0 on the next command it gets after it comes back up. It’s not fatal and I’m not sure if it should actually cause a retry, but that’s an investigation for a different time. It does produce a lot of noise on the
>> console/log, though.

This sounds similar to what we saw in mfi. While the cause was different, the real problem was that the error paths in the driver were untested and buggy, causing more problems and resulting in panics.

I was lucky, or unlucky depending on your point of view, that the HW issue we had was very good at triggering pretty much every failure path in the driver, which allowed me to fix them. Without that it's really hard to truly test these code paths, which hardly ever get exercised.

> Hmm. Interesting. It does indeed cause problems, although nothing that a ZFS scrub cannot fix.
>
> So it’s the driver that is resetting the disks? I was assuming that the disks were resetting themselves for some reason.
>
>> One thing I noticed in your log is that one of the commands was a passthrough ATA command of 0x06 and feature of 0x01, which is DSM TRIM. It’s not clear if this command was at fault, I need to add better logging for this case, but it’s highly suspect. It was only being asked to trim one sector, but given how unpredictable TRIM responses are from the drive, I don’t know if this matters. What it might point to, though, is that either the timeout for the command was too short, the drive doesn’t support DSM TRIM that well, or the LSI adapter doesn’t support it well (since it’s not an NCQ command, the LSI firmware would have to remember to flush out the pending NCQ reads and writes first before doing the DSM command). The default timeout is 60 seconds, which should be enough unless you changed it deliberately. If this is a reproducible case, would you be willing to re-try with a different delete method, i.e. fiddle with the kern.cam.da.X.delete_method sysctl?
> The server is not in production for now, so I can run experiments on it. I am trying with delete_method=DISABLE. Although using these disks without trim would have
> a performance impact I guess.
>
> What is puzzling is, the “twin” server is working like a charm. Same hardware, same software. We only updated firmwares on the ailing one when we noticed problems,
> just in case.
>
> Actually we’ve been poking the dealer and they are going to send a new one to test. Given how the twin works, the problem should go away.
>

We've seen HW issues before where the first thing to start triggering the problem was TRIM requests. It seems like it's an afterthought in most FW's unfortunately, so it's one of the first things to go bad. I'm not saying this is your issue, but it's something to keep in mind.

Regards
Steve
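
Editorial illustration, not part of the original message: the two codes discussed in the thread (a unit attention with ASC/ASCQ 29/0, and the passthrough ATA command 0x06 with feature 0x01) can be decoded with a small lookup. This is a minimal sketch, assuming only the values mentioned above. The function and table names are hypothetical and do not come from the mpr(4) driver or CAM.

```python
# Hypothetical helper names; tables cover only the codes discussed in this thread.

SENSE_KEY_UNIT_ATTENTION = 0x06  # SCSI sense key for UNIT ATTENTION

# (ASC, ASCQ) -> description, per the SCSI additional sense code assignments
ASC_ASCQ_NAMES = {
    (0x29, 0x00): "power on, reset, or bus device reset occurred",
}

ATA_DATA_SET_MANAGEMENT = 0x06  # ATA command opcode (DSM)
DSM_FEATURE_TRIM = 0x01         # TRIM bit in the Features field


def describe_sense(sense_key, asc, ascq):
    """Return a readable string for a sense key / ASC / ASCQ triple."""
    kind = "unit attention" if sense_key == SENSE_KEY_UNIT_ATTENTION else "other"
    detail = ASC_ASCQ_NAMES.get((asc, ascq), "unknown")
    return f"{kind}: {detail} (asc {asc:#04x}, ascq {ascq:#04x})"


def describe_ata(command, features):
    """Return a readable name for an ATA passthrough command, if known."""
    if command == ATA_DATA_SET_MANAGEMENT:
        if features & DSM_FEATURE_TRIM:
            return "DSM TRIM"
        return "DSM (no TRIM bit set)"
    return f"unknown ATA command {command:#04x}"


if __name__ == "__main__":
    print(describe_sense(0x06, 0x29, 0x00))
    print(describe_ata(0x06, 0x01))
```

A real log decoder would need the full ASC/ASCQ and ATA opcode tables. The point here is only that the 29/0 unit attention reports a reset that already happened, which is consistent with it being a secondary effect of the driver resetting the disk after a timeout.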
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?56D805FD.50500>