Subject: Re: mpr(4) SAS3008 Repeated Crashing
To: Borja Marcos <borjam@sarenet.es>, Scott Long <scott4long@yahoo.com>
References: <56D5FDB8.8040402@freebsd.org> <56D612FA.6090909@multiplay.co.uk>
 <A8859ECA-0B58-42A8-AA49-DF6AA3D52CC6@sarenet.es>
 <E74F5225-1EA8-4B60-ADDC-7B13E1003184@yahoo.com>
 <D7E0BCCE-EB44-4EF9-8F17-474C162F7D7C@sarenet.es>
Cc: FreeBSD-scsi <freebsd-scsi@freebsd.org>
From: Steven Hartland <killing@multiplay.co.uk>
Message-ID: <56D805FD.50500@multiplay.co.uk>
Date: Thu, 3 Mar 2016 09:38:05 +0000
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101
 Thunderbird/38.6.0
MIME-Version: 1.0
In-Reply-To: <D7E0BCCE-EB44-4EF9-8F17-474C162F7D7C@sarenet.es>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: quoted-printable


On 03/03/2016 07:42, Borja Marcos wrote:
>> On 02 Mar 2016, at 19:43, Scott Long <scott4long@yahoo.com> wrote:
>>> I’ve suffered similar problems, although not as severe, on one of my storage servers. It’s an IBM X Series with a LSI 3008 HBA
>>> connected to the backplane, using SATA SSDs. But mine are almost certainly hardware problems. An identical system is working
>>> without issues.
>>>
>>> The symptom: with high I/O activity, for example, running Bonnie++, some commands abort with the disks returning a
>>> unit attention (power on/reset) asc 0,29.
>>>
>> In your case, the UA is actually a secondary effect.  What’s happening is that a command is timing out so the driver is resetting the disk.  That causes the disk to report a UA with an ASC of 29/0 on the next command it gets after it comes back up.  It’s not fatal and I’m not sure if it should actually cause a retry, but that’s an investigation for a different time.  It does produce a lot of noise on the
>> console/log, though.
This sounds similar to what we saw in mfi. While the cause was different,
the real problem was that the error paths in the driver were untested and
buggy, which caused further problems and resulted in panics.

I was lucky, or unlucky depending on your point of view, that the HW
issue we had was very good at triggering pretty much every failure path
in the driver, which allowed me to fix them. Without that, it's really hard
to truly test these code paths, which hardly ever get exercised.
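
For anyone reading the logs, the unit attention Scott describes above can be
recognised from the raw sense bytes. The following is only a minimal sketch in
Python, not driver code: the function names are made up for illustration, and
it relies on the standard SCSI sense layouts (fixed format 0x70/0x71,
descriptor format 0x72/0x73).

```python
# Illustrative sketch only: decode_sense() and is_reset_unit_attention()
# are hypothetical helper names, not part of mpr(4) or CAM.

UNIT_ATTENTION = 0x6   # SCSI sense key for UNIT ATTENTION
ASC_RESET = 0x29       # ASC 29h: power on, reset, or bus device reset occurred


def decode_sense(buf: bytes):
    """Return (sense_key, asc, ascq) from raw SCSI sense data."""
    if not buf:
        raise ValueError("empty sense buffer")
    code = buf[0] & 0x7F  # strip the VALID bit from the response code
    if code in (0x70, 0x71):
        # Fixed format: key in byte 2 (low nibble), ASC/ASCQ in bytes 12/13.
        if len(buf) < 14:
            raise ValueError("fixed-format sense data too short")
        return buf[2] & 0x0F, buf[12], buf[13]
    if code in (0x72, 0x73):
        # Descriptor format: key in byte 1 (low nibble), ASC/ASCQ in bytes 2/3.
        if len(buf) < 4:
            raise ValueError("descriptor-format sense data too short")
        return buf[1] & 0x0F, buf[2], buf[3]
    raise ValueError(f"unknown sense response code 0x{code:02x}")


def is_reset_unit_attention(buf: bytes) -> bool:
    """True if the sense data is a UA from the power on/reset family (ASC 29h)."""
    key, asc, _ascq = decode_sense(buf)
    return key == UNIT_ATTENTION and asc == ASC_RESET
```

A sense buffer that decodes to key 6h with ASC/ASCQ 29h/00h is the case
discussed in this thread, i.e. the first command after the disk came back from
a reset.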
> Hmm. Interesting. It does indeed cause problems, although nothing that a ZFS scrub cannot fix.
>
> So it’s the driver that is resetting the disks? I was assuming that the disks were resetting themselves for some reason.
>
>> One thing I noticed in your log is that one of the commands was a passthrough ATA command of 0x06 and feature of 0x01, which is DSM TRIM.  It’s not clear if this command was at fault, I need to add better logging for this case, but it’s highly suspect.  It was only being asked to trim one sector, but given how unpredictable TRIM responses are from the drive, I don’t know if this matters.  What it might point to, though, is that either the timeout for the command was too short, the drive doesn’t support DSM TRIM that well, or the LSI adapter doesn’t support it well (since it’s not an NCQ command, the LSI firmware would have to remember to flush out the pending NCQ reads and writes first before doing the DSM command).  The default timeout is 60 seconds, which should be enough unless you changed it deliberately.  If this is a reproducible case, would you be willing to re-try with a different delete method, i.e. fiddle with the kern.cam.da.X.delete_method sysctl?
> The server is not in production for now, so I can run experiments on it. I am trying with delete_method=DISABLE. Although using these disks without trim would have
> a performance impact I guess.
>
> What is puzzling is, the “twin” server is working like a charm. Same hardware, same software. We only updated firmwares on the ailing one when we noticed problems,
> just in case.
>
> Actually we’ve been poking the dealer and they are going to send a new one to test. Given how the twin works, the problem should go away.
>
We've seen HW issues before where the first thing to start triggering
the problem was TRIM requests. It seems like it's an afterthought in most
FWs, unfortunately, so it's one of the first things to go bad. I'm not saying
this is your issue, but it's something to keep in mind.
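
To make the "trim one sector" request mentioned earlier concrete, here is a
rough sketch of what the data payload of an ATA DATA SET MANAGEMENT (0x06,
feature 0x01) TRIM looks like. It is an illustration in Python rather than
anything taken from CAM or the mpr(4) driver; the helper names are
hypothetical, and it assumes the usual layout of 8-byte little-endian entries
(48-bit LBA plus 16-bit range length) padded out to 512-byte blocks.

```python
import struct

# Illustrative sketch only: trim_entry() and trim_payload() are hypothetical
# helpers, not functions from FreeBSD's CAM layer.

TRIM_BLOCK_SIZE = 512        # payload is sent in 512-byte blocks
MAX_LBA = (1 << 48) - 1      # 48-bit starting LBA
MAX_RANGE = 0xFFFF           # 16-bit sector count per entry


def trim_entry(lba: int, count: int) -> bytes:
    """Pack one TRIM range: bits 47:0 = LBA, bits 63:48 = sector count."""
    if not 0 <= lba <= MAX_LBA:
        raise ValueError("LBA does not fit in 48 bits")
    if not 0 < count <= MAX_RANGE:
        raise ValueError("range length must be 1..65535 sectors")
    return struct.pack("<Q", (count << 48) | lba)


def trim_payload(ranges) -> bytes:
    """Build the payload for a list of (lba, count) ranges."""
    ranges = list(ranges)
    if not ranges:
        raise ValueError("at least one range is required")
    body = b"".join(trim_entry(lba, count) for lba, count in ranges)
    # Pad with zero bytes to a whole number of 512-byte blocks; entries with
    # a zero range length are ignored by the device.
    return body + bytes(-len(body) % TRIM_BLOCK_SIZE)
```

For a single-sector trim, `trim_payload([(lba, 1)])` yields one 512-byte block
containing a single non-zero entry, so even the smallest request still goes
through the full non-NCQ command path that Scott describes.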

     Regards
     Steve