From: Scott Long
Date: Wed, 2 Mar 2016 11:43:23 -0700
To: Borja Marcos
Cc: Steven Hartland, freebsd-scsi@freebsd.org
Subject: Re: mpr(4) SAS3008 Repeated Crashing
References: <56D5FDB8.8040402@freebsd.org> <56D612FA.6090909@multiplay.co.uk>

> On Mar 2, 2016, at 12:23 AM, Borja Marcos wrote:
>
>> On 01 Mar 2016, at 23:08, Steven Hartland wrote:
>>
>> Initial ideas would be bad signalling.
>>
>> If you have the option to drop the speeds down and that helps, then that is almost certainly the case.
>>
>> The original mfi driver was very bad at recovering from issues like this too. I spent over a month fixing and patching it to get it working reliably when there were hardware-related issues. In my case it turned out to be a dodgy CPU causing memory corruption, but you'll get similar behaviour from badly designed installs, particularly with expanders in play for high speed devices (6-12Gbps) link speed.
>
> I’ve suffered similar problems, although not as severe, on one of my storage servers. It’s an IBM X Series with a LSI 3008 HBA
> connected to the backplane, using SATA SSDs. But mine are almost certainly hardware problems. An identical system is working
> without issues.
>
> The symptom: with high I/O activity, for example, running Bonnie++, some commands abort with the disks returning a
> unit attention (power on/reset) asc 0,29.
>

In your case, the UA is actually a secondary effect. What’s happening is that a command is timing out, so the driver is resetting the disk. That causes the disk to report a UA with an ASC of 29/0 on the next command it gets after it comes back up. It’s not fatal, and I’m not sure if it should actually cause a retry, but that’s an investigation for a different time. It does produce a lot of noise on the console/log, though.

One thing I noticed in your log is that one of the commands was a passthrough ATA command of 0x06 with a feature of 0x01, which is DSM TRIM. It’s not clear if this command was at fault; I need to add better logging for this case, but it’s highly suspect. It was only being asked to trim one sector, but given how unpredictable TRIM responses are from the drive, I don’t know if this matters. What it might point to, though, is that either the timeout for the command was too short, the drive doesn’t support DSM TRIM that well, or the LSI adapter doesn’t support it well (since it’s not an NCQ command, the LSI firmware would have to remember to flush out the pending NCQ reads and writes first before doing the DSM command). The default timeout is 60 seconds, which should be enough unless you changed it deliberately. If this is a reproducible case, would you be willing to re-try with a different delete method, i.e. fiddle with the kern.cam.da.X.delete_method sysctl?

In any case, I doubt that the problem is with cabling. Active backplanes have been known to cause problems with LSI controllers and SATA disks, but the problem reported in your log doesn’t match the typical pattern for that.

Scott
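
For anyone wanting to try the delete_method experiment suggested above, here is a rough sketch of what that could look like. The unit number (0) and the choice of DISABLE are assumptions made for illustration, not values taken from the log in this thread; check da(4) on your release for the methods it accepts. The run helper is a hypothetical wrapper that only prints the commands unless DRY_RUN=0 is set, so nothing is changed by accident.

```shell
#!/bin/sh
# Sketch only: inspect and change the delete method for one da(4) disk.
# UNIT and METHOD are illustrative placeholders, not values from the thread.
UNIT="${UNIT:-0}"
METHOD="${METHOD:-DISABLE}"
OID="kern.cam.da.${UNIT}.delete_method"

# run: hypothetical helper. Prints the command by default and only
# executes it when DRY_RUN=0 is set in the environment.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Show the delete method currently in use for this disk.
run sysctl "$OID"

# Show the default da(4) command timeout, in seconds.
run sysctl kern.cam.da.default_timeout

# Switch to a different delete method for the next test run.
run sysctl "${OID}=${METHOD}"
```

Run it as-is to see the three commands, then again with DRY_RUN=0 as root on the affected machine. If the timeouts stop once TRIM is no longer issued, that would point at the DSM TRIM path rather than at cabling.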