From owner-freebsd-scsi@freebsd.org Sun Apr 24 13:35:49 2016
From: Dan Langille
Subject: terminated ioc 804b scsi 0 state c xfer 0
Message-Id: <2E8752E5-76AF-4042-86D9-8C6733658A80@langille.org>
Date: Sun, 24 Apr 2016 09:35:41 -0400
To: freebsd-scsi@freebsd.org

More of the pasted output is also at https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae and added smartctl output.

I have a FreeBSD 10.2-RELEASE-p14 box in which there is an LSI SAS2008 card. It's running a zfs root system.

This morning the system was unresponsive via ssh. Attempts to log in at the console did not yield a password prompt.

A power cycle brought the system online. Inspecting /var/log/messages, I found about 63,000 entries similar to those which appear below.

zpool status of all are OK. A scrub is in progress for one pool (since before this issue arose). da7 is in that pool.

Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8d 90 c6 18 00 00 10 00 length 8192 SMID 774 terminated ioc 804b scsi 0 state c xfer 0
Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 length 16384 SMID 614 terminated ioc 804b scsi 0 state c xfer 0
Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 50 00 00 20 00 length 16384 SMID 792 terminated ioc 804b scsi 0 state c xfer 0
Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 08 00 00 20 00 length 16384 SMID 974 terminated ioc 804b scsi 0 state c xfer 0
Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b 6f ef 50 00 00 08 00 length 4096 SMID 674 terminated ioc 804b scsi 0 state c xfer 0
Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 8b 0f a2 48 00 00 18 00 length 12288 SMID 177 terminated ioc 804b scsi 0 state c xfer 12288
Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 ab 8f a1 38 00 00 08 00 length 4096 SMID 908 terminated ioc 804b scsi 0 state c xfer 0
Apr 24 11:25:56 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 length 16384 SMID 376 terminated ioc 804b scsi 0 state c xfer 0
Apr 24 11:25:56 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 50 00 00 20 00 length 16384 SMID 172 terminated ioc 804b scsi 0 state c xfer 0

Is this a cabling issue? The drive is a SATA device (smartctl output in the URL above). Anyone familiar with these errors?

-- 
Dan Langille - BSDCan / PGCon
dan@langille.org

From owner-freebsd-scsi@freebsd.org Sun Apr 24 14:16:50 2016
In-Reply-To: <2E8752E5-76AF-4042-86D9-8C6733658A80@langille.org>
References: <2E8752E5-76AF-4042-86D9-8C6733658A80@langille.org>
Date: Sun, 24 Apr 2016 08:16:49 -0600
Subject: Re: terminated ioc 804b scsi 0 state c xfer 0
From: Alan Somers
To: Dan Langille
Cc: FreeBSD-scsi, slm@freebsd.org

On Sun, Apr 24, 2016 at 7:35 AM, Dan Langille wrote:

> More of the pasted output is also at https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae and added smartctl output.
>
> I have a FreeBSD 10.2-RELEASE-p14 box in which there is an LSI SAS2008 card. It's running a zfs root system.
>
> This morning the system was unresponsive via ssh. Attempts to log in at the console did not yield a password prompt.
>
> A power cycle brought the system online. Inspecting /var/log/messages, I found about 63,000 entries similar to those which appear below.
>
> zpool status of all are OK. A scrub is in progress for one pool (since before this issue arose). da7 is in that pool.
>
> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8d 90 c6 18 00 00 10 00 length 8192 SMID 774 terminated ioc 804b scsi 0 state c xfer 0
> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 length 16384 SMID 614 terminated ioc 804b scsi 0 state c xfer 0
> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 50 00 00 20 00 length 16384 SMID 792 terminated ioc 804b scsi 0 state c xfer 0
> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 08 00 00 20 00 length 16384 SMID 974 terminated ioc 804b scsi 0 state c xfer 0
> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b 6f ef 50 00 00 08 00 length 4096 SMID 674 terminated ioc 804b scsi 0 state c xfer 0
> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 8b 0f a2 48 00 00 18 00 length 12288 SMID 177 terminated ioc 804b scsi 0 state c xfer 12288
> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 ab 8f a1 38 00 00 08 00 length 4096 SMID 908 terminated ioc 804b scsi 0 state c xfer 0
> Apr 24 11:25:56 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 length 16384 SMID 376 terminated ioc 804b scsi 0 state c xfer 0
> Apr 24 11:25:56 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 50 00 00 20 00 length 16384 SMID 172 terminated ioc 804b scsi 0 state c xfer 0
>
> Is this a cabling issue? The drive is a SATA device (smartctl output in the URL above). Anyone familiar with these errors?
>
> --
> Dan Langille - BSDCan / PGCon
> dan@langille.org

"terminated ioc" means that the HBA decided to terminate the command. "804b" is an LSI internal code. Steve (CC'd) might be able to make sense of it. I doubt there's anything wrong with your cabling, but if a power cycle fixed the problem, you might've had a firmware crash in the HDD, HBA, or expander.

-Alan

From owner-freebsd-scsi@freebsd.org Mon Apr 25 08:36:25 2016
Subject: Re: mpr(4) SAS3008 Repeated Crashing, LSI's spiritual advice would be appreciated
From: Borja Marcos
In-Reply-To: <56D96C84.7070507@multiplay.co.uk>
Date: Mon, 25 Apr 2016 10:29:04 +0200
Cc: Scott Long, FreeBSD-scsi
Message-Id: <610C4F08-C1A4-4AB4-87B3-1253C45F8C38@sarenet.es>
References: <56D5FDB8.8040402@freebsd.org> <56D612FA.6090909@multiplay.co.uk> <56D805FD.50500@multiplay.co.uk> <56D95266.301@multiplay.co.uk> <56D96C84.7070507@multiplay.co.uk>
To: Steven Hartland

> On 04 Mar 2016, at 12:07, Steven Hartland wrote:
>
> On 04/03/2016 10:58, Borja Marcos wrote:
>>> On 04 Mar 2016, at 10:16, Steven Hartland wrote:
>>>
>>> It's very rare but we've also seen this type of behaviour from a failing Intel CPU. There was no other indication the CPU had an issue, which one might expect, so just wanted to make you aware of the possibility.
>>>
>>> That said the most common cause of this we've seen, when it's not a common disk or disks, is a bad backplane or cabling to the backplane.
>> Now I’m really curious!
>>
>> How did you determine that it was the CPU? And what kind of issue was it causing? Noise in the power rails? Interference?
> After a month or so of fixing mfi so it recovered from all bad events and prevented all the various kernel panics, the machine stayed running long enough to log an MCA which pointed to a failing CPU cache.
>
> We were lucky it was CPU #2 so we disabled all cores for said CPU in /boot/loader.conf and all the issues disappeared. We replaced the CPU and no more issues.
>
> We were in the same situation as you, two machines identical configs, one which was constantly panicking in mfi the other was rock solid.

An update, long due. After the complete inaction by IBM’s so-called “support”, who blamed us for using non-official operating systems, we complained quite loudly (and harshly) and they agreed to “replace a backplane for mere reasons of customer satisfaction”, despite my insisting that they also bring an HBA because we really didn’t know what was wrong.

So they sent a technician with one of the three almost passive boards of the backplane, even though I told them that the issue was spread among the 24 disks, not just a group of 8. He changed one of them at random (I was on vacation when he came) and, as I imagined, the issue wasn’t solved at all.

Tired of dealing with them I pulled the SAS3 HBA and installed a classic LSI2008 card.
A nightmare in itself, because the stupid firmware of the IBM hangs during boot (“connecting RAID adapters and boot devices” or something like that, I left it like that for 24 hours just to see if it eventually exited the loop). I had to erase the boot services flash from the HBA even though I had already disabled BIOS and UEFI services for the riser PCI card. Anyway I digress.

Repeating all of our tests, with the LSI2008 card everything works like a charm, although I’ve seen some surprising behavior. I spent a lot of time running benchmarks. I could repeat the error condition in less than an hour fairly reliably with the LSI3008 card, and I was unable to reproduce the error with the LSI2008. Of course, these days this is the most sure you can be, unless someone presents you with a proper oscilloscope and SAS pod. I even suggested that to IBM, offering to do a serious diagnosis of the problem for them ;)

The odd behavior, for which LSI’s spiritual advice would be welcome, is this: 6 minutes after booting the system, while doing a scrub in order to generate I/O load, and before beginning to run the error triggering benchmarks, I saw some surprising messages on /var/log/messages:

------
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da15,pass16: Element descriptor: 'SLOT 000'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da15,pass16: SAS Device Slot Element: 1 Phys at Slot 0, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd99
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da16,pass17: Element descriptor: 'SLOT 001'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da16,pass17: SAS Device Slot Element: 1 Phys at Slot 1, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9a
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da17,pass18: Element descriptor: 'SLOT 002'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da17,pass18: SAS Device Slot Element: 1 Phys at Slot 2, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9b
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da18,pass19: Element descriptor: 'SLOT 003'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da18,pass19: SAS Device Slot Element: 1 Phys at Slot 3, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9c
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da19,pass20: Element descriptor: 'SLOT 004'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da19,pass20: SAS Device Slot Element: 1 Phys at Slot 4, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9d
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da20,pass21: Element descriptor: 'SLOT 005'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da20,pass21: SAS Device Slot Element: 1 Phys at Slot 5, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9e
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da21,pass22: Element descriptor: 'SLOT 006'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da21,pass22: SAS Device Slot Element: 1 Phys at Slot 6, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9f
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da22,pass23: Element descriptor: 'SLOT 007'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da22,pass23: SAS Device Slot Element: 1 Phys at Slot 7, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fda0
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da7,pass8: Element descriptor: 'SLOT 008'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da7,pass8: SAS Device Slot Element: 1 Phys at Slot 8, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd91
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da8,pass9: Element descriptor: 'SLOT 009'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da8,pass9: SAS Device Slot Element: 1 Phys at Slot 9, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd92
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da9,pass10: Element descriptor: 'SLOT 010'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da9,pass10: SAS Device Slot Element: 1 Phys at Slot 10, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd93
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da10,pass11: Element descriptor: 'SLOT 011'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da10,pass11: SAS Device Slot Element: 1 Phys at Slot 11, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd94
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da11,pass12: Element descriptor: 'SLOT 012'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da11,pass12: SAS Device Slot Element: 1 Phys at Slot 12, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd95
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da12,pass13: Element descriptor: 'SLOT 013'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da12,pass13: SAS Device Slot Element: 1 Phys at Slot 13, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd96
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da13,pass14: Element descriptor: 'SLOT 014'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da13,pass14: SAS Device Slot Element: 1 Phys at Slot 14, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd97
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da14,pass15: Element descriptor: 'SLOT 015'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da14,pass15: SAS Device Slot Element: 1 Phys at Slot 15, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd98
------

And at 17:41, something similar:

------
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da0,pass0: Element descriptor: 'SLOT 016'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da0,pass0: SAS Device Slot Element: 1 Phys at Slot 16, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d721
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da1,pass1: Element descriptor: 'SLOT 017'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da1,pass1: SAS Device Slot Element: 1 Phys at Slot 17, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d722
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da2,pass2: Element descriptor: 'SLOT 018'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da2,pass2: SAS Device Slot Element: 1 Phys at Slot 18, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d723
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da3,pass3: Element descriptor: 'SLOT 019'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da3,pass3: SAS Device Slot Element: 1 Phys at Slot 19, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d724
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da4,pass4: Element descriptor: 'SLOT 020'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da4,pass4: SAS Device Slot Element: 1 Phys at Slot 20, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d725
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da5,pass5: Element descriptor: 'SLOT 021'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da5,pass5: SAS Device Slot Element: 1 Phys at Slot 21, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d726
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da6,pass6: Element descriptor: 'SLOT 022'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da6,pass6: SAS Device Slot Element: 1 Phys at Slot 22, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d727
------

After those events I did a scrub just in case, and no errors were found.
Can it be some expander oddity that somehow confused the LSI3008 and not the LSI2008?

The system is working like a charm anyway, but I wonder if there’s some non-obvious problem waiting to become a time bomb.

Regarding IBM, well, unless we can fix this, the expensive piece of hardware will be scrapped. And I really doubt any piece of kit from IBM/Lenovo (seems that Lenovo is in charge of support for these servers now) will be purchased here on my watch, ever.

Thanks,

Borja.

From owner-freebsd-scsi@freebsd.org Mon Apr 25 09:37:28 2016
Subject: Re: mpr(4) SAS3008 Repeated Crashing, LSI's spiritual advice would be appreciated
To: Borja Marcos
References: <56D5FDB8.8040402@freebsd.org> <56D612FA.6090909@multiplay.co.uk> <56D805FD.50500@multiplay.co.uk> <56D95266.301@multiplay.co.uk> <56D96C84.7070507@multiplay.co.uk> <610C4F08-C1A4-4AB4-87B3-1253C45F8C38@sarenet.es>
Cc: Scott Long, FreeBSD-scsi
From: Steven Hartland
Message-ID: <571DE557.3020305@multiplay.co.uk>
Date: Mon, 25 Apr 2016 10:37:27 +0100
In-Reply-To: <610C4F08-C1A4-4AB4-87B3-1253C45F8C38@sarenet.es>

On 25/04/2016
09:29, Borja Marcos wrote:
>> On 04 Mar 2016, at 12:07, Steven Hartland wrote:
>>
>> On 04/03/2016 10:58, Borja Marcos wrote:
>>>> On 04 Mar 2016, at 10:16, Steven Hartland wrote:
>>>>
>>>> It's very rare but we've also seen this type of behaviour from a failing Intel CPU. There was no other indication the CPU had an issue, which one might expect, so just wanted to make you aware of the possibility.
>>>>
>>>> That said the most common cause of this we've seen, when it's not a common disk or disks, is a bad backplane or cabling to the backplane.
>>> Now I’m really curious!
>>>
>>> How did you determine that it was the CPU? And what kind of issue was it causing? Noise in the power rails? Interference?
>> After a month or so of fixing mfi so it recovered from all bad events and prevented all the various kernel panics, the machine stayed running long enough to log an MCA which pointed to a failing CPU cache.
>>
>> We were lucky it was CPU #2 so we disabled all cores for said CPU in /boot/loader.conf and all the issues disappeared. We replaced the CPU and no more issues.
>>
>> We were in the same situation as you, two machines identical configs, one which was constantly panicking in mfi the other was rock solid.
> An update, long due. After the complete inaction by IBM’s so-called “support”, who blamed us for using non-official operating systems, we complained quite loudly (and harshly) and they agreed to “replace a backplane for mere reasons of customer satisfaction”, despite my insisting that they also bring an HBA because we really didn’t know what was wrong.
>
> So they sent a technician with one of the three almost passive boards of the backplane, even though I told them that the issue was spread among the 24 disks, not just a group of 8. He changed one of them at random (I was on vacation when he came) and, as I imagined, the issue wasn’t solved at all.
>
> Tired of dealing with them I pulled the SAS3 HBA and installed a classic LSI2008 card. A nightmare in itself, because the stupid firmware of the IBM hangs during boot (“connecting RAID adapters and boot devices” or something like that, I left it like that for 24 hours just to see if it eventually exited the loop). I had to erase the boot services flash from the HBA even though I had already disabled BIOS and UEFI services for the riser PCI card. Anyway I digress.
>
> Repeating all of our tests, with the LSI2008 card everything works like a charm, although I’ve seen some surprising behavior. I spent a lot of time running benchmarks. I could repeat the error condition in less than an hour fairly reliably with the LSI3008 card, and I was unable to reproduce the error with the LSI2008. Of course, these days this is the most sure you can be, unless someone presents you with a proper oscilloscope and SAS pod. I even suggested that to IBM, offering to do a serious diagnosis of the problem for them ;)
>
> The odd behavior, for which LSI’s spiritual advice would be welcome, is this: 6 minutes after booting the system, while doing a scrub in order to generate I/O load, and before beginning to run the error triggering benchmarks, I saw some surprising messages on /var/log/messages:
>
> ------
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da15,pass16: Element descriptor: 'SLOT 000'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da15,pass16: SAS Device Slot Element: 1 Phys at Slot 0, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd99
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da16,pass17: Element descriptor: 'SLOT 001'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da16,pass17: SAS Device Slot Element: 1 Phys at Slot 1, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9a
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da17,pass18: Element descriptor: 'SLOT 002'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da17,pass18: SAS Device Slot Element: 1 Phys at Slot 2, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9b
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da18,pass19: Element descriptor: 'SLOT 003'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da18,pass19: SAS Device Slot Element: 1 Phys at Slot 3, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9c
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da19,pass20: Element descriptor: 'SLOT 004'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da19,pass20: SAS Device Slot Element: 1 Phys at Slot 4, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9d
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da20,pass21: Element descriptor: 'SLOT 005'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da20,pass21: SAS Device Slot Element: 1 Phys at Slot 5, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9e
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da21,pass22: Element descriptor: 'SLOT 006'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da21,pass22: SAS Device Slot Element: 1 Phys at Slot 6, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9f
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da22,pass23: Element descriptor: 'SLOT 007'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da22,pass23: SAS Device Slot Element: 1 Phys at Slot 7, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fda0
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da7,pass8: Element descriptor: 'SLOT 008'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da7,pass8: SAS Device Slot Element: 1 Phys at Slot 8, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd91
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da8,pass9: Element descriptor: 'SLOT 009'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da8,pass9: SAS Device Slot Element: 1 Phys at Slot 9, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd92
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da9,pass10: Element descriptor: 'SLOT 010'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da9,pass10: SAS Device Slot Element: 1 Phys at Slot 10, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd93
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da10,pass11: Element descriptor: 'SLOT 011'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da10,pass11: SAS Device Slot Element: 1 Phys at Slot 11, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd94
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da11,pass12: Element descriptor: 'SLOT 012'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da11,pass12: SAS Device Slot Element: 1 Phys at Slot 12, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd95
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da12,pass13: Element descriptor: 'SLOT 013'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da12,pass13: SAS Device Slot Element: 1 Phys at Slot 13, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd96
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da13,pass14: Element descriptor: 'SLOT 014'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da13,pass14: SAS Device Slot Element: 1 Phys at Slot 14, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd97
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da14,pass15: Element descriptor: 'SLOT 015'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da14,pass15: SAS Device Slot Element: 1 Phys at Slot 15, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd98
> ------
>
> And at 17:41, something similar:
>
> ------
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da0,pass0: Element descriptor: 'SLOT 016'
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da0,pass0: SAS Device Slot Element: 1 Phys at Slot 16, Not All Phys
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d721
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da1,pass1: Element descriptor: 'SLOT 017'
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da1,pass1: SAS Device Slot Element: 1 Phys at Slot 17, Not All Phys
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d722
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0:
da2,pass2: Element descriptor: 'SLOT 018' > Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da2,pass2: SAS Device Slot Element: 1 Phys at Slot 18, Not All Phys > Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device > Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d723 > Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da3,pass3: Element descriptor: 'SLOT 019' > Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da3,pass3: SAS Device Slot Element: 1 Phys at Slot 19, Not All Phys > Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device > Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d724 > Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da4,pass4: Element descriptor: 'SLOT 020' > Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da4,pass4: SAS Device Slot Element: 1 Phys at Slot 20, Not All Phys > Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device > Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d725 > Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da5,pass5: Element descriptor: 'SLOT 021' > Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da5,pass5: SAS Device Slot Element: 1 Phys at Slot 21, Not All Phys > Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device > Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d726 > Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da6,pass6: Element descriptor: 'SLOT 022' > Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da6,pass6: SAS Device Slot Element: 1 Phys at Slot 22, Not All Phys > Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device > Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d727 > > ——————— > > > After those events I did a scrub just in case, and no errors were found. Can it be some expander oddity that somewhat > confused the LSI3008 and not the LSI2008? 
>
> The system is working like a charm anyway, but I wonder if there’s some non-obvious problem waiting to become a time bomb.
>
> Regarding IBM, well, unless we can fix this, the expensive piece of hardware will be scrapped. And I really doubt
> any piece of kit from IBM/Lenovo (seems that Lenovo is in charge of support for these servers now) will be purchased here on
> my watch, ever.
>

The 2008 is a 6Gbps component and the 3008 is a 12Gbps one, so if you have 12Gbps-capable devices it's quite possible that where a 2008 works fine, negotiating at 6Gbps, the 3008 could fail at 12Gbps due to the tighter tolerances required from all components.

We had similar issues when chassis first started moving from 3Gbps to 6Gbps. In fact, we found that Dell shipped drives with amended firmware that limited their negotiation speed to 3Gbps specifically to work around signalling issues in their chassis, even though they advertised them as 6Gbps compatible.

Regards
Steve

From owner-freebsd-scsi@freebsd.org Mon Apr 25 12:17:46 2016
Return-Path:
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 5023AB1B370 for ; Mon, 25 Apr 2016 12:17:46 +0000 (UTC) (envelope-from dan@langille.org)
Received: from clavin1.langille.org (clavin.langille.org [162.208.116.86]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "clavin.langille.org", Issuer "StartCom Class 2 Primary Intermediate Server CA" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 2F4FB1F2E for ; Mon, 25 Apr 2016 12:17:45 +0000 (UTC) (envelope-from dan@langille.org)
Received: from (clavin1.int.langille.org (clavin1.int.unixathome.org [10.4.7.7]) (Authenticated sender: hidden) with ESMTPSA id 43D373D2 for ; Mon, 25 Apr 2016 12:17:30 +0000 (UTC)
From: Dan Langille
Message-Id: <5EEF0794-B06E-4A72-89DA-7DCD94AE1FC6@langille.org>
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
Subject: Re: terminated ioc 804b scsi 0 state c xfer 0
Date: Mon, 25 Apr 2016 08:17:30 -0400
References: <2E8752E5-76AF-4042-86D9-8C6733658A80@langille.org>
To: freebsd-scsi@freebsd.org
In-Reply-To: <2E8752E5-76AF-4042-86D9-8C6733658A80@langille.org>
X-Mailer: Apple Mail (2.3124)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: quoted-printable
X-Content-Filtered-By: Mailman/MimeDel 2.1.21
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: SCSI subsystem
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 25 Apr 2016 12:17:46 -0000

This morning:

13410079654596185797 REMOVED 0 0 0 was /dev/da7p3

At least I know I'm looking for Serial Number: 13Q8PNBYS

From the logs:

Apr 25 05:34:50 knew kernel: da7 at mps1 bus 0 scbus1 target 17 lun 0
Apr 25 05:34:50 knew kernel: da7: s/n 13Q8PNBYS detached
Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 53 e0 00 00 08 00 length 4096 SMID 88 terminated ioc 804b scsi 0 state c xfer 0
Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 26 f8 00 00 20 00 length 16384 SMID 204 terminated ioc 804b scsi 0 state c xfer(da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 53 e0 00 00 08 00
Apr 25 05:34:51 knew kernel: 0
Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request
Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 26 d8 00 00 20 00 length 16384 SMID 260 terminated ioc 804b scsi 0 state c xfer(da7: 0
Apr 25 05:34:51 knew kernel: mps1:0: (da7:mps1:0:17:0): READ(10). CDB: 28 00 e6 6c 42 40 00 00 10 00 length 8192 SMID 484 terminated ioc 804b scsi 0 state c xfer 17:0
Apr 25 05:34:51 knew kernel: 0): (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 e4 d8 2a 90 00 00 90 00 length 73728 SMID 548 terminated ioc 804b scsi 0 state c xfeError 5, Periph was invalidated
Apr 25 05:34:51 knew kernel: r 0
Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 26 f8 00 00 20 00
Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00 length 4096 SMID 435 terminated ioc 804b scsi 0 state c xfer (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request
Apr 25 05:34:51 knew kernel: 0
Apr 25 05:34:51 knew kernel: (da7:mps1: mps1:0:IOCStatus = 0x4b while resetting device 0xa
Apr 25 05:34:51 knew kernel: 17:mps1: 0): Unfreezing devq for target ID 17
Apr 25 05:34:51 knew kernel: Error 5, Periph was invalidated
Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 26 d8 00 00 20 00
Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request
Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Error 5, Periph was invalidated
Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 e6 6c 42 40 00 00 10 00
Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request
Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Error 5, Periph was invalidated
Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 e4 d8 2a 90 00 00 90 00
Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request
Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Error 5, Periph was invalidated
Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00
Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request
Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Error 5, Periph was invalidated
Apr 25 05:34:51 knew kernel: GEOM_MIRROR: Device swap: provider da7p2 disconnected.
Apr 25 05:34:51 knew devd: Executing 'logger -p kern.notice -t ZFS 'vdev is removed, pool_guid=15378250086669402288 vdev_guid=13410079654596185797''
Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Periph destroyed
Apr 25 05:34:51 knew ZFS: vdev is removed, pool_guid=15378250086669402288 vdev_guid=13410079654596185797

--
Dan Langille - BSDCan / PGCon
dan@langille.org

From owner-freebsd-scsi@freebsd.org Mon Apr 25 15:35:41 2016
Return-Path:
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id A210DB1C971 for ; Mon, 25 Apr 2016 15:35:41 +0000 (UTC) (envelope-from scott4long@yahoo.com)
Received: from nm8-vm9.bullet.mail.gq1.yahoo.com (nm8-vm9.bullet.mail.gq1.yahoo.com [98.136.218.232]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 7F66B1DE4 for ; Mon, 25 Apr 2016 15:35:41 +0000 (UTC) (envelope-from scott4long@yahoo.com)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo.com; s=s2048; t=1461598351; bh=3c21/c+vh1hpjXB5ahtBTV6Hj4Wg1KhtqRawsU/ba50=; h=Subject:From:In-Reply-To:Date:Cc:References:To:From:Subject; b=NSfFBtakreadWRvG0gnuGltmVqA3E7lK/kkoiQl7u4gCaSAU3fkkkqCG44qtIEvlu8oB/l7WK/T5+qoJ1ZF0SVdVrGvaGaiVokG6313SzTF+cK++a0MRteSH6Zq0TGY5UmpAsWE6olsIZiLNEGbLmq7upoGFfh9cvEUT2ni7vxwlPeL51Z8G7mxstWxEO7/D+vK2zbIlV0nj2VjChMKeB7LnISHwQRvZztB0YvxbPKnIAIq6InBkBWw2SIaPjhGYpsd9uUM8NljMt12hU2cQmVr6zck01jGfj/WecJ9kbt+XcuWQFmD7HCkXe1VOVxSaFT8grl5Vx0NST62zpWlvtg==
Received: from [98.137.12.58] by nm8.bullet.mail.gq1.yahoo.com
with NNFMP; 25 Apr 2016 15:32:31 -0000
Received: from [208.71.42.207] by tm3.bullet.mail.gq1.yahoo.com with NNFMP; 25 Apr 2016 15:32:31 -0000
Received: from [127.0.0.1] by smtp218.mail.gq1.yahoo.com with NNFMP; 25 Apr 2016 15:32:31 -0000
X-Yahoo-Newman-Id: 531711.51360.bm@smtp218.mail.gq1.yahoo.com
X-Yahoo-Newman-Property: ymail-3
X-YMail-OSG: Xz1wmzgVM1kr2w812Rie9KvESNiqtuUSM74SyGmkJ0BaD6q jFcr88QwzZ2EQlzEmB494ssnD53qeoDTt.XLOXWFJ6G3Nm7vPC1D9xQ3dkiW thg_J2p5lIJKjlCD__K1d7PXTblEX_oyZDJDbgN8gi3Qr0UxLGlHXbgcIe6J 6SMNQb4g_vA9ICVK3SXjfaz9SjK017a0eIR3LcQFPYIYCUu3MaEMPWreJnwH bzyzmaTwXhQG7fRobGdW85LEbBolCHgOYtVaalPsVS7wEbxq123iUhdgPd8s a8WVa4SRKRgVm9C1acJQrx5eY5jM4U.ValyTRK3Sc07BQVQvqeuAzsHG138. .4iwQY.l702lhh_AyB._OEWJ9btJdu36Yq2hdG1_bjGW_ilcbkRpatTH04_s LlILBZwjw82XpdhGh8Nnt_vnKg7Pn.W8vuCQXfy03iH.oOXHeBfNdOBe5qG2 EH5WGUKM5Hw1y8l01t6qj6prrC8lbRd1NQpPwduQXFgXGM8m1Kqbua8OK2Rk hdImPSZIFhP26RjdKepOIzQTA6KMqrCKl.B.frz4-
X-Yahoo-SMTP: clhABp.swBB7fs.LwIJpv3jkWgo2NU8-
Content-Type: text/plain; charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
Subject: Re: terminated ioc 804b scsi 0 state c xfer 0
From: Scott Long
In-Reply-To: <2E8752E5-76AF-4042-86D9-8C6733658A80@langille.org>
Date: Mon, 25 Apr 2016 09:32:30 -0600
Cc: freebsd-scsi@freebsd.org
Content-Transfer-Encoding: quoted-printable
Message-Id:
References: <2E8752E5-76AF-4042-86D9-8C6733658A80@langille.org>
To: Dan Langille
X-Mailer: Apple Mail (2.3124)
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: SCSI subsystem
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 25 Apr 2016 15:35:41 -0000

Hi Dan,

Can you share the entire console log for the uptime? What you’ve pasted is missing the initial messages of the problem. The “terminated ioc” messages are likely because the driver has decided to reset the drive and terminate all outstanding I/O to it. In other words, they’re red herrings. The reason for the driver deciding to do the reset is likely earlier in the log.

Thanks,
Scott

From owner-freebsd-scsi@freebsd.org Mon Apr 25 15:36:16 2016
Return-Path:
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 8C95AB1C9E0 for ; Mon, 25 Apr 2016 15:36:16 +0000 (UTC) (envelope-from dan@langille.org)
Received: from clavin1.langille.org (clavin.langille.org [162.208.116.86]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "clavin.langille.org", Issuer "StartCom Class 2 Primary Intermediate Server CA" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 6E9DB1E25 for ; Mon, 25 Apr 2016 15:36:15 +0000 (UTC) (envelope-from dan@langille.org)
Received: from (clavin1.int.langille.org (clavin1.int.unixathome.org [10.4.7.7]) (Authenticated sender: hidden) with ESMTPSA id D42DD5DF ; Mon, 25 Apr 2016 15:36:11 +0000 (UTC)
Content-Type: text/plain; charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
Subject: Re: terminated ioc 804b scsi 0 state c xfer 0
From: Dan Langille
In-Reply-To:
Date: Mon, 25 Apr 2016 11:36:11 -0400
Cc:
freebsd-scsi@freebsd.org
Content-Transfer-Encoding: quoted-printable
Message-Id:
References: <2E8752E5-76AF-4042-86D9-8C6733658A80@langille.org>
To: Scott Long
X-Mailer: Apple Mail (2.3124)
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: SCSI subsystem
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 25 Apr 2016 15:36:16 -0000

> On Apr 25, 2016, at 11:32 AM, Scott Long wrote:
>
> Hi Dan,
>
> Can you share the entire console log for the uptime? What you’ve pasted is missing the initial messages of the problem. The “terminated ioc” messages are likely because the driver has decided to reset the drive and terminate all outstanding I/O to it. In other words, they’re red herrings. The reason for the driver deciding to do the reset is likely earlier in the log.

Yes. Does this help at all? The 'core dumped' messages relate to Bacula regression testing. I don't think there is anything helpful here for you:

Apr 13 07:59:52 knew kernel: (sa0:sym0:0:1:0): 64512-byte tape record bigger than supplied buffer
Apr 13 12:06:14 knew kernel: pid 57706 (bacula-sd), uid 1001: exited on signal 11 (core dumped)
Apr 13 15:17:42 knew sshd[31059]: fatal: Read from socket failed: Connection reset by peer [preauth]
Apr 14 07:23:05 knew kernel: sonewconn: pcb 0xfffff8035dd21dc8: Listen queue overflow: 8 already in queue awaiting acceptance (1 occurrences)
Apr 16 12:54:07 knew kernel: (sa0:sym0:0:1:0): 64512-byte tape record bigger than supplied buffer
Apr 17 03:19:05 knew kernel: pid 38425 (bacula-sd), uid 1001: exited on signal 11 (core dumped)
Apr 17 06:43:26 knew kernel: (sa0:sym0:0:1:0): 64512-byte tape record bigger than supplied buffer
Apr 17 06:55:53 knew kernel: (sa0:sym0:0:1:0): 64512-byte tape record bigger than supplied buffer
Apr 17 09:21:16 knew kernel: (sa0:sym0:0:1:0): 64512-byte tape record bigger than supplied buffer
Apr 19 18:12:19 knew kernel: (sa1:mps0:0:0:0): 64512-byte tape record bigger than supplied buffer
Apr 20 14:03:05 knew su: BAD SU dan to root on /dev/pts/2
Apr 20 14:03:11 knew last message repeated 2 times
Apr 20 14:03:15 knew su: dan to root on /dev/pts/2
Apr 20 18:52:14 knew kernel: (sa1:mps0:0:0:0): 64512-byte tape record bigger than supplied buffer
Apr 21 08:10:52 knew kernel: (sa0:sym0:0:1:0): 64512-byte tape record bigger than supplied buffer
Apr 23 03:56:28 knew kernel: pid 80961 (bacula-fd), uid 1002: exited on signal 11 (core dumped)
Apr 23 09:41:33 knew kernel: pid 51735 (bacula-sd), uid 1002: exited on signal 11 (core dumped)
Apr 24 05:14:46 knew kernel: pid 4529 (bacula-dir), uid 1002: exited on signal 11 (core dumped)
Apr 24 07:22:09 knew kernel: (sa0:sym0:0:1:0): 64512-byte tape record bigger than supplied buffer
Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8d 90 c6 18 00 00 10 00 length 8192 SMID 774 terminated ioc 804b scsi 0 state c xfer 0
Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 length 16384 SMID 614 terminated ioc 804b scsi 0 state c xfer 0
Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 50 00 00 20 00 length 16384 SMID 792 terminated ioc 804b scsi 0 state c xfer 0

The last three lines above are the same ones that appear at the start of my original message.

From owner-freebsd-scsi@freebsd.org Mon Apr 25 15:39:55 2016
Return-Path:
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id D9AD9B1CAAF for ; Mon, 25 Apr 2016 15:39:55 +0000 (UTC) (envelope-from dan@langille.org)
Received: from clavin2.langille.org (clavin2.langille.org [199.233.228.197]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "clavin.langille.org", Issuer "StartCom Class 2 Primary Intermediate Server CA" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id BB8F91FEC for ; Mon, 25 Apr 2016 15:39:55 +0000 (UTC) (envelope-from dan@langille.org)
Received: from (clavin2.int.langille.org
(clavin2.int.unixathome.org [10.4.7.7]) (Authenticated sender: hidden) with ESMTPSA id BD73E18440 for ; Mon, 25 Apr 2016 15:39:47 +0000 (UTC)
Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
Subject: Re: terminated ioc 804b scsi 0 state c xfer 0
From: Dan Langille
In-Reply-To: <5EEF0794-B06E-4A72-89DA-7DCD94AE1FC6@langille.org>
Date: Mon, 25 Apr 2016 11:39:46 -0400
Content-Transfer-Encoding: quoted-printable
Message-Id: <072CEC8B-9392-4378-8DF5-63D05901850B@langille.org>
References: <2E8752E5-76AF-4042-86D9-8C6733658A80@langille.org> <5EEF0794-B06E-4A72-89DA-7DCD94AE1FC6@langille.org>
To: freebsd-scsi@freebsd.org
X-Mailer: Apple Mail (2.3124)
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: SCSI subsystem
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 25 Apr 2016 15:39:55 -0000

Current status: after powering off the box and reseating the cables for that drive, I powered up the system and a resilver commenced, which completed in 30 minutes. Seems OK now.

I am not sure if the two events are related.
-- 
Dan Langille - BSDCan / PGCon
dan@langille.org

From owner-freebsd-scsi@freebsd.org Mon Apr 25 16:38:46 2016 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 1F11DB1CAC3 for ; Mon, 25 Apr 2016 16:38:46 +0000 (UTC) (envelope-from stephen.mcconnell@broadcom.com) Received: from mail-pf0-x229.google.com (mail-pf0-x229.google.com [IPv6:2607:f8b0:400e:c00::229]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id EE4441F9B for ; Mon, 25 Apr 2016 16:38:45 +0000 (UTC) (envelope-from stephen.mcconnell@broadcom.com) Received: by mail-pf0-x229.google.com with SMTP id n1so70522112pfn.2 for ; Mon, 25 Apr 2016 09:38:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=broadcom.com; s=google; h=from:to:references:in-reply-to:subject:date:message-id:mime-version :content-transfer-encoding:thread-index:content-language; bh=gtvsP/PRtyEWdBckYTQWmi311NvpJi+S6bsgV9uGNws=; b=DEHnjcFBrfC4ZNqcbBcxzLPXcnUMniCjB2gjA1VLHliiTDj5jvuj3/GaakZon192CO K5TLFh6lrlpn4jceqY2WiRfLmKfZCOnRzBaABkB3n6/CEMF2C2TmbrHkH+U1/D5tlC9i jMe4tQeO8FAqnOdS4QcAXe0hlzhrd3fO8MNKg= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:from:to:references:in-reply-to:subject:date :message-id:mime-version:content-transfer-encoding:thread-index :content-language; bh=gtvsP/PRtyEWdBckYTQWmi311NvpJi+S6bsgV9uGNws=; b=Pm2VG2DVd+S43licRMT97gRVtijYuyNE7nMMiUB6s5xi9IFUbkr3c5oZnpafvZSAE9 mWCMCwYs6YvKwx3TfFoy5GPCAhnbGp5JgSmjYowsOGRiL6XjjKvNk3p801ZxelHte94m 9lH9/Gl5XQTjMxDmRVgnLsOkTLg7BXbcGahJahcLYC2HApYOn4os5SmJC/+1sodC0HPO mBTtnMszHcUEUINVAhBNZlA49t3mz41nNeOgYQVor8iGhhN/iQOGMmpS8OEtxRSHWn9E pjOl0DcS0kY94+8XppzEpwM0M//Thb5zYXGqDl6JsQnWagP96IqE7Tdzdzl/arxpixXe hTCQ==
X-Gm-Message-State: AOPr4FWwwP9uMhysxHf8Uv2ztokLvSQrWzZmUv2r/BrDddaLoUnVIHO1G+BXNIwF0y984+nl X-Received: by 10.98.35.12 with SMTP id j12mr17614594pfj.73.1461602325005; Mon, 25 Apr 2016 09:38:45 -0700 (PDT) Received: from C5SDN12 ([192.19.220.253]) by smtp.gmail.com with ESMTPSA id d13sm29381315pfd.80.2016.04.25.09.38.42 (version=TLSv1/SSLv3 cipher=OTHER); Mon, 25 Apr 2016 09:38:43 -0700 (PDT) From: "Stephen McConnell" To: "'Dan Langille'" , References: <2E8752E5-76AF-4042-86D9-8C6733658A80@langille.org> <5EEF0794-B06E-4A72-89DA-7DCD94AE1FC6@langille.org> <072CEC8B-9392-4378-8DF5-63D05901850B@langille.org> In-Reply-To: <072CEC8B-9392-4378-8DF5-63D05901850B@langille.org> Subject: RE: terminated ioc 804b scsi 0 state c xfer 0 Date: Mon, 25 Apr 2016 10:38:41 -0600 Message-ID: <0d7401d19f10$ee329300$ca97b900$@broadcom.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Outlook 14.0 Thread-Index: AQJ4ucc4xV6h5VXTBEGxwBGwzw4I/wFGbfNtAl4WzZeeL4AMwA== Content-Language: en-us X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 25 Apr 2016 16:38:46 -0000 > -----Original Message----- > From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd- > scsi@freebsd.org] On Behalf Of Dan Langille > Sent: Monday, April 25, 2016 9:40 AM > To: freebsd-scsi@freebsd.org > Subject: Re: terminated ioc 804b scsi 0 state c xfer 0 > > > On Apr 25, 2016, at 8:17 AM, Dan Langille wrote: > > > >> > >> On Apr 24, 2016, at 9:35 AM, Dan Langille wrote: > >> > >> More of the pasted output is also at > https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae > > and added smartctl output. > >> > >> I have a FreeBSD 10.2-RELEASE-p14 box in which there is an LSI SAS2008 > card. It's running a zfs root system. > >> > >> This morning the system was unresponsive via ssh. 
Attempts to log in at > the console did not yield a password prompt. > >> > >> A power cycle brought the system online. Inspecting /var/log/messages, I > found about 63,000 entries similar to those which appear below. > >> > >> zpool status of all are OK. A scrub is in progress for one pool (since before > this issue arose). da7 is in that pool. > >> > >> > >> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 > >> 8d 90 c6 18 00 00 10 00 length 8192 SMID 774 terminated ioc 804b scsi > >> 0 state c xfer 0 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): > >> READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 length 16384 SMID 614 > >> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew > >> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 50 00 00 20 > >> 00 length 16384 SMID 792 terminated ioc 804b scsi 0 state c xfer 0 > >> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 > >> 8b d9 97 08 00 00 20 00 length 16384 SMID 974 terminated ioc 804b > >> scsi 0 state c xfer 0 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): > >> READ(10). CDB: 28 00 8b 6f ef 50 00 00 08 00 length 4096 SMID 674 > >> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew > >> kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 8b 0f a2 48 00 00 18 > >> 00 length 12288 SMID 177 terminated ioc 804b scsi 0 state c xfer > >> 12288 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: > >> 28 00 ab 8f a1 38 00 00 08 00 length 4096 SMID 908 terminated ioc > >> 804b scsi 0 state c xfer 0 Apr 24 11:25:56 knew kernel: > >> (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 > >> length 16384 SMID 376 terminated ioc 804b scsi 0 state c xfer 0 Apr > >> 24 11:25:56 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b > >> d9 97 50 00 00 20 00 length 16384 SMID 172 terminated ioc 804b scsi 0 > >> state c xfer 0 > >> > >> Is this a cabling issue? The drive is a SATA device (smartctl output in the > URL above). 
Anyone familiar with these errors? > > > > This morning: > > > > 13410079654596185797 REMOVED 0 0 0 was /dev/da7p3 > > > > At least I know i'm looking for Serial Number: 13Q8PNBYS > > > > From the logs: > > > > Apr 25 05:34:50 knew kernel: da7 at mps1 bus 0 scbus1 target 17 lun 0 > > Apr 25 05:34:50 knew kernel: da7: s/n > 13Q8PNBYS detached > > Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 > > d8 33 53 e0 00 00 08 00 length 4096 SMID 88 terminated ioc 804b scsi 0 > > state c xfer 0 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): > > READ(10). CDB: 28 00 d8 33 26 f8 00 00 20 00 length 16384 SMID 204 > > terminated ioc 804b scsi 0 state c xfer(da7:mps1:0:17:0): READ(10). CDB: 28 > 00 d8 33 53 e0 00 00 08 00 Apr 25 05:34:51 knew kernel: 0 Apr 25 05:34:51 > knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue > Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 > 00 d8 33 26 d8 00 00 20 00 length 16384 SMID 260 terminated ioc 804b scsi 0 > state c xfer(da7: 0 > > Apr 25 05:34:51 knew kernel: mps1:0: (da7:mps1:0:17:0): READ(10). CDB: > 28 00 e6 6c 42 40 00 00 10 00 length 8192 SMID 484 terminated ioc 804b scsi 0 > state c xfer 17:0 > > Apr 25 05:34:51 knew kernel: 0): (da7:mps1:0:17:0): WRITE(10). CDB: 2a > 00 e4 d8 2a 90 00 00 90 00 length 73728 SMID 548 terminated ioc 804b scsi 0 > state c xfeError 5, Periph was invalidated > > Apr 25 05:34:51 knew kernel: r 0 > > Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 > > d8 33 26 f8 00 00 20 00 Apr 25 05:34:51 knew kernel: > > (da7:mps1:0:17:0): READ(10). 
CDB: 28 00 4d ac ed b8 00 00 08 00 length > > 4096 SMID 435 terminated ioc 804b scsi 0 state c xfer > > (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request Apr 25 > > 05:34:51 knew kernel: 0 Apr 25 05:34:51 knew kernel: (da7:mps1: > > mps1:0:IOCStatus = 0x4b while resetting device 0xa Apr 25 05:34:51 > > knew kernel: 17:mps1: 0): Unfreezing devq for target ID 17 Apr 25 > > 05:34:51 knew kernel: Error 5, Periph was invalidated Apr 25 05:34:51 > > knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 26 d8 00 00 > > 20 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: > > Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel: > > (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 > > knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 e6 6c 42 40 00 00 > > 10 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: > > Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel: > > (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 > > knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 e4 d8 2a 90 00 > > 00 90 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: > Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel: > (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 knew > kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00 Apr > 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re- > queue Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Error 5, > Periph was invalidated Apr 25 05:34:51 knew kernel: GEOM_MIRROR: Device > swap: provider da7p2 disconnected. 
> > Apr 25 05:34:51 knew devd: Executing 'logger -p kern.notice -t ZFS 'vdev is
> removed, pool_guid=15378250086669402288
> vdev_guid=13410079654596185797''
> > Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Periph destroyed Apr
> > 25 05:34:51 knew ZFS: vdev is removed, pool_guid=15378250086669402288
> > vdev_guid=13410079654596185797
>
> Current status: after powering off the box, reseating the cables for that drive, I
> powered up the system and a resilver commenced which completed in 30
> minutes.
>
> Seems OK now. I am not sure if the two events are related.

Recently, a bug was uncovered where a device gets 'lost'.

Here's what happens:
The first message in your "failure on Monday" log is for 'mpssas_prepare_remove'. This message is likely logged because the FW sends an event to the driver that the device is no longer responsive (pulled, cable issue, or something else). When the driver gets this event, it sends a reset to the device to clear out any pending I/O. This is where the 'terminated ioc' messages come from. When the reset completes, the driver is supposed to send a SAS_IO_UNIT message to FW so that the DevHandle for that disk is removed from FW's list. Then, when the device comes back on-line, everything is fine.

But, with this bug, before that SAS_IO_UNIT message is sent to FW, the driver exits the function where that happens (mpssas_remove_device). This happens where you see the log message, "IOCStatus = 0x4b while resetting device 0xa". The driver logs that message and then exits. What the driver should do is log that message and continue on to send the SAS_IO_UNIT message to FW.
The fix is to remove the two lines in the driver shown here with '>>':

    if (le16toh(reply->IOCStatus) != MPI2_IOCSTATUS_SUCCESS) {
        mps_printf(sc, "IOCStatus = 0x%x while resetting device 0x%x\n",
            le16toh(reply->IOCStatus), handle);
>>      mpssas_free_tm(sc, tm);
>>      return;
    }

A reboot will solve the problem, as you saw, but the real fix is to remove the DevHandle as described above. This fix will go into the driver before the next scheduled release and will then be MFC'd to 10.x.

Steve

>
> --
> Dan Langille - BSDCan / PGCon
> dan@langille.org
>
> _______________________________________________
> freebsd-scsi@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"

From owner-freebsd-scsi@freebsd.org Mon Apr 25 17:32:57 2016 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id D1B87B1C9CB for ; Mon, 25 Apr 2016 17:32:57 +0000 (UTC) (envelope-from scott4long@yahoo.com) Received: from nm23-vm8.bullet.mail.gq1.yahoo.com (nm23-vm8.bullet.mail.gq1.yahoo.com [98.136.217.87]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id ABC9410EE for ; Mon, 25 Apr 2016 17:32:57 +0000 (UTC) (envelope-from scott4long@yahoo.com) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo.com; s=s2048; t=1461605445; bh=1eyZoYwWqR4aQQTicpSZIZi7szLhyC4e/5ozQwpoDaE=; h=Subject:From:In-Reply-To:Date:Cc:References:To:From:Subject;
b=CFeELlvO7riq5wmalhiae2VJ7B4KyqT6buLnmXySTT2Fo3k4uYKUQApGDtyvJtTK96nwKi2Xz0FqsAI21HcZ0wYvgrNpp6Jb1JMIj4ff3SKRoVKRfbDeNyqHZcmHvRoYxK5jmjBLPGRYbTteQxXgnHXosisXhbL8kmgIGQf6vhLmKwqIAXQSu7cjEf0OkWtFU5uC32y6DwwUwgvWdyhmPaBRkU44BZxfP0P5h1n0IbWx+So3Dnptkw7Qh1FeI8JfpFTOSXS4Kmr0fosVdzvO+zXg2IkzyX6F+CoyIGYyZ7QtU//n6/D4ghKqHzAw+ChtGddFNdNpsLjmK5q8Hs5B0g== Received: from [98.137.12.189] by nm23.bullet.mail.gq1.yahoo.com with NNFMP; 25 Apr 2016 17:30:45 -0000 Received: from [208.71.42.209] by tm10.bullet.mail.gq1.yahoo.com with NNFMP; 25 Apr 2016 17:30:45 -0000 Received: from [127.0.0.1] by smtp220.mail.gq1.yahoo.com with NNFMP; 25 Apr 2016 17:30:45 -0000 X-Yahoo-Newman-Id: 808136.73661.bm@smtp220.mail.gq1.yahoo.com X-Yahoo-Newman-Property: ymail-3 X-YMail-OSG: s1H13asVM1nR1lLpQ5XuVePBuoYy199BnDoHWphKCUNsqwP iC.LaxvB.X6fbTAxEZY09SutUwk94UPvN4.h.xkkm5kzmYbCwDaW2qjktMkC ETUBvpudXKaV721oxH4sotHdqlnR5gH2FPiGufHIynbo9MqUB8gsjS13h0eo nu73iCxHntNLL40y9PARobH.eYZU0J.6Fw27tp48uz_dKItp2LUfoaZBUo7h Po0mQeJf4iEShl0Y5zEs4Dpt.U3IB4TJORL7cg_QwP79wdolLbO9cO7tfJuF erW5NV3qZN12FHEQG0VHvqSfGHGAe4wIM6BOGa2JCd4VIxZBfsMs7ZAcs51O u7J2OZ3ml7B6dgNFvsgzXnheA9xWCx8ikSJEu5c3dAUOvxCwKgFCn92q6jQI 6wL9oPgTDfea.UP8FO4godlUXHpLaqCMDSOCti1IG.xNQmE8H7rrkcheRzOO 8FsUMv9UvmTTd4akJQDwDloQAkqsO9Ca7S2co1Zp9_7T0.Gar9b8IkNtdfCc jkdN0rZZPPZy.I5Ll5j0cm7gHcIJ9cJHFOEsQsy0- X-Yahoo-SMTP: clhABp.swBB7fs.LwIJpv3jkWgo2NU8- Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: terminated ioc 804b scsi 0 state c xfer 0 From: Scott Long In-Reply-To: <0d7401d19f10$ee329300$ca97b900$@broadcom.com> Date: Mon, 25 Apr 2016 11:30:41 -0600 Cc: Dan Langille , freebsd-scsi@freebsd.org Content-Transfer-Encoding: quoted-printable Message-Id: <3D06CE25-4159-4F30-A8C5-8A188144681B@yahoo.com> References: <2E8752E5-76AF-4042-86D9-8C6733658A80@langille.org> <5EEF0794-B06E-4A72-89DA-7DCD94AE1FC6@langille.org> <072CEC8B-9392-4378-8DF5-63D05901850B@langille.org> 
<0d7401d19f10$ee329300$ca97b900$@broadcom.com> To: Stephen McConnell X-Mailer: Apple Mail (2.3124) X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 25 Apr 2016 17:32:57 -0000

> On Apr 25, 2016, at 10:38 AM, Stephen McConnell via freebsd-scsi wrote:
>
>
>
>> -----Original Message-----
>> From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd-scsi@freebsd.org] On Behalf Of Dan Langille
>> Sent: Monday, April 25, 2016 9:40 AM
>> To: freebsd-scsi@freebsd.org
>> Subject: Re: terminated ioc 804b scsi 0 state c xfer 0
>>
>>> On Apr 25, 2016, at 8:17 AM, Dan Langille wrote:
>>>
>>>>
>>>> On Apr 24, 2016, at 9:35 AM, Dan Langille wrote:
>>>>
>>>> More of the pasted output is also at
>> https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae
>>
>> and added smartctl output.
>>>>
>>>> I have a FreeBSD 10.2-RELEASE-p14 box in which there is an LSI SAS2008
>> card. It's running a zfs root system.
>>>>
>>>> This morning the system was unresponsive via ssh. Attempts to log in at
>> the console did not yield a password prompt.
>>>>
>>>> A power cycle brought the system online. Inspecting /var/log/messages, I
>> found about 63,000 entries similar to those which appear below.
>>>>
>>>> zpool status of all are OK. A scrub is in progress for one pool (since before
>> this issue arose). da7 is in that pool.
>>>>
>>>>
>>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00
>>>> 8d 90 c6 18 00 00 10 00 length 8192 SMID 774 terminated ioc 804b scsi
>>>> 0 state c xfer 0 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0):
>>>> READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 length 16384 SMID 614
>>>> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew
>>>> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 50 00 00 20
>>>> 00 length 16384 SMID 792 terminated ioc 804b scsi 0 state c xfer 0
>>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00
>>>> 8b d9 97 08 00 00 20 00 length 16384 SMID 974 terminated ioc 804b
>>>> scsi 0 state c xfer 0 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0):
>>>> READ(10). CDB: 28 00 8b 6f ef 50 00 00 08 00 length 4096 SMID 674
>>>> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew
>>>> kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 8b 0f a2 48 00 00 18
>>>> 00 length 12288 SMID 177 terminated ioc 804b scsi 0 state c xfer
>>>> 12288 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB:
>>>> 28 00 ab 8f a1 38 00 00 08 00 length 4096 SMID 908 terminated ioc
>>>> 804b scsi 0 state c xfer 0 Apr 24 11:25:56 knew kernel:
>>>> (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00
>>>> length 16384 SMID 376 terminated ioc 804b scsi 0 state c xfer 0 Apr
>>>> 24 11:25:56 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b
>>>> d9 97 50 00 00 20 00 length 16384 SMID 172 terminated ioc 804b scsi 0
>>>> state c xfer 0
>>>>
>>>> Is this a cabling issue? The drive is a SATA device (smartctl output in the
>> URL above). Anyone familiar with these errors?
>>>
>>> This morning:
>>>
>>> 13410079654596185797 REMOVED 0 0 0 was /dev/da7p3
>>>
>>> At least I know i'm looking for Serial Number: 13Q8PNBYS
>>>
>>> From the logs:
>>>
>>> Apr 25 05:34:50 knew kernel: da7 at mps1 bus 0 scbus1 target 17 lun 0
>>> Apr 25 05:34:50 knew kernel: da7: s/n
>> 13Q8PNBYS detached
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00
>>> d8 33 53 e0 00 00 08 00 length 4096 SMID 88 terminated ioc 804b scsi 0
>>> state c xfer 0 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0):
>>> READ(10). CDB: 28 00 d8 33 26 f8 00 00 20 00 length 16384 SMID 204
>>> terminated ioc 804b scsi 0 state c xfer(da7:mps1:0:17:0): READ(10). CDB: 28
>> 00 d8 33 53 e0 00 00 08 00 Apr 25 05:34:51 knew kernel: 0 Apr 25 05:34:51
>> knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue
>> Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28
>> 00 d8 33 26 d8 00 00 20 00 length 16384 SMID 260 terminated ioc 804b scsi 0
>> state c xfer(da7: 0
>>> Apr 25 05:34:51 knew kernel: mps1:0: (da7:mps1:0:17:0): READ(10). CDB:
>> 28 00 e6 6c 42 40 00 00 10 00 length 8192 SMID 484 terminated ioc 804b scsi 0
>> state c xfer 17:0
>>> Apr 25 05:34:51 knew kernel: 0): (da7:mps1:0:17:0): WRITE(10). CDB: 2a
>> 00 e4 d8 2a 90 00 00 90 00 length 73728 SMID 548 terminated ioc 804b scsi 0
>> state c xfeError 5, Periph was invalidated
>>> Apr 25 05:34:51 knew kernel: r 0
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00
>>> d8 33 26 f8 00 00 20 00 Apr 25 05:34:51 knew kernel:
>>> (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00 length
>>> 4096 SMID 435 terminated ioc 804b scsi 0 state c xfer
>>> (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request Apr 25
>>> 05:34:51 knew kernel: 0 Apr 25 05:34:51 knew kernel: (da7:mps1:
>>> mps1:0:IOCStatus = 0x4b while resetting device 0xa Apr 25 05:34:51
>>> knew kernel: 17:mps1: 0): Unfreezing devq for target ID 17 Apr 25
>>> 05:34:51 knew kernel: Error 5, Periph was invalidated Apr 25 05:34:51
>>> knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 26 d8 00 00
>>> 20 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
>>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
>>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51
>>> knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 e6 6c 42 40 00 00
>>> 10 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
>>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
>>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51
>>> knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 e4 d8 2a 90 00
>>> 00 90 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 knew
>> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00 Apr
>> 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue
>> Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Error 5,
>> Periph was invalidated Apr 25 05:34:51 knew kernel: GEOM_MIRROR: Device
>> swap: provider da7p2 disconnected.
>>> Apr 25 05:34:51 knew devd: Executing 'logger -p kern.notice -t ZFS 'vdev is
>> removed, pool_guid=15378250086669402288
>> vdev_guid=13410079654596185797''
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Periph destroyed Apr
>>> 25 05:34:51 knew ZFS: vdev is removed, pool_guid=15378250086669402288
>>> vdev_guid=13410079654596185797
>>
>> Current status: after powering off the box, reseating the cables for that drive, I
>> powered up the system and a resilver commenced which completed in 30
>> minutes.
>>
>> Seems OK now. I am not sure if the two events are related.
>
> Recently, a bug was uncovered where a device is gets 'lost'.
>
> Here's what happens:
> The first message in your "failure on Monday" log is for
> 'mpssas_prepare_remove'. This message is likely logged because the FW sends
> an event to the driver that the device is no longer responsive (pulled,
> cable issue, or something else). When the driver gets this event, it sends
> a reset to the device to clear out any pending I/O. This is where the
> 'terminated ioc' messages come from.
> When the reset completes, the driver
> is supposed to send a SAS_IO_UNIT message to FW so that the DevHandle for
> that disk is removed from FW's list. Then, when the device comes back
> on-line, everything is fine. But, with this bug, before that SAS_IO_UNIT
> message is sent to FW, the driver exits the function where that happens
> (mpssas_remove_device). This happens where you see the log message,
> "IOCStatus - 0x4b while resetting device 0x0a". The driver logs that
> message and then exits. What the driver should do is log that message and
> continue on to send the SAS_IO_UNIT message to FW. The fix is to remove the
> two lines in the driver shown here with '>>':
>
>     if (le16toh(reply->IOCStatus) != MPI2_IOCSTATUS_SUCCESS) {
>         mps_printf(sc, "IOCStatus = 0x%x while resetting device 0x%x\n",
>             le16toh(reply->IOCStatus), handle);
> >>      mpssas_free_tm(sc, tm);
> >>      return;
>     }
>
> A reboot will solve the problem, as you saw, but the real fix is to remove
> the DevHandle as described above. This fix will go into the driver before
> the next scheduled release and then MFC'd to 10.x.
>

Thanks for the diagnosis, Steve. I forgot about that case. We should also make this chain of events more evident in the syslog; it’s very confusing when it happens. I’m not exactly sure yet what it should look like.

Scott