From owner-freebsd-stable@FreeBSD.ORG Tue Jul 9 15:56:56 2013
Date: Tue, 9 Jul 2013 08:56:41 -0700
From: Jeremy Chadwick
To: Outback Dingo
Cc: freebsd-stable@freebsd.org
Subject: Re: Stable/9 from today mpssas_scsiio timeouts
Message-ID: <20130709155641.GA9350@icarus.home.lan>
References: <20130709123900.GA5828@icarus.home.lan> <20130709144614.GA7538@icarus.home.lan> <20130709153058.GA8769@icarus.home.lan>
List-Id: Production branch of FreeBSD source code
X-List-Received-Date: Tue, 09 Jul 2013 15:56:56 -0000

On Tue, Jul 09, 2013 at 11:46:24AM -0400, Outback Dingo wrote:
> On Tue, Jul 9, 2013 at 11:30 AM, Jeremy Chadwick wrote:
>
> > On Tue, Jul 09, 2013 at 11:20:45AM -0400, Outback Dingo wrote:
> > > On Tue, Jul 9, 2013 at 10:46 AM, Jeremy Chadwick wrote:
> > >
> > > > On Tue, Jul 09, 2013 at 09:47:01AM -0400, Outback Dingo wrote:
> > > > > On Tue, Jul 9, 2013 at 9:44 AM, Outback Dingo <outbackdingo@gmail.com> wrote:
> > > > > > On Tue, Jul 9, 2013 at 8:39 AM, Jeremy Chadwick wrote:
> > > > > >
> > > > > >> On Tue, Jul 09, 2013 at 05:32:39AM -0400, Outback Dingo wrote:
> > > > > >> > as of stable today im seeing alot of new mps time outs
> > > > > >> >
> > > > > >> > 9.1-STABLE FreeBSD 9.1-STABLE #0 r253035M: Mon Jul 8 16:34:28 UTC 2013
> > > > > >> > root@:/usr/obj/nas/usr/src/sys/
> > > > > >> >
> > > > > >> > mps1@pci0:130:0:0: class=0x010700 card=0x30201000 chip=0x00721000 rev=0x03 hdr=0x00
> > > > > >> >     vendor = 'LSI Logic / Symbios Logic'
> > > > > >> >     device = 'SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]'
> > > > > >> >     class = mass storage
> > > > > >> >     subclass = SAS
> > > > > >> >
> > > > > >> >
> > > > > >> > mps0: mpssas_scsiio_timeout checking sc 0xffffff8002145000 cm 0xffffff80021a6b78
> > > > > >> > (probe40:mps0:0:40:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 983 command timeout cm 0xffffff80021a6b78 ccb 0xfffffe002bb5f800
> > > > > >> > mps0: mpssas_alloc_tm freezing simq
> > > > > >> > mps0: timedout cm 0xffffff80021a6b78 allocated tm 0xffffff80021587b0
> > > > > >> > (probe40:mps0:0:40:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 983 completed timedout cm 0xffffff80021a6b78 ccb 0xfffffe002bb5f800 during recovery ioc 8048 scsi 0 state c xfer 0
> > > > > >> > (noperiph:mps0:0:40:0): SMID 6 abort TaskMID 983 status 0x4a code 0x0 count 1
> > > > > >> > (noperiph:mps0:0:40:0): SMID 6 finished recovery after aborting TaskMID 983
> > > > > >> > mps0: mpssas_free_tm releasing simq
> > > > > >> > (probe40:mps0:0:40:0): INQUIRY. CDB: 12 00 00 00 24 00
> > > > > >> > (probe40:mps0:0:40:0): CAM status: Command timeout
> > > > > >> > (probe40:mps0:0:40:0): Retrying command
> > > > > >> > mps1: mpssas_scsiio_timeout checking sc 0xffffff8002384000 cm 0xffffff80023e5b78
> > > > > >> > (probe292:mps1:0:37:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 983 command timeout cm 0xffffff80023e5b78 ccb 0xfffffe002be14800
> > > > > >> > mps1: mpssas_alloc_tm freezing simq
> > > > > >> > mps1: timedout cm 0xffffff80023e5b78 allocated tm 0xffffff80023977b0
> > > > > >> > (probe292:mps1:0:37:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 983 completed timedout cm 0xffffff80023e5b78 ccb 0xfffffe002be14800 during recovery ioc 8048 scsi 0 state c xfer 0
> > > > > >> > (noperiph:mps1:0:37:0): SMID 6 abort TaskMID 983 status 0x4a code 0x0 count 1
> > > > > >> > (noperiph:mps1:0:37:0): SMID 6 finished recovery after aborting TaskMID 983
> > > > > >> > mps1: mpssas_free_tm releasing simq
> > > > > >> > (probe292:mps1:0:37:0): INQUIRY. CDB: 12 00 00 00 24 00
> > > > > >> > (probe292:mps1:0:37:0): CAM status: Command timeout
> > > > > >> > (probe292:mps1:0:37:0): Retrying command
> > > > > >>
> > > > > >> 1. What revision were you running before (i.e.
> > > > > >> what were you on prior to the upgrade)?
> > > > > >>
> > > > > >
> > > > >
> > > > > Sorry I was on 252595 from July 3
> > > >
> > > > And does rolling back to r252595 resolve the problem for you?
> > > >
> > > > Because the only commit I see between r253035 and r252595 that might
> > > > account for some kind of behavioural change, unless I missed one while
> > > > skimming the commit history, is the following:
> > > >
> > > > r252730 -- http://www.freshbsd.org/commit/freebsd/r252730
> > > >
> > > > If at all possible, please try updating to r253037 or newer to see
> > > > if that has some effect/improvement. Why I mention that commit:
> > > >
> > > > r253037 -- http://www.freshbsd.org/commit/freebsd/r253037
> > > >
> > > > Because the only mps(4) changes done in recent days are:
> > > >
> > > > http://svnweb.freebsd.org/base/stable/9/sys/dev/mps/mps_sas.c?view=log
> > > >
> > > > r253037
> > > > r251899
> > > > r251874
> > > >
> > >
> > > i can say this its between July 4, and 253048, im rolling back to 252723 to
> > > validate a good known working state
> >
> > Looking at your dmesg, it looks like the "errors" might be for SAS ports
> > which don't have any actual devices (disks) attached to them, yet parts
> > of the kernel (not sure which layer) are still trying to submit INQUIRY
> > commands to those ports as if they did have disks attached.
> >
> > It looks like you see this behaviour on boot up, and then later during
> > normal operation at some point (a LUN scan or rescan or "bus taste"
> > might cause this to happen; for example I know that "zpool import" in
> > effect can sometimes cause this behaviour -- on one of my systems "zpool
> > import" would cause the servers' floppy drive to spin up/chunk briefly).
> >
> > I'm hoping Steven or mav@ might be able to confirm/deny my theory here.
> >
>
> I see it even when trying to write to the pool via NFS or FTP, which even
> times out on large files now. It was all working, and there are 2
> controllers set up in an HA configuration, but they did work fine before,
> so I'll roll back and try an earlier kernel, then walk forward till I hit
> the problem. My only issue was that I moved forward to get the newer ixgbe
> driver and others just committed to stable, only to find that SAS was now
> quirky. Welcome to stable. Either way, the overall performance on this box
> has been in question; I just haven't been able to confirm whether it's the
> enclosure, the NIC, or the zpool, which is degraded, but 40MB/s via NFS on
> a 10GbE NIC isn't good. So tweaking and testing seems to be moot until the
> box is at least stable again. I do appreciate the insight, and will do
> whatever's needed to hammer down the issue so it can be resolved.

Again, I would strongly suggest trying r253037 or newer first.

--
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |