From owner-freebsd-stable@FreeBSD.ORG Tue May 3 03:47:39 2011
Date: Mon, 2 May 2011 21:47:37 -0600
From: "Kenneth D. Merry"
To: Dmitry Morozovsky
Cc: freebsd-stable@FreeBSD.org
Subject: Re: mps driver instability under stable/8
Message-ID: <20110503034737.GA52416@nargothrond.kdm.org>
References: <20110430211927.GA67374@nargothrond.kdm.org>
List-Id: Production branch of FreeBSD source code

On Sun, May 01, 2011 at 14:42:21 +0400, Dmitry Morozovsky wrote:
> On Sat, 30 Apr 2011, Kenneth D.
Merry wrote:
>
> KDM> On Fri, Apr 29, 2011 at 11:51:21 +0400, Dmitry Morozovsky wrote:
> KDM> > Dear Ken,
> KDM> >
> KDM> > I have SuperMicro Server with mps driver you managed, with 24 SATA disks under
> KDM> > SAS x36 expander with large ZFS
> KDM> >
> KDM> > Sometimes, under random disk load such as daily find, it lost all its devices:
> KDM> >
> KDM> > [-- MARK -- Fri Apr 29 03:00:00 2011]
> KDM> > mps0: IOC Fault 0x40005900, Resetting^M
> KDM> > (pass20:mps0:0:22:0): SCSI command timeout on device handle 0x0020 SMID 442^M
> KDM> > mps0: IOC Fault 0x40001500, Resetting^M
> KDM> > (da19:mps0:0:21:0): SCSI command timeout on device handle 0x001f SMID 172^M
> KDM> > (da19:mps0:0:21:0): SCSI command timeout on device handle 0x001f SMID 511^M
> KDM> > (da20:mps0:0:20:0): SCSI command timeout on device handle 0x001e SMID 240^M
> KDM> >
> KDM> > ..
> KDM> >
> KDM> > (da4:mps0:0:0:0): SCSI command timeout on device handle 0x000a SMID 844^M
> KDM> > (da22:mps0:0:23:0): SCSI command timeout on device handle 0x0021 SMID 713^M
> KDM> > (da18:mps0:0:22:0): SCSI command timeout on device handle 0x0020 SMID 603^M
> KDM> >
> KDM> > and hangs there forever (in zio state).
> KDM> >
> KDM> > I've prepared debugging kernel with DDB and would be glad to help catch the
> KDM> > situation.
> KDM>
> KDM> Hmm...
> KDM>
> KDM> Can you send full dmesg output?
>
> Attached

Thanks.  It looks like you have a SAS2008, with the 4.0 firmware.  I think
it would be worthwhile to upgrade to the 9.0 firmware.  I know for sure
there are issues with the 2.0 firmware, and I know the 9.0 firmware works
fairly well.  I don't know whether the 4.0 firmware has any severe issues,
but it would be good to eliminate firmware bugs before we chase driver
issues.

> KDM> What I'm most interested in is whether
> KDM> there is more kernel output before the IOC Fault that might shed some light
> KDM> on what is going on.
>
> Nope. I use boot_verbose, but none of mps-related debug options yet

Okay.
If there's nothing before the IOC fault message, then we really don't have
any clues to what caused the fault...  The rest is just fallout from the
IOC fault.

> KDM>
> KDM> Also, what brand (LSI, Maxim, etc.) and speed (3Gb, 6Gb) is the expander on
> KDM> the backplane?
>
> LSI 6G:

Okay.

> KDM> What model LSI controller do you have?  How many lanes are connected
> KDM> between the controller and the backplane?
>
> 2x4 IIR. BTW, how can investigate real SASA topology?

So 8 lanes total?  That's what I wanted to know.

The primary thing I'm getting at is to see how much lane contention we may
have.  With 24 SATA disks, you can only talk to 8 at a time with 8 lanes
connected from the controller to the backplane.

I've run into issues with a lot of contention with SATA drives, but that
was with a 3Gb Maxim expander.  In theory things should work better with an
LSI expander.  (You would think that they test scenarios like yours.)

> KDM> What model disks do you have in the system? (dmesg will show that
> KDM> obviously.)
>
> 24 x WD RE4 2T

Ok.  My SATA testing has been primarily with WD 2TB drives as well.

> KDM> Hopefully we can find some clues to point to the problem.
>
> /me too ;)
>
> Thank you very much!
>
> BTW, I have serial console, DDB kernel, so while this machine is in
> production, but not too heavy, and I can spend some time in kernel debugger if
> needed.

Well, I think the first thing to do is upgrade the firmware and see if that
fixes it.  If not, we'll start instrumenting things and see how much
information we can get about the cause of the fault.

Ken
-- 
Kenneth Merry
ken@FreeBSD.ORG