Date:      Mon, 2 May 2011 21:47:37 -0600
From:      "Kenneth D. Merry" <ken@FreeBSD.org>
To:        Dmitry Morozovsky <marck@rinet.ru>
Cc:        freebsd-stable@FreeBSD.org
Subject:   Re: mps driver instability under stable/8
Message-ID:  <20110503034737.GA52416@nargothrond.kdm.org>
In-Reply-To: <alpine.BSF.2.00.1105011434360.29081@woozle.rinet.ru>
References:  <alpine.BSF.2.00.1104291145080.29081@woozle.rinet.ru> <20110430211927.GA67374@nargothrond.kdm.org> <alpine.BSF.2.00.1105011434360.29081@woozle.rinet.ru>

On Sun, May 01, 2011 at 14:42:21 +0400, Dmitry Morozovsky wrote:
> On Sat, 30 Apr 2011, Kenneth D. Merry wrote:
> 
> KDM> On Fri, Apr 29, 2011 at 11:51:21 +0400, Dmitry Morozovsky wrote:
> KDM> > Dear Ken,
> KDM> > 
> KDM> > I have SuperMicro Server with mps driver you managed, with 24 SATA disks under 
> KDM> > SAS x36 expander with large ZFS
> KDM> > 
> KDM> > Sometimes, under random disk load such as daily find, it lost all its devices:
> KDM> > 
> KDM> > [-- MARK -- Fri Apr 29 03:00:00 2011]
> KDM> > mps0: IOC Fault 0x40005900, Resetting
> KDM> > (pass20:mps0:0:22:0): SCSI command timeout on device handle 0x0020 SMID 442
> KDM> > mps0: IOC Fault 0x40001500, Resetting
> KDM> > (da19:mps0:0:21:0): SCSI command timeout on device handle 0x001f SMID 172
> KDM> > (da19:mps0:0:21:0): SCSI command timeout on device handle 0x001f SMID 511
> KDM> > (da20:mps0:0:20:0): SCSI command timeout on device handle 0x001e SMID 240
> KDM> > 
> KDM> > ..
> KDM> > 
> KDM> > (da4:mps0:0:0:0): SCSI command timeout on device handle 0x000a SMID 844
> KDM> > (da22:mps0:0:23:0): SCSI command timeout on device handle 0x0021 SMID 713
> KDM> > (da18:mps0:0:22:0): SCSI command timeout on device handle 0x0020 SMID 603
> KDM> > 
> KDM> > and hangs there forever (in zio state).
> KDM> > 
> KDM> > I've prepared debugging kernel with DDB and would be glad to help catch the 
> KDM> > situation.
> KDM> 
> KDM> Hmm...
> KDM> 
> KDM> Can you send full dmesg output?
> 
> Attached

Thanks.

It looks like you have a SAS2008, with the 4.0 firmware.  I think it would
be worthwhile to upgrade to the 9.0 firmware.  I know for sure there are
issues with the 2.0 firmware, and I know the 9.0 firmware works fairly
well.  I don't know whether the 4.0 firmware has any severe issues, but it
would be good to eliminate firmware bugs before we chase driver issues.

> KDM>  What I'm most interested in is whether
> KDM> there is more kernel output before the IOC Fault that might shed some light
> KDM> on what is going on.
> 
> Nope. I use boot_verbose, but none of mps-related debug options yet

Okay.  If there's nothing before the IOC fault message, then we really
don't have any clues as to what caused the fault...

The rest is just fallout from the IOC fault.
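
For anyone who wants to check a saved console log for output preceding the
first fault, here is a minimal sketch in Python.  The function name, the
regular expression, and the 20-line context window are illustrative
assumptions based only on the message format quoted above; this is not part
of the mps driver or any FreeBSD tool.

```python
import re

# Pattern assumed from the quoted console output, e.g.
# "mps0: IOC Fault 0x40005900, Resetting"
FAULT_RE = re.compile(r"mps\d+: IOC Fault (0x[0-9a-fA-F]+)")


def context_before_first_fault(lines, context=20):
    """Return (fault_code, preceding_lines) for the first IOC Fault.

    preceding_lines holds up to `context` lines that appear immediately
    before the fault.  Returns (None, []) if no fault line is found.
    """
    for i, line in enumerate(lines):
        match = FAULT_RE.search(line)
        if match:
            return match.group(1), lines[max(0, i - context):i]
    return None, []


# Sample taken from the log excerpt quoted earlier in this message.
sample = [
    "[-- MARK -- Fri Apr 29 03:00:00 2011]",
    "mps0: IOC Fault 0x40005900, Resetting",
    "(pass20:mps0:0:22:0): SCSI command timeout on device handle 0x0020 SMID 442",
]

code, before = context_before_first_fault(sample)
print("first fault code:", code)
for line in before:
    print("before fault:", line)
```

With the excerpt above, the only line before the fault is the console MARK
timestamp, which matches the observation that nothing useful was logged
ahead of it.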

> KDM> 
> KDM> Also, what brand (LSI, Maxim, etc.) and speed (3Gb, 6Gb) is the expander on
> KDM> the backplane?
> 
> LSI 6G:

Okay.

> KDM> What model LSI controller do you have?  How many lanes are connected
> KDM> between the controller and the backplane?
> 
> 2x4 IIR. BTW, how can I investigate the real SAS topology?

So 8 lanes total?  That's what I wanted to know.  The primary thing I'm
getting at is to see how much lane contention we may have.

With 24 SATA disks, you can only talk to 8 at a time with 8 lanes connected
from the controller to the backplane.
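
As a rough illustration of that arithmetic, here is a small Python sketch.
It assumes the simple model described above (one active connection per
lane at a time); the function name is hypothetical and the result is a
back-of-the-envelope figure, not a measurement.

```python
def lane_contention(disks, lanes):
    """Return (max_concurrent_disks, disks_per_lane).

    Assumes each lane between the controller and the expander carries
    one connection at a time, so at most `lanes` disks can be talked to
    simultaneously.
    """
    if disks <= 0 or lanes <= 0:
        raise ValueError("disks and lanes must be positive")
    max_concurrent = min(disks, lanes)
    disks_per_lane = disks / lanes
    return max_concurrent, disks_per_lane


# Figures from this thread: 24 SATA disks behind 2 x 4 = 8 lanes.
concurrent, ratio = lane_contention(disks=24, lanes=8)
print(f"{concurrent} of 24 disks reachable at once, "
      f"{ratio:.1f} disks contending per lane")
```

Under heavy random load, such as a daily find across all the disks, each
lane is shared by three disks on average in this configuration.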

I've run into issues with a lot of contention with SATA drives, but that
was with a 3Gb Maxim expander.  In theory things should work better with an
LSI expander.  (You would think that they test scenarios like yours.)

> KDM> What model disks do you have in the system?  (dmesg will show that
> KDM> obviously.)
> 
> 24 x WD RE4 2T

Ok.  My SATA testing has been primarily with WD 2TB drives as well.

> KDM> Hopefully we can find some clues to point to the problem.
> 
> /me too ;)
> 
> Thank you very much!
> 
>  BTW, I have serial console, DDB kernel, so while this machine is in 
> production, the load is not too heavy, and I can spend some time in kernel 
> debugger if needed.

Well, I think the first thing to do is upgrade the firmware and see if that
fixes it.

If not, we'll start instrumenting things and see how much information we
can get about the cause of the fault.

Ken
-- 
Kenneth Merry
ken@FreeBSD.ORG


