From owner-freebsd-current  Fri Mar 24 13:17:19 2000
Delivered-To: freebsd-current@freebsd.org
Received: from lamb.sas.com (lamb.sas.com [192.35.83.8])
	by hub.freebsd.org (Postfix) with ESMTP
	id 5F53537B818; Fri, 24 Mar 2000 13:16:51 -0800 (PST)
	(envelope-from brdean@unx.sas.com)
Received: from mozart (mozart.unx.sas.com [149.173.6.8])
	by lamb.sas.com (8.9.3/8.9.1) with SMTP id QAA10261;
	Fri, 24 Mar 2000 16:16:44 -0500 (EST)
Received: from dean.pc.sas.com by mozart (5.65c/SAS/Domains/5-6-90)
	id AA16153; Fri, 24 Mar 2000 16:16:13 -0500
Received: (from brdean@localhost)
	by dean.pc.sas.com (8.9.3/8.9.1) id QAA36563;
	Fri, 24 Mar 2000 16:16:13 -0500 (EST)
	(envelope-from brdean)
From: Brian Dean <brdean@unx.sas.com>
Message-Id: <200003242116.QAA36563@dean.pc.sas.com>
Subject: Re: AMI MegaRAID lockup? not accepting commands.
In-Reply-To: <200003241954.LAA01357@mass.cdrom.com> from Mike Smith at "Mar 24,
 2000 11:54:32 am"
To: Mike Smith <msmith@FreeBSD.ORG>
Date: Fri, 24 Mar 2000 16:16:13 -0500 (EST)
Cc: mw@kpnqwest.ch, freebsd-current@FreeBSD.ORG
X-Mailer: ELM [version 2.4ME+ PL61 (25)]
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-current@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Mike Smith wrote:
> > Just recently (this evening), I was able to get our controller to lock
> > up with the latest patch.  Previously, with that patch installed, I
> > must not have been able to tickle the bug just right, and I believe
> > that Mike based his decision to make that mod based on my lack of a
> > lockup, which always happened quickly.  That's what made me think that
> > we'd solved it, but I guess I just got "lucky" on the previous lockups
> > that happened very quickly, making me think it was more easily
> > reproducible than it actually is.
> 
> I'm not entirely sure about that; I think there are probably several sets 
> of problems here.
> 
> Can you be more specific about "locking up" though?  The "controller 
> wedged" bug is almost certainly not the same as the "lost interrupt" bug.

Here's a snippet of the messages from my syslog file:

[...]
Mar 24 12:35:19 cvsstage /kernel: amr0: controller wedged (not taking commands)
Mar 24 12:35:19 cvsstage /kernel: amr0: I/O error - dead
Mar 24 12:35:19 cvsstage /kernel: amr0: cmd 2  ident 17  drive 0
Mar 24 12:35:19 cvsstage /kernel: amr0: blkcount 12  lba 129695792
Mar 24 12:35:19 cvsstage /kernel: amr0: virtaddr 0xd1a5d000  length 6144
Mar 24 12:35:19 cvsstage /kernel: amr0: physaddr 00007800  nsg 2
Mar 24 12:35:19 cvsstage /kernel: amr0:   1a11e000/4096
Mar 24 12:35:19 cvsstage /kernel: amr0:   1993f000/2048
Mar 24 12:35:19 cvsstage /kernel: amr0: controller wedged (not taking commands)
Mar 24 12:35:19 cvsstage /kernel: amr0: I/O error - dead
Mar 24 12:35:19 cvsstage /kernel: amr0: cmd 2  ident 17  drive 0
Mar 24 12:35:19 cvsstage /kernel: amr0: blkcount 12  lba 129826864
Mar 24 12:35:19 cvsstage /kernel: amr0: virtaddr 0xd1a4d000  length 6144
Mar 24 12:35:19 cvsstage /kernel: amr0: physaddr 00007800  nsg 2
Mar 24 12:35:19 cvsstage /kernel: amr0:   71ce000/4096
Mar 24 12:35:19 cvsstage /kernel: amr0:   402f000/2048
Mar 24 12:35:19 cvsstage /kernel: amr0: controller wedged (not taking commands)
Mar 24 12:35:19 cvsstage /kernel: amr0: I/O error - dead
Mar 24 12:35:19 cvsstage /kernel: amr0: cmd 2  ident 17  drive 0
Mar 24 12:35:19 cvsstage /kernel: amr0: blkcount 12  lba 129630256
Mar 24 12:35:19 cvsstage /kernel: amr0: virtaddr 0xd1a3d000  length 6144
Mar 24 12:35:19 cvsstage /kernel: amr0: physaddr 00007800  nsg 2
Mar 24 12:35:19 cvsstage /kernel: amr0:   1befe000/4096
Mar 24 12:35:19 cvsstage /kernel: amr0:   1869f000/2048
[...]
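
A quick consistency check on the dump above may be useful when reading
it.  Assuming 512-byte sectors (an assumption; the log does not say),
blkcount 12 works out to 6144 bytes, which matches both the logged
length and the sum of the two scatter/gather entries (4096 + 2048),
and nsg 2 matches the number of entries listed.  The sketch below
does that arithmetic for the first command; check_command and
SECTOR_SIZE are hypothetical names, not anything from the driver.

```python
import re

SECTOR_SIZE = 512  # assumption: 512-byte sectors; not stated in the log

# First command from the syslog snippet, with the syslog prefix trimmed.
SAMPLE = """\
amr0: cmd 2  ident 17  drive 0
amr0: blkcount 12  lba 129695792
amr0: virtaddr 0xd1a5d000  length 6144
amr0: physaddr 00007800  nsg 2
amr0:   1a11e000/4096
amr0:   1993f000/2048
"""

def check_command(dump):
    """Return (blkcount bytes, length, s/g byte total, nsg, s/g entries)."""
    blkcount = int(re.search(r"blkcount (\d+)", dump).group(1))
    length = int(re.search(r"length (\d+)", dump).group(1))
    nsg = int(re.search(r"nsg (\d+)", dump).group(1))
    # Scatter/gather entries are logged as "<hex address>/<byte count>".
    sg = [int(n) for n in
          re.findall(r"amr0:\s+[0-9a-f]+/(\d+)\s*$", dump, re.M)]
    return blkcount * SECTOR_SIZE, length, sum(sg), nsg, len(sg)

if __name__ == "__main__":
    print(check_command(SAMPLE))
```

If the three byte counts agree and nsg equals the number of entries,
the request itself looks well-formed, which would point away from a
malformed scatter/gather list as the trigger for that command.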

In a separate lockup, there are no messages to syslog, but all
accesses to the card are hung.  A ps shows my 26 bonnie processes are
in either 'wswbuf0' or 'biord' (going from memory here, I may not
have the exact state text correct).  This is the one I believe we are
calling the "lost interrupt" bug.

I'm running a patched 3/13 on this machine, and I can't readily do a
full cvs update on it.  I believe that 3/13 was before Poul made his
B_READ changes, so I did not incorporate Poul's 1.8 revision of amr.c
(because I assume it would be incorrect to do so without getting all of
his changes throughout the rest of the kernel).  However, I did get
all of your changes at 1.9.

I also incorporated Markus' patch, except that I set maxio to 253
instead of Markus' 127 or the 254 that the card reports (thinking
there might be an off-by-one issue, i.e., 254 available, 0-253).  It
is this kernel that produced the messages above.  Just for sanity's
sake, I'll try Markus' maxio of 127 and verify whether or not my 26
simultaneous bonnie processes can finish without locking it up.
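
To make the off-by-one reasoning concrete, here is a toy model, not
taken from amr.c.  It assumes the card has 254 command slots numbered
0-253 and compares a loop bound that treats maxio as a count with one
that treats it as the highest identifier.  The function names and the
idea that the driver might use an inclusive bound are purely
hypothetical.

```python
REPORTED_MAXIO = 254  # value the card reports, per the message above

def idents_handed_out(maxio, inclusive_bound):
    """Identifiers a hypothetical driver loop would use for a given maxio.

    inclusive_bound=False models `for (i = 0; i < maxio; i++)`;
    inclusive_bound=True models an off-by-one `i <= maxio`.
    """
    return list(range(maxio + 1 if inclusive_bound else maxio))

def overruns_card(maxio, inclusive_bound, card_slots=REPORTED_MAXIO):
    """True if the loop would use an identifier the card has no slot for."""
    return max(idents_handed_out(maxio, inclusive_bound)) >= card_slots

if __name__ == "__main__":
    for maxio in (254, 253, 127):
        print(maxio, overruns_card(maxio, inclusive_bound=True))
```

In this model only maxio 254 with an inclusive bound steps outside the
card's range; 253 and 127 both stay inside it.  Since the kernel with
253 still wedged, a simple off-by-one of this kind may not be the
whole story.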

I agree that we are probably chasing more than one problem.  Also, I
don't necessarily think you should back out the "volatile" change;
even though it did not fix this problem, I think it should still be
there.

> > It sounds like Markus may be onto something.
> 
> I'm somewhat corralled here today, but I might get some time to apply his 
> suggestions on Monday, especially if you're happy it works for you as 
> well.

If scaling back the number of outstanding I/O requests hides or avoids
the problem, we may do that here as a temporary fix, especially if we
can still get good performance.  We need to get this machine into
production soon.  Ultimately, I'd like to get another card that we can
experiment with a bit more so that we can diagnose the real cause,
and then be able to run the card at full steam.

I am still able to work on this, though, at least for a few days.  One
area I thought about spending some time on is where you track whether
the card has interrupts enabled and, based on that, either issue
commands expecting an interrupt back or use polled mode.  I was going
to review that part of the code next, on the theory that the software
state may have gotten out of sync with reality at some point.  Also,
I'm open to other suggestions if you think there's a more productive
area I should spend time on.
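
The out-of-sync idea can be illustrated with a toy model.  This is not
the driver's logic; ToyController and its fields are hypothetical, and
the only point is to show how a mismatch between what the driver
believes and what the card is doing could leave a command waiting
forever without any message being logged.

```python
class ToyController:
    """Toy model of a driver that picks polled or interrupt completion."""

    def __init__(self):
        self.hw_intr_enabled = False  # what the card is actually doing
        self.sw_intr_enabled = False  # what the driver believes

    def enable_interrupts(self, hw_takes_effect=True):
        """Record interrupts as enabled; optionally model the card missing it."""
        self.sw_intr_enabled = True
        if hw_takes_effect:
            self.hw_intr_enabled = True

    def submit(self):
        """Return how a command completes: 'polled', 'interrupt', or 'hung'."""
        if not self.sw_intr_enabled:
            return "polled"      # driver checks for completion itself
        if self.hw_intr_enabled:
            return "interrupt"   # card signals completion
        return "hung"            # driver sleeps; no interrupt ever arrives

if __name__ == "__main__":
    c = ToyController()
    c.enable_interrupts(hw_takes_effect=False)
    print(c.submit())
```

In this sketch the "hung" case produces no error at all, which is at
least consistent with the silent hang described earlier, though it
does not show that this is what is actually happening.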

Thanks for your help on this.  This card has a lot of promise, and the
driver seems to be "so close".  The fix is probably really simple;
it's just eluding us for the moment.

-Brian

