From owner-freebsd-current Fri Mar 24 13:17:19 2000
Delivered-To: freebsd-current@freebsd.org
Received: from lamb.sas.com (lamb.sas.com [192.35.83.8])
	by hub.freebsd.org (Postfix) with ESMTP
	id 5F53537B818; Fri, 24 Mar 2000 13:16:51 -0800 (PST)
	(envelope-from brdean@unx.sas.com)
Received: from mozart (mozart.unx.sas.com [149.173.6.8])
	by lamb.sas.com (8.9.3/8.9.1) with SMTP id QAA10261;
	Fri, 24 Mar 2000 16:16:44 -0500 (EST)
Received: from dean.pc.sas.com by mozart (5.65c/SAS/Domains/5-6-90)
	id AA16153; Fri, 24 Mar 2000 16:16:13 -0500
Received: (from brdean@localhost)
	by dean.pc.sas.com (8.9.3/8.9.1) id QAA36563;
	Fri, 24 Mar 2000 16:16:13 -0500 (EST)
	(envelope-from brdean)
From: Brian Dean
Message-Id: <200003242116.QAA36563@dean.pc.sas.com>
Subject: Re: AMI MegaRAID lockup? not accepting commands.
In-Reply-To: <200003241954.LAA01357@mass.cdrom.com> from Mike Smith at "Mar 24, 2000 11:54:32 am"
To: Mike Smith
Date: Fri, 24 Mar 2000 16:16:13 -0500 (EST)
Cc: mw@kpnqwest.ch, freebsd-current@FreeBSD.ORG
X-Mailer: ELM [version 2.4ME+ PL61 (25)]
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-current@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Mike Smith wrote:
> > Just recently (this evening), I was able to get our controller to lock
> > up with the latest patch.  Previously, with that patch installed, I
> > must not have been able to tickle the bug just right, and I believe
> > that Mike based his decision to make that mod based on my lack of a
> > lockup, which always happened quickly.  That's what made me think that
> > we'd solved it, but I guess I just got "lucky" on the previous lockups
> > that happened very quickly, making me think it was more easily
> > reproducible than it actually is.
>
> I'm not entirely sure about that; I think there are probably several sets
> of problems here.
>
> Can you be more specific about "locking up" though?  The "controller
> wedged" bug is almost certainly not the same as the "lost interrupt" bug.

Here's a snippet of the messages from my syslog file:

[...]
Mar 24 12:35:19 cvsstage /kernel: amr0: controller wedged (not taking commands)
Mar 24 12:35:19 cvsstage /kernel: amr0: I/O error - dead
Mar 24 12:35:19 cvsstage /kernel: amr0: cmd 2 ident 17 drive 0
Mar 24 12:35:19 cvsstage /kernel: amr0: blkcount 12 lba 129695792
Mar 24 12:35:19 cvsstage /kernel: amr0: virtaddr 0xd1a5d000 length 6144
Mar 24 12:35:19 cvsstage /kernel: amr0: physaddr 00007800 nsg 2
Mar 24 12:35:19 cvsstage /kernel: amr0: 1a11e000/4096
Mar 24 12:35:19 cvsstage /kernel: amr0: 1993f000/2048
Mar 24 12:35:19 cvsstage /kernel: amr0: controller wedged (not taking commands)
Mar 24 12:35:19 cvsstage /kernel: amr0: I/O error - dead
Mar 24 12:35:19 cvsstage /kernel: amr0: cmd 2 ident 17 drive 0
Mar 24 12:35:19 cvsstage /kernel: amr0: blkcount 12 lba 129826864
Mar 24 12:35:19 cvsstage /kernel: amr0: virtaddr 0xd1a4d000 length 6144
Mar 24 12:35:19 cvsstage /kernel: amr0: physaddr 00007800 nsg 2
Mar 24 12:35:19 cvsstage /kernel: amr0: 71ce000/4096
Mar 24 12:35:19 cvsstage /kernel: amr0: 402f000/2048
Mar 24 12:35:19 cvsstage /kernel: amr0: controller wedged (not taking commands)
Mar 24 12:35:19 cvsstage /kernel: amr0: I/O error - dead
Mar 24 12:35:19 cvsstage /kernel: amr0: cmd 2 ident 17 drive 0
Mar 24 12:35:19 cvsstage /kernel: amr0: blkcount 12 lba 129630256
Mar 24 12:35:19 cvsstage /kernel: amr0: virtaddr 0xd1a3d000 length 6144
Mar 24 12:35:19 cvsstage /kernel: amr0: physaddr 00007800 nsg 2
Mar 24 12:35:19 cvsstage /kernel: amr0: 1befe000/4096
Mar 24 12:35:19 cvsstage /kernel: amr0: 1869f000/2048
[...]

In a separate lockup, nothing is written to syslog, but all accesses
to the card hang.  A ps shows my 26 bonnie processes in either
'wswbuf0' or 'biord' (going from memory here; I may not have the exact
state text correct).  This is the one I believe we are calling the
"lost interrupt" bug.
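For comparing the failed commands across the "controller wedged"
reports above, a small script along the following lines might help.
This is only a sketch: the regular expressions are guessed from the
syslog lines quoted in this message, not taken from the driver's
actual printf formats, and the names parse_amr_failures and SAMPLE are
made up for illustration.

```python
import re

# Patterns guessed from the quoted syslog lines (an assumption, not the
# amr driver's real output formats).
CMD_RE = re.compile(r"amr0: cmd (\d+)\s+ident (\d+)\s+drive (\d+)")
BLK_RE = re.compile(r"amr0: blkcount (\d+)\s+lba (\d+)")
LEN_RE = re.compile(r"amr0: virtaddr (0x[0-9a-f]+)\s+length (\d+)")
SG_RE = re.compile(r"amr0:\s+([0-9a-f]+)/(\d+)\s*$")

# First report from the snippet above, copied verbatim.
SAMPLE = """\
Mar 24 12:35:19 cvsstage /kernel: amr0: controller wedged (not taking commands)
Mar 24 12:35:19 cvsstage /kernel: amr0: I/O error - dead
Mar 24 12:35:19 cvsstage /kernel: amr0: cmd 2 ident 17 drive 0
Mar 24 12:35:19 cvsstage /kernel: amr0: blkcount 12 lba 129695792
Mar 24 12:35:19 cvsstage /kernel: amr0: virtaddr 0xd1a5d000 length 6144
Mar 24 12:35:19 cvsstage /kernel: amr0: physaddr 00007800 nsg 2
Mar 24 12:35:19 cvsstage /kernel: amr0: 1a11e000/4096
Mar 24 12:35:19 cvsstage /kernel: amr0: 1993f000/2048
"""


def parse_amr_failures(lines):
    """Group syslog lines into one dict per 'controller wedged' report."""
    failures = []
    current = None
    for line in lines:
        if "controller wedged" in line:
            # Each "wedged" line starts a new report.
            current = {"sg": []}
            failures.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first report
        m = CMD_RE.search(line)
        if m:
            current["cmd"], current["ident"], current["drive"] = map(int, m.groups())
            continue
        m = BLK_RE.search(line)
        if m:
            current["blkcount"], current["lba"] = map(int, m.groups())
            continue
        m = LEN_RE.search(line)
        if m:
            current["length"] = int(m.group(2))
            continue
        m = SG_RE.search(line)
        if m:
            # Scatter/gather entry: (address as hex string, byte count).
            current["sg"].append((m.group(1), int(m.group(2))))
    return failures


for rec in parse_amr_failures(SAMPLE.splitlines()):
    sg_total = sum(size for _, size in rec["sg"])
    print(rec["lba"], rec["length"], sg_total)
```

Fed the whole log, this makes it easy to see whether every failed
command has the same ident and whether the scatter/gather sizes always
add up to the reported length.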

I'm running a patched 3/13 on this machine, on which I can't readily
do a full cvs update.  I believe that 3/13 was before Poul made his
B_READ changes, so I did not incorporate Poul's 1.8 revision of amr.c
(because I assume it would be incorrect to do so without getting all
of his changes throughout the rest of the kernel).  However, I did get
all of your changes at 1.9.  I also incorporated Markus' patch, except
that I set maxio to 253 instead of Markus' 127 or the 254 that the
card reports (thinking that there might be an off-by-one issue, i.e.,
254 available, numbered 0-253).  It is this kernel that produced the
messages above.

Just for sanity's sake, I'll try Markus' maxio of 127 and verify
whether or not my 26 simultaneous bonnie processes can finish without
locking it up.

I agree that we are probably chasing more than one problem.  Also, I
don't necessarily think you should back out the "volatile" change;
even though it did not fix this problem, I think it should still be
there.

> > It sounds like Markus may be onto something.
>
> I'm somewhat corralled here today, but I might get some time to apply his
> suggestions on Monday, especially if you're happy it works for you as
> well.

What we're thinking about here is this: if scaling back the number of
outstanding I/O requests hides or avoids the problem, we may do that
as a temporary fix, especially if we can still get good performance.
We need to get this machine into production soon.  Ultimately, I'd
like to get another card that we can experiment with a bit more, so
that we can diagnose the real cause and then run the card at full
steam.

I am still able to work on this, though, at least for a few days.  One
area I thought about spending some time on is where you track whether
the card has interrupts enabled and, based on that, either issue
commands with the expectation of getting an interrupt back or use
polled mode.
The next thing I was going to do was review that part of the code, on
the theory that the software state might have gotten out of sync with
reality at some point.  Also, I'm open to other suggestions if you
think there's a more productive area I should spend time on.

Thanks for your help on this.  This card has a lot of promise, and the
driver seems to be "so close".  The fix is probably really simple;
it's just eluding us for the moment.

-Brian
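
To make the temporary workaround mentioned above concrete (capping
the number of outstanding I/O requests below what the card
advertises), here is a toy model of the idea.  It is written in Python
purely for illustration and is not driver code; the class and method
names (ThrottledQueue, submit, complete) are invented, and the real
amr driver has its own queueing that this does not attempt to
reproduce.

```python
from collections import deque


class ThrottledQueue:
    """Toy model: never allow more than max_inflight commands outstanding."""

    def __init__(self, max_inflight):
        if max_inflight < 1:
            raise ValueError("max_inflight must be at least 1")
        self.max_inflight = max_inflight
        self.waiting = deque()   # commands accepted but not yet issued
        self.inflight = set()    # commands handed to the (imaginary) card

    def submit(self, ident):
        """Accept a command; it is issued only if there is room."""
        self.waiting.append(ident)
        self._start_ready()

    def complete(self, ident):
        """Mark a command finished, which may let a waiting one start."""
        self.inflight.remove(ident)
        self._start_ready()

    def _start_ready(self):
        # Issue waiting commands in FIFO order until the cap is reached.
        while self.waiting and len(self.inflight) < self.max_inflight:
            self.inflight.add(self.waiting.popleft())


# Hypothetical usage: a cap of 2 with three commands submitted.
q = ThrottledQueue(max_inflight=2)
for ident in (0, 1, 2):
    q.submit(ident)
print(sorted(q.inflight), list(q.waiting))
q.complete(0)
print(sorted(q.inflight), list(q.waiting))
```

The point is only that the cap is enforced in one place, so trying
127, 253, or 254 is a one-number change while the rest of the logic
stays the same.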