From: "John W. DeBoskey"
To: Mike Smith
Cc: freebsd-current@freebsd.org, Brad Chisholm
Subject: Re: AMI MegaRAID lockup? not accepting commands.
Date: Mon, 20 Mar 2000 21:55:27 -0500 (EST)
Message-Id: <200003210255.VAA24932@bb01f39.unx.sas.com>
In-Reply-To: <200003210146.RAA15576@mass.cdrom.com> from Mike Smith at "Mar 20, 2000 05:46:50 pm"

Hi,

   The controller is new. Dell calls it a Perc2/dc and it has 128Meg
of memory installed in it. I'm not sitting in front of the machine
right now. More detailed information is available when the machine
is booted and you enter the BIOS setup on the adapter card.

> > We have a system with a new AMI card in it controlling a pair
> > of shelves from Dell (fbsd dated: 4.0-20000313-SNAP).
> >
> > The relevant dmesg output is below: (complete dmesg at end)
> >
> > amr0: mem 0xf6c00000-0xf6ffffff irq 14 at device 10.1 on pci2
> > amr0: firmware 1.01 bios 1p00 128MB memory
> > amrd0: on amr0
> > amrd0: 172780MB (353853440 sectors) RAID 5 (optimal)
> >
> > The adapter does not lockup while testing with bonnie and such.
>
> Try running 20 or so bonnie processes in parallel; I can usually get it
> to lock up with this configuration. I'm wondering which controller
> you've got there though - I don't recognise the BIOS/firmware versions.
>
> > However, we have a 50Gig CVS repository sitting on the raid
> > volume. When we do a 'cvs co' of -HEAD, it causes it to lockup.
> > The following messages are repeating continuously:
> >
> > Mar 19 16:02:59 cvs /kernel: amr0: controller wedged (not taking commands)
>
> I'm not sure why this happens; the controller isn't coming ready even
> though we haven't hit any sort of limit that we're aware of. I've been
> considering some workarounds involving deferring the command until the
> controller gives us back an interrupt, but I'm still surprised that we
> get to this point at all.

Well, we've been playing around in amr.c/amr_start in the following
code sequence:

    /* spin waiting for the mailbox */
    debug("wait for mailbox");
    for (i = 10000, done = 0, worked = 0; (i > 0) && !done; i--) {
        s = splbio();

        /* is the mailbox free? */
        if (sc->amr_mailbox->mb_busy == 0) {
            debug("got mailbox");
            sc->amr_mailbox64->mb64_segment = 0;
            bcopy(&ac->ac_mailbox, sc->amr_mailbox, AMR_MBOX_CMDSIZE);
            sc->amr_submit_command(sc);
            done = 1;
            sc->amr_workcount++;
            TAILQ_INSERT_TAIL(&sc->amr_work, ac, ac_link);

        /* not free, try to clean up while we wait */
        } else {
-->>        printf("%s: busy flag %x\n", __FUNCTION__, sc->amr_mailbox->mb_busy);
            debug("busy flag %x\n", sc->amr_mailbox->mb_busy);
            worked = amr_done(sc);
        }
        splx(s);
    }

Note the addition of the printf statement in the else clause.

Two interesting things happen. One, we are unable to cause the
controller to lock up.
Two, the following messages show up in syslog:

Mar 20 12:55:15 cvsstage /kernel: amr_start: busy flag 1
Mar 20 12:55:46 cvsstage last message repeated 1057 times
Mar 20 12:57:47 cvsstage last message repeated 5574 times
Mar 20 12:59:26 cvsstage last message repeated 5431 times
Mar 20 12:59:26 cvsstage /kernel: amr_start: busy flag 0

If I understand the sequence correctly, we enter splbio() and then
check the mailbox. Most of the time, we take the else clause and the
busy flag is 1, as it should be. However, once every 10 to 12 thousand
loops, mb_busy tests nonzero in the if, but by the time the printf in
the else clause reads it, it's 0.

I wonder if there is some sort of timing issue, since the addition of
the printf allows the card to operate correctly. I haven't traced the
kernel printf code, but it could change the spl level, thus allowing
the mb_busy flag to be modified.

Comments?

>
> Unfortunately, I'm not able to spend any time on this at the moment; if
> someone wants to do a little experimenting I'd be very happy to talk them
> through what I think should be done (will require some programming
> ability).

We're more than willing to try. Just point us in the right direction.

> --
> \\ Give a man a fish, and you feed him for a day. \\ Mike Smith
> \\ Tell him he should learn how to fish himself, \\ msmith@freebsd.org
> \\ and he'll hate you for a lifetime. \\ msmith@cdrom.com

-John
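
The "busy flag 0" line logged from the else clause is consistent with the code as posted: mb_busy is read once in the if test and read again as the printf argument, so the two reads can disagree if the flag changes in between. Below is a minimal user-space sketch of taking a single snapshot of the flag and using it for both the decision and the log value. It assumes mb_busy lives in memory the controller can update on its own, in which case raising the spl level would not freeze it. The names fake_mailbox, poll_result, poll_mailbox_once, spin_for_mailbox and the release_after parameter are hypothetical and are not part of the amr driver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared mailbox; not the real amr layout. */
struct fake_mailbox {
    volatile unsigned char mb_busy;   /* cleared by the "controller" */
};

/* Outcome of one polling attempt. */
struct poll_result {
    int submitted;            /* 1 if the mailbox was free and we took it */
    unsigned char seen_busy;  /* the single value that drove the decision */
};

/*
 * Read mb_busy exactly once and base both the branch and any logging on
 * that snapshot, so the reported value always matches the path taken.
 */
static struct poll_result
poll_mailbox_once(struct fake_mailbox *mb)
{
    struct poll_result r;

    assert(mb != NULL);
    r.seen_busy = mb->mb_busy;        /* single read of the shared flag */
    r.submitted = (r.seen_busy == 0);
    if (r.submitted)
        mb->mb_busy = 1;              /* simulate handing a command over */
    return r;
}

/*
 * Spin up to 'tries' times. Returns the number of busy polls seen before
 * the command was submitted, or -1 if the mailbox never came free.
 * 'release_after' simulates the controller clearing the flag after that
 * many busy polls (illustration only; 0 means it never clears).
 */
static int
spin_for_mailbox(struct fake_mailbox *mb, int tries, int release_after)
{
    int busy_polls = 0;

    for (int i = 0; i < tries; i++) {
        struct poll_result r = poll_mailbox_once(mb);

        if (r.submitted)
            return busy_polls;
        busy_polls++;                 /* r.seen_busy is what would be logged */
        if (busy_polls == release_after)
            mb->mb_busy = 0;          /* "controller" frees the mailbox */
    }
    return -1;
}
```

Calling spin_for_mailbox(&mb, 10000, 3) on a mailbox that starts busy mirrors the 10000-iteration loop in amr_start. This only makes the logged value trustworthy; it says nothing about why the printf delay keeps the controller from wedging.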