Date: Sat, 29 Sep 2007 21:18:17 -0400
From: "Benjie Chen" <benjie@addgene.org>
To: "Sean McAfee" <smcafee@collaborativefusion.com>
Cc: freebsd-hardware@freebsd.org
Subject: Re: PERC5 (LSI MegaSAS) Patrol Read crashes
Message-ID: <c53be070709291818u5b7b81d7l5ac6f318336f2101@mail.gmail.com>
In-Reply-To: <46FD6E94.2080608@collaborativefusion.com>
References: <46FD6E94.2080608@collaborativefusion.com>
I can confirm this problem on a PE1950 with the 6.2 i386 kernel as well.
A manual patrol read started by megacli crashes the system.

Thanks,
Benjie

On 9/28/07, Sean McAfee <smcafee@collaborativefusion.com> wrote:
>
> We first became aware of this problem about a month ago. A database
> server was up but was completely unresponsive to anything other than
> pings. I power cycled it via the DRAC and after we couldn't find
> anything suspicious in the logs, we figured it was a fluke.
>
> Until the next day, when its twin did the exact same thing. This time,
> I was able to get a screenshot through the DRAC console. Using old
> daily outputs and that screenshot, we correlated the crashes to patrol
> reads. Since then, we've only seen it "in the wild" on one other
> machine, a 1950, but I've been trying to chase the problem down without
> much luck.
>
> I'm fortunate to have three machines at my disposal for this testing, so
> I was able to try a variety of combinations:
>
> Server 1:
> Chassis: 2950 v1
> System BIOS: 1.1.0
> PERC firmware: 1.00.01-0088 PERC F/W (from the 5.0.1-0030 A00 package)
> OS: 6.2-R_p7, 6-STABLE
>
> Server 2:
> Chassis: 2950 v1
> System BIOS: 1.1.0
> PERC firmware: 1.03.10-0216 PERC F/W (from the 5.1.1-0040 package)
> OS: 6.2-R_p7, 6-STABLE
>
> Server 3:
> Chassis: 2950 v2
> System BIOS: 1.5.1
> PERC firmware: 1.03.10-0216 PERC F/W (from the 5.1.1-0040 package)
> OS: 6.2-R_p7
>
> They're all running amd64 and each combination was tried with and
> without the linux_mfi.ko patches found in PR-113232. For disks, they all
> have 2x36GB RAID1 and 4x73GB RAID10 (all SAS). We use
> linux_mfi.ko+linux-megacli for management.
>
> The original problem occurred during automatic patrol reads coupled with
> heavy disk load. I've changed the delay interval for the automatic
> patrol reads and tried to reproduce it, but haven't had enough success to
> make it useful for troubleshooting. Since the automatic reads are meant
> to be as unaggressive as possible, I've been running a manual patrol
> read (megacli -AdpPR -Start -a0), which triggers a crash regardless
> of what I/O is like.
>
> The behavior has little to no variation; shortly after the read is
> started, disk writes immediately cease (shown via an scp from another
> machine). After a minute, the console will begin to fill up with lines
> such as:
>
> mfi0: COMMAND 0xffffffff892bc998 TIMEOUT AFTER 45 SECONDS
>
> The first 8 digits of the hex value never change - I bring that up
> because I suspect the problem has something to do with the enclosure,
> which is attached at 8, 255, or fffffff, depending on where you're
> looking.
>
> I've let it go up to 6000 seconds, but it eventually ends in a kernel
> panic. That just seems to be a side effect of the original problem
> (processes with nowhere to write data), so I'm not too hung up on that.
>
> There's never anything pertaining to it in the controller's event log.
>
> Besides the platform version differences I mentioned above, I've tried:
> - Reducing the patrol read rate
> - Pulling down and modifying the patches from PR-115133 (which seems to
>   set an upper boundary at 0xffffffff)
> - Invoking a0/aALL interchangeably
> - Changing the cache flush interval
> - Disabling disk coercion
> - A bunch of other long-shot settings from megacli that aren't worth
>   listing
>
> Nothing has shown any appreciable difference in the behavior.
>
> Does anyone have an idea about what could be going on or anything else
> we can try? For now, I'll probably just disable the patrol reads and set
> them to auto/1 hour delay during outage windows only, but I'm hoping that
> someone is able to help with this. At the very least, maybe I can save
> someone a whole bunch of time.
>
> Thanks in advance for any help.
>
> --
> Sean McAfee
> Collaborative Fusion, Inc.
> smcafee@collaborativefusion.com
> 412-422-3463 x 4025
>
> 1710 Murray Avenue, Suite 320
> Pittsburgh, PA 15217

--
Benjie Chen, Ph.D.
Addgene, a better way to share plasmids
www.addgene.org
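
For anyone trying to correlate these hangs from a saved console capture, a small script can summarize the timeout lines. What follows is only a rough sketch in Python. The function name summarize_timeouts and the sample input are made up for illustration, and the line format is assumed to match the single example quoted in the message above; real console output may differ.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the one example quoted in the message:
#   <device>: COMMAND 0x<hex address> TIMEOUT AFTER <n> SECONDS
# The device prefix is matched generically rather than hard-coded.
TIMEOUT_RE = re.compile(
    r"(?P<dev>\w+\d+): COMMAND 0x(?P<addr>[0-9a-fA-F]+) "
    r"TIMEOUT AFTER (?P<secs>\d+) SECONDS"
)


def summarize_timeouts(lines):
    """Return {(device, command address): longest timeout seen, in seconds}."""
    worst = defaultdict(int)
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match is None:
            continue  # not a timeout line; ignore it
        key = (match.group("dev"), match.group("addr").lower())
        worst[key] = max(worst[key], int(match.group("secs")))
    return dict(worst)


if __name__ == "__main__":
    # Hypothetical capture: the first line mirrors the quoted example,
    # the others are invented to exercise the parser.
    sample = [
        "mfi0: COMMAND 0xffffffff892bc998 TIMEOUT AFTER 45 SECONDS",
        "mfi0: COMMAND 0xffffffff892bc998 TIMEOUT AFTER 75 SECONDS",
        "unrelated console output",
    ]
    for (dev, addr), secs in summarize_timeouts(sample).items():
        print(f"{dev} command 0x{addr}: timed out for at least {secs}s")
```

Grouping by command address shows at a glance whether a single command is stuck for the whole hang or whether several different commands are timing out.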