From: "Benjie Chen"
To: "Sean McAfee"
Cc: freebsd-hardware@freebsd.org
Date: Sat, 29 Sep 2007 21:18:17 -0400
Subject: Re: PERC5 (LSI MegaSAS) Patrol Read crashes
In-Reply-To: <46FD6E94.2080608@collaborativefusion.com>
List-Id: General discussion of FreeBSD hardware

I can confirm this problem on a PE1950 with the 6.2 i386 kernel as well.
A manual patrol read started by megacli crashes the system.

Thanks,
Benjie

On 9/28/07, Sean McAfee wrote:
>
> We first became aware of this problem about a month ago. A database
> server was up but was completely unresponsive to anything other than
> pings. I power cycled it via the DRAC and after we couldn't find
> anything suspicious in the logs, we figured it was a fluke.
>
> Until the next day, when its twin did the exact same thing. This time,
> I was able to get a screenshot through the DRAC console. Using old
> daily outputs and that screenshot, we correlated the crashes to patrol
> reads. Since then, we've only seen it "in the wild" on one other
> machine, a 1950, but I've been trying to chase the problem down without
> much luck.
>
> I'm fortunate to have three machines at my disposal for this testing, so
> I was able to try a variety of combinations:
>
> Server 1:
> Chassis: 2950 v1
> System BIOS: 1.1.0
> PERC firmware: 1.00.01-0088 PERC F/W (from the 5.0.1-0030 A00 package)
> OS: 6.2-R_p7, 6-STABLE
>
> Server 2:
> Chassis: 2950 v1
> System BIOS: 1.1.0
> PERC firmware: 1.03.10-0216 PERC F/W (from the 5.1.1-0040 package)
> OS: 6.2-R_p7, 6-STABLE
>
> Server 3:
> Chassis: 2950 v2
> System BIOS: 1.5.1
> PERC firmware: 1.03.10-0216 PERC F/W (from the 5.1.1-0040 package)
> OS: 6.2-R_p7
>
> They're all running amd64 and each combination was tried with and
> without the linux_mfi.ko patches found in PR-113232. For disks, they all
> have 2x36GB RAID1, 4x73GB RAID10 (all SAS). We use
> linux_mfi.ko+linux-megacli for management.
>
> The original problem occurred during automatic patrol reads coupled with
> heavy disk load. I've changed the delay interval for the automatic
> patrol reads and tried to reproduce it but haven't had enough success to
> make it useful for troubleshooting. Since the automatic reads are meant
> to be as unaggressive as possible, I've been running a manual patrol
> read (megacli -AdpPR -Start -a0), which triggers a crash regardless
> of what I/O is like.
>
> The behavior has little to no variation; shortly after the read is
> started, disk writes immediately cease (shown via an scp from another
> machine).
> After a minute, the console will begin to fill up with lines
> such as:
>
> mfi0: COMMAND 0xffffffff892bc998 TIMEOUT AFTER 45 SECONDS
>
> The first 8 digits of the hex value never change - I bring that up
> because I suspect the problem has something to do with the enclosure,
> which is attached at 8, 255, or fffffff, depending on where you're
> looking.
>
> I've let it go up to 6000 seconds, but it eventually ends in a kernel
> panic. That just seems to be a side effect of the original problem
> (processes with nowhere to write data), so I'm not too hung up on that.
>
> There's never anything pertaining to it in the controller's event log.
>
> Besides the platform version differences I mentioned above, I've tried:
> - Reducing the patrol read rate
> - Pulling down and modifying the patches from PR-115133 (which seems to
>   set an upper boundary at 0xffffffff)
> - Invoking a0/aALL interchangeably
> - Changing the cache flush interval
> - Disabling disk coercion
> - A bunch of other long-shot settings from megacli that aren't worth
>   listing
>
> Nothing has shown any appreciable difference in the behavior.
>
> Does anyone have an idea about what could be going on or anything else
> we can try? For now, I'll probably just disable patrol reads and set
> them to auto/1 hour delay during outage windows only, but I'm hoping
> that someone is able to help with this. At the very least, maybe I can
> save someone a whole bunch of time.
>
> Thanks in advance for any help.
>
> --
> Sean McAfee
> Collaborative Fusion, Inc.
> smcafee@collaborativefusion.com
> 412-422-3463 x 4025
>
> 1710 Murray Avenue, Suite 320
> Pittsburgh, PA 15217
--
Benjie Chen, Ph.D.
Addgene, a better way to share plasmids
www.addgene.org
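
For anyone trying to correlate hangs like the one described in this thread, the controller timeout lines quoted above can be pulled out of a saved console or messages log with a short script. The following is only a rough sketch in Python: the function name summarize_timeouts, the regular expression, and the sample lines are illustrative assumptions built from the single message quoted in the thread, not output from the driver or from megacli. The device name is matched generically (letters followed by a unit number).

```python
import re

# Pattern modeled on the one console line quoted in the thread, e.g.
#   <dev>: COMMAND 0xffffffff892bc998 TIMEOUT AFTER 45 SECONDS
# Assumption: the device name is lowercase letters followed by a unit number.
TIMEOUT_RE = re.compile(
    r"^(?P<dev>[a-z]+\d+): COMMAND (?P<addr>0x[0-9a-fA-F]+) "
    r"TIMEOUT AFTER (?P<secs>\d+) SECONDS"
)


def summarize_timeouts(lines):
    """Map (device, command address) to the longest timeout seen, in seconds."""
    worst = {}
    for line in lines:
        match = TIMEOUT_RE.match(line.strip())
        if match is None:
            continue  # not a controller timeout line
        key = (match.group("dev"), match.group("addr"))
        worst[key] = max(worst.get(key, 0), int(match.group("secs")))
    return worst


if __name__ == "__main__":
    # Hypothetical sample input; only the first line's shape comes from the thread.
    sample = [
        "mfi0: COMMAND 0xffffffff892bc998 TIMEOUT AFTER 45 SECONDS",
        "mfi0: COMMAND 0xffffffff892bc998 TIMEOUT AFTER 75 SECONDS",
        "unrelated kernel message",
    ]
    for (dev, addr), secs in sorted(summarize_timeouts(sample).items()):
        print(f"{dev} {addr}: timed out for at least {secs} seconds")
```

Feeding it the lines of a log file (for example, open(path) on a copy of the console output) and comparing the timestamps of the first hit against the patrol read schedule is one way to check whether a given hang lines up with a patrol read.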