From owner-freebsd-hardware@FreeBSD.ORG  Sun Sep 30 01:18:18 2007
Return-Path: <owner-freebsd-hardware@FreeBSD.ORG>
Delivered-To: freebsd-hardware@freebsd.org
Message-ID: <c53be070709291818u5b7b81d7l5ac6f318336f2101@mail.gmail.com>
Date: Sat, 29 Sep 2007 21:18:17 -0400
From: "Benjie Chen" <benjie@addgene.org>
To: "Sean McAfee" <smcafee@collaborativefusion.com>
In-Reply-To: <46FD6E94.2080608@collaborativefusion.com>
MIME-Version: 1.0
References: <46FD6E94.2080608@collaborativefusion.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
X-Content-Filtered-By: Mailman/MimeDel 2.1.5
Cc: freebsd-hardware@freebsd.org
Subject: Re: PERC5 (LSI MegaSAS) Patrol Read crashes

I can confirm this problem on a PE1950 with the 6.2 i386 kernel as well.
Manual patrol read started by megacli crashes the system.

Thanks,
Benjie

On 9/28/07, Sean McAfee <smcafee@collaborativefusion.com> wrote:
>
>
> We first became aware of this problem about a month ago.  A database
> server was up but was completely unresponsive to anything other than
> pings.  I power cycled it via the DRAC and after we couldn't find
> anything suspicious in the logs, we figured it was a fluke.
>
> Until the next day, when its twin did the exact same thing.  This time,
> I was able to get a screen shot through the DRAC console.  Using old
> daily outputs and that screenshot, we correlated the crashes to patrol
> reads.  Since then, we've only seen it "in the wild" on one other
> machine, a 1950, but I've been trying to chase the problem down without
> much luck.
>
> I'm fortunate to have three machines at my disposal for this testing, so
> I was able to try a variety of combinations:
>
> Server 1:
> Chassis:          2950 v1
> System BIOS:      1.1.0
> PERC firmware:    1.00.01-0088 PERC F/W (from the 5.0.1-0030 A00 package)
> OS:               6.2-R_p7, 6-STABLE
>
> Server 2:
> Chassis:          2950 v1
> System BIOS:      1.1.0
> PERC firmware:    1.03.10-0216 PERC F/W (from the 5.1.1-0040 package)
> OS:               6.2-R_p7, 6-STABLE
>
> Server 3:
> Chassis:          2950 v2
> System BIOS:      1.5.1
> PERC firmware:    1.03.10-0216 PERC F/W (from the 5.1.1-0040 package)
> OS:               6.2-R_p7
>
> They're all running amd64, and each combination was tried with and
> without the linux_mfi.ko patches found in PR-113232.  For disks, they all
> have 2x36GB RAID1 and 4x73GB RAID10 (all SAS).  We use
> linux_mfi.ko + linux-megacli for management.
>
> The original problem occurred during automatic patrol reads coupled with
> heavy disk load.  I've changed the delay interval for the automatic
> patrol reads and tried to reproduce it, but haven't been able to trigger
> it reliably enough to be useful for troubleshooting.  Since the automatic
> reads are meant to be as unintrusive as possible, I've been running a
> manual patrol read (megacli -AdpPR -Start -a0), which triggers a crash
> regardless of the I/O load.
>
> The behavior varies little, if at all: shortly after the read is
> started, disk writes cease (shown via an scp from another machine).
> After a minute, the console begins to fill up with lines such as:
>
> mfi0: COMMAND 0xffffffff892bc998 TIMEOUT AFTER 45 SECONDS
>
> The first 8 hex digits of that value never change.  I bring that up
> because I suspect the problem has something to do with the enclosure,
> which is attached at 8, 255, or fffffff, depending on where you're looking.
>
> I've let it go up to 6000 seconds, but it eventually ends in a kernel
> panic.  That just seems to be a side effect of the original problem
> (processes with nowhere to write data), so I'm not too hung up on that.
>
> There's never anything pertaining to it in the controller's event log.
>
> Besides the platform version differences I mentioned above, I've tried:
> - Reducing the patrol read rate
> - Pulling down and modifying the patches from PR-115133 (which seems to
> set an upper boundary at 0xffffffff)
> - Invoking a0/aALL interchangeably
> - Changing the cache flush interval
> - Disabling disk coercion
> - A bunch of other long-shot settings from megacli that aren't worth
> listing
>
> Nothing has shown any appreciable difference in the behavior.
>
> Does anyone have an idea about what could be going on or anything else
> we can try?  For now, I'll probably just disable patrol reads and set
> them to auto/1 hour delay during outage windows only, but I'm hoping that
> someone is able to help with this.  At the very least, maybe I can save
> someone a whole bunch of time.
>
> Thanks in advance for any help.
>
> --
> Sean McAfee
> Collaborative Fusion, Inc.
>   smcafee@collaborativefusion.com
>   412-422-3463 x 4025
>
> 1710 Murray Avenue, Suite 320
> Pittsburgh, PA 15217
>

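On the observation quoted above that the first 8 hex digits of the
timeout value never change: one way to check what that value is likely to
be is to split it into its upper and lower 32-bit halves.  The sketch
below is only an illustration.  It assumes the number printed after
COMMAND is a kernel pointer to the driver's command structure, and that
the amd64 kernel is mapped at or above 0xffffffff80000000.  Both are
assumptions on my part, not something established in this thread, and the
function names are made up for the example.  If those assumptions hold, a
constant 0xffffffff prefix would be expected for any kernel address and
would not by itself point at the enclosure.

```python
import re

# Matches the timeout lines quoted above; the device prefix is ignored.
TIMEOUT_RE = re.compile(r"COMMAND 0x([0-9a-fA-F]+) TIMEOUT AFTER (\d+) SECONDS")

# ASSUMPTION: base of the kernel mapping on FreeBSD/amd64.
ASSUMED_KERNBASE = 0xFFFFFFFF80000000


def parse_timeouts(lines):
    """Return (address, seconds) pairs for every timeout line found."""
    found = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            found.append((int(match.group(1), 16), int(match.group(2))))
    return found


def split_address(addr):
    """Split a 64-bit value into its upper and lower 32-bit halves."""
    return addr >> 32, addr & 0xFFFFFFFF


def looks_like_kernel_pointer(addr):
    """True if addr lies at or above the assumed kernel base address."""
    return addr >= ASSUMED_KERNBASE


if __name__ == "__main__":
    # The only sample is the line from the quoted report.
    sample = ["COMMAND 0xffffffff892bc998 TIMEOUT AFTER 45 SECONDS"]
    for addr, secs in parse_timeouts(sample):
        upper, lower = split_address(addr)
        print(f"{secs}s upper=0x{upper:08x} lower=0x{lower:08x} "
              f"kernel_pointer={looks_like_kernel_pointer(addr)}")
```

Feeding a saved console log through parse_timeouts() would also show
whether the lower half changes between messages, i.e. whether it is
always the same command timing out or several different ones.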

-- 
Benjie Chen, Ph.D.
Addgene, a better way to share plasmids
www.addgene.org