From: Sean McAfee
Date: Fri, 28 Sep 2007 17:13:56 -0400
To: freebsd-hardware@freebsd.org
Subject: PERC5 (LSI MegaSAS) Patrol Read crashes
Message-ID: <46FD6E94.2080608@collaborativefusion.com>

We first became aware of this problem about a month ago. A database server was up but completely unresponsive to anything other than pings. I power cycled it via the DRAC, and when we couldn't find anything suspicious in the logs, we figured it was a fluke. Until the next day, when its twin did exactly the same thing. This time, I was able to get a screenshot through the DRAC console. Using old daily outputs and that screenshot, we correlated the crashes to patrol reads.
Since then, we've only seen it "in the wild" on one other machine, a 1950, but I've been trying to chase the problem down without much luck. I'm fortunate to have three machines at my disposal for this testing, so I was able to try a variety of combinations:

Server 1:
  Chassis: 2950 v1
  System BIOS: 1.1.0
  PERC firmware: 1.00.01-0088 PERC F/W (from the 5.0.1-0030 A00 package)
  OS: 6.2-R_p7, 6-STABLE

Server 2:
  Chassis: 2950 v1
  System BIOS: 1.1.0
  PERC firmware: 1.03.10-0216 PERC F/W (from the 5.1.1-0040 package)
  OS: 6.2-R_p7, 6-STABLE

Server 3:
  Chassis: 2950 v2
  System BIOS: 1.5.1
  PERC firmware: 1.03.10-0216 PERC F/W (from the 5.1.1-0040 package)
  OS: 6.2-R_p7

They're all running amd64, and each combination was tried with and without the linux_mfi.ko patches found in PR-113232. For disks, they all have 2x36GB RAID1 and 4x73GB RAID10 (all SAS). We use linux_mfi.ko+linux-megacli for management.

The original problem occurred during automatic patrol reads coupled with heavy disk load. I've changed the delay interval for the automatic patrol reads and tried to reproduce it, but haven't had enough success to make it useful for troubleshooting. Since the automatic reads are meant to be as unobtrusive as possible, I've been running a manual patrol read (megacli -AdpPR -Start -a0), which triggers a crash regardless of what I/O is like.

The behavior has little to no variation: shortly after the read is started, disk writes cease (shown via an scp from another machine). After a minute, the console will begin to fill up with lines such as:

  mfi0: COMMAND 0xffffffff892bc998 TIMEOUT AFTER 45 SECONDS

The first 8 digits of the hex value never change. I bring that up because I suspect the problem has something to do with the enclosure, which is attached at 8, 255, or fffffff, depending on where you're looking. I've let it go up to 6000 seconds, but it eventually ends in a kernel panic.
That just seems to be a side effect of the original problem (processes with nowhere to write data), so I'm not too hung up on that. There's never anything pertaining to it in the controller's event log.

Besides the platform version differences I mentioned above, I've tried:

- Reducing the patrol read rate
- Pulling down and modifying the patches from PR-115133 (which seems to set an upper boundary at 0xffffffff)
- Invoking a0/aALL interchangeably
- Changing the cache flush interval
- Disabling disk coercion
- A bunch of other long-shot settings from megacli that aren't worth listing

Nothing has shown any appreciable difference in the behavior.

Does anyone have an idea about what could be going on or anything else we can try? For now, I'll probably just disable patrol reads and set them to auto/1 hour delay during outage windows only, but I'm hoping that someone is able to help with this. At the very least, maybe I can save someone a whole bunch of time.

Thanks in advance for any help.

--
Sean McAfee
Collaborative Fusion, Inc.
smcafee@collaborativefusion.com
412-422-3463 x 4025
1710 Murray Avenue, Suite 320
Pittsburgh, PA 15217
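
For anyone comparing notes on the "COMMAND ... TIMEOUT AFTER ... SECONDS" console lines described above, here is a minimal sketch (not part of the testing in this post) that groups such lines by device and command address, so it is easy to see whether one command is stuck or several are. The function names and the second sample line are hypothetical, made up for illustration. The regex accepts any device name rather than assuming a particular driver. One hedged observation: on amd64 a leading ffffffff is also what a kernel virtual address normally looks like, so the low 32 bits are likely the more informative half when comparing captures.

```python
import re
from collections import defaultdict

# Matches console lines of the form quoted in the post, e.g.
#   mfi0: COMMAND 0xffffffff892bc998 TIMEOUT AFTER 45 SECONDS
TIMEOUT_RE = re.compile(
    r"(?P<dev>[a-z]+\d+): COMMAND (?P<addr>0x[0-9a-fA-F]+) "
    r"TIMEOUT AFTER (?P<secs>\d+) SECONDS"
)


def summarize_timeouts(lines):
    """Group timeout messages by (device, command address).

    Returns a dict mapping (device, address) to a dict holding the number
    of messages seen and the largest reported timeout in seconds.
    """
    summary = defaultdict(lambda: {"count": 0, "max_secs": 0})
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m is None:
            continue  # not a timeout line; ignore it
        key = (m.group("dev"), int(m.group("addr"), 16))
        entry = summary[key]
        entry["count"] += 1
        entry["max_secs"] = max(entry["max_secs"], int(m.group("secs")))
    return dict(summary)


def split_address(addr):
    """Split a 64-bit address into its high and low 32-bit halves."""
    return addr >> 32, addr & 0xFFFFFFFF


if __name__ == "__main__":
    # The first line is the one quoted in the post; the second is a
    # made-up follow-up line, included only to show the grouping.
    sample = [
        "mfi0: COMMAND 0xffffffff892bc998 TIMEOUT AFTER 45 SECONDS",
        "mfi0: COMMAND 0xffffffff892bc998 TIMEOUT AFTER 75 SECONDS",
    ]
    for (dev, addr), info in summarize_timeouts(sample).items():
        high, low = split_address(addr)
        print(f"{dev} high={high:08x} low={low:08x} "
              f"count={info['count']} max={info['max_secs']}s")
```

Feeding it a saved console capture (for example, the lines of a text file read with open(...)) would show at a glance whether every message refers to the same command address or to several.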