From: Charles Sprickman
Date: Fri, 4 May 2012 01:29:13 -0400
To: freebsd-scsi@freebsd.org
Subject: mfi and "copy out failed" messages

I'm wondering if anyone has some interest in this issue. I think I have recently tracked down the cause of a long-standing fs corruption and panic issue on a Dell 2970 that I was never able to solve:

http://lists.freebsd.org/pipermail/freebsd-fs/2010-July/008858.html

(there are other threads, but that's the gist of the issue)

I'd read in various threads that the "mfiX: Copy out failed" message was harmless. But recently I started thinking that there had to be some relation between those messages and the panics.
The timing fits: I had megacli performing a status check on the controller in a periodic script that kicked off with the daily run. Most of my panics occurred during or shortly after the daily run. The "Copy out failed" messages always corresponded to megacli being run.

132 days ago I removed the daily megacli check, and the box has not had a kernel panic since then. Before this, my longest uptime was not more than a few months. While this is by no means 100% definitive, it sure seems like something is going on here. My best guess is that megacli and the mfi driver are interacting in a bad way, and that the "Copy out failed" message indicates that something that should have reached the controller did not. My earlier assumption was that it was just some control message from megacli that didn't make it, but now I'm thinking it's a request to write actual data to the drive that's failing.

As a reminder, the card in question is:

mfi0: port 0xec00-0xecff mem 0xe9f80000-0xe9fbffff,0xe9fc0000-0xe9ffffff irq 37 at device 0.0 on pci7
mfi0: 3049 (boot + 3s/0x0020/info) - Firmware version 1.22.02-0612
mfi0: 3051 (boot + 23s/0x0020/info) - Controller hardware revision ID (0x0)
mfi0: 3052 (boot + 23s/0x0020/info) - Package version 6.2.0-0013

If anyone with knowledge of the mfi driver would like to comment, I'd very much appreciate it. This box is going to be repurposed in the coming months as an ESXi host to hold some backup/standby VMs, but before that I would not mind taking some time to test any patches, extra debug printfs in mfi, etc. I suspect I can trigger the panic pretty easily by mimicking the daily run conditions: just kick off a find from "/" and then repeatedly loop the megacli command that checks the array health.

The box is still on 7.3, but I'd gladly upgrade to 8.3 and test there if needed once the box is freed up.
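The reproduction idea above (a find from "/" running while the megacli check is looped) could be scripted roughly as follows. This is only a sketch, not something I have run against the controller: the function name stress_loop and its parameters are made up for illustration, and the megacli invocation is passed in by the caller, since the exact command used by the daily script is not given here.

```python
import subprocess
import sys
import time


def stress_loop(load_cmd, check_cmd, iterations=10, delay=1.0):
    """Run load_cmd in the background and invoke check_cmd repeatedly.

    Both commands are argument lists (hypothetical parameters; nothing
    here is specific to megacli or mfi). Returns the exit code of each
    check_cmd run, in order.
    """
    # Background filesystem load, e.g. ["find", "/"]; output is discarded.
    load = subprocess.Popen(load_cmd,
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    codes = []
    try:
        for _ in range(iterations):
            # One status query per pass; only the exit code is kept.
            result = subprocess.run(check_cmd,
                                    stdout=subprocess.DEVNULL,
                                    stderr=subprocess.DEVNULL)
            codes.append(result.returncode)
            time.sleep(delay)
    finally:
        # Stop the background load if it is still running, then reap it.
        if load.poll() is None:
            load.terminate()
        load.wait()
    return codes


# Example call (the check command is an assumption; substitute whatever
# the daily script actually ran):
#   stress_loop(["find", "/"], sys.argv[1:], iterations=100, delay=2.0)
```

Watching the console or /var/log/messages for "Copy out failed" while this runs would show whether the messages line up with the status queries.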
Thanks,

Charles

--
Charles Sprickman
NetEng/SysAdmin
Bway.net - New York's Best Internet
www.bway.net
spork@bway.net - 212.655.9344