From owner-freebsd-scsi@FreeBSD.ORG Wed Mar 9 00:49:35 2011
Date: Tue, 8 Mar 2011 19:33:16 -0500 (EST)
From: Neil Schelly
To: freebsd-scsi@freebsd.org
Message-ID: <4139036.97089.1299630796345.JavaMail.root@mail.corp>
In-Reply-To: <28269840.97080.1299630735538.JavaMail.root@mail.corp>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="----=_Part_97088_3026873.1299630796344"
X-Content-Filtered-By: Mailman/MimeDel 2.1.5
Subject: Re: Serious Dell Sadness - H200, H700, and H800

------=_Part_97088_3026873.1299630796344
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

I'm sorry, I left out the attachment in my earlier post. Please see it here.

--
Neil Schelly
Director of Uptime
Dynamic Network Services, Inc.
W: 603-296-1581
M: 508-410-4776
http://www.dyndns.com

----- Original Message -----
> We've got some more information about the mpt testing we've been doing
> here. The setup we're testing is Dell PowerEdge r610 servers with PERC
> H800 SAS/RAID cards connected to MD1200 shelves full of 12 SAS drives.
> We've recreated the same problem on other configurations, including
> combinations of r510s, MD1220 shelves, PERC H700 cards, etc. We've
> also eliminated any particular piece of hardware as faulty by running
> these on identical hardware configurations in mirrored setups on
> different physical pieces of hardware. We've experienced these issues
> in FreeBSD 7.3, 8.1, and 8.2. We've experienced this issue with either
> RAID10 logical drive configurations formatted with UFS or 6-disk JBOD
> configurations set up in a ZFS raidz volume. We've triggered the
> problem with both bonnie++ and iozone. All machines are running the
> latest firmware on the H700 and H800 cards.
>
> The easiest method to reproduce this problem is with a ZFS
> configuration and using `iozone -a`. We have a 6-disk raidz partition
> with a ZFS filesystem on it. We just run `iozone -a` from within that
> filesystem, and I'd say 3 out of 4 times, it will eventually pause.
> After 45-50 seconds of pausing, you'll start seeing the console and
> /var/log/messages output that looks something like:
> mfi0: COMMAND 0xffffff8000db5fe0 TIMEOUT AFTER 105 SECONDS
>
> If we let it go for a few days, it may actually "finish" and recover,
> but it's essentially just stuck and not recovering. The system is
> responsive and fully operational except the dead controller at this
> point. We cannot kill the iozone process that is hung on these IO
> operations, even with `kill -9`.
> Like others have reported, we can run
> any of the mfiutil commands and the controller immediately begins to
> respond normally again. Usually, the iozone test will complete, but
> sometimes it will even get stuck again on the same run.
>
> We compiled mfiutil with debugging symbols so we could run it with gdb
> and see exactly what was causing the controller to become responsive
> again. It's the ioctl() call that does it. For example:
>
> `mfiutil show volumes` eventually gets to something like:
> mfi_dcmd_command (fd=7, opcode=50397184, buf=0x7fffffffe4a0,
> bufsize=1032, mbox=0x0, mboxlen=0, statusp=0x0)
> at /usr/src/usr.sbin/mfiutil/mfi_cmd.c:257
> * fd=7 is /dev/mfi0, where the command will be sent with an ioctl
> command
> * opcode=50397184 is the MFI_DCMD_LD_GET_LIST command
>
> `mfiutil show battery` eventually gets to something like:
> mfi_dcmd_command (fd=7, opcode=84017152, buf=0x7fffffffea20,
> bufsize=48, mbox=0x0, mboxlen=0, statusp=0x7fffffffe9cf "")
> at /usr/src/usr.sbin/mfiutil/mfi_cmd.c:257
> * fd=7 is /dev/mfi0, where the command will be sent with an ioctl
> command
> * opcode=84017152 is the MFI_DCMD_BBU_GET_CAPACITY_INFO command
>
> I wrote a small self-contained C program that can easily be modified
> to run any ioctl command you'd like and send it to /dev/mfi0
> (attached). Use it if you'd like at your own risk, but it's
> essentially just running an arbitrary command with ioctl, putting
> nothing into the memory range normally passed by the *buf pointer. I
> did try sending random opcodes, and it didn't work, so it does have to
> be an opcode that the firmware will recognize at least, but it doesn't
> seem to matter which one.
>
> I'm not sure where else we should be looking for a fix. We can
> reliably reproduce the problem, analyze the system during the issue,
> and recover the system to a normal state.
> If there's anyone who can
> help us troubleshoot this with any information we can gather or even a
> local login remotely accessible, we're open to ideas.
>
> --
> Neil Schelly
> Director of Uptime
> Dynamic Network Services, Inc.
> W: 603-296-1581
> M: 508-410-4776
> http://www.dyndns.com

------=_Part_97088_3026873.1299630796344--
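
A note on the opcodes quoted above: gdb prints them in decimal, which makes them hard to match against hex command constants. The snippet below is only a rough sketch of the conversion. The two opcode values and their command names are copied from the gdb frames in the message; the helper name `opcode_hex` and the dictionary name are made up for illustration and are not part of mfiutil.

```python
# Decimal opcodes exactly as gdb printed them in the mfi_dcmd_command()
# frames quoted in the message, keyed by the command names given there.
GDB_OPCODES = {
    "MFI_DCMD_LD_GET_LIST": 50397184,
    "MFI_DCMD_BBU_GET_CAPACITY_INFO": 84017152,
}


def opcode_hex(opcode: int) -> str:
    """Format a 32-bit opcode as zero-padded hex (hypothetical helper)."""
    return f"0x{opcode & 0xFFFFFFFF:08x}"


if __name__ == "__main__":
    for name, value in GDB_OPCODES.items():
        print(f"{name}: {value} -> {opcode_hex(value)}")
```

The hex form is probably the easier one to compare against the MFI_DCMD_* definitions in the driver sources when checking which command a given gdb frame was sending.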