Date: Tue, 8 Mar 2011 19:33:16 -0500 (EST)
From: Neil Schelly <nschelly@dyn.com>
To: freebsd-scsi@freebsd.org
Subject: Re: Serious Dell Sadness - H200, H700, and H800
Message-ID: <4139036.97089.1299630796345.JavaMail.root@mail.corp>
In-Reply-To: <28269840.97080.1299630735538.JavaMail.root@mail.corp>

I'm sorry, I left out the attachment in my earlier post. Please see it here.

--
Neil Schelly
Director of Uptime
Dynamic Network Services, Inc.
W: 603-296-1581
M: 508-410-4776
http://www.dyndns.com

----- Original Message -----
> We've got some more information about the mpt testing we've been doing
> here. The setup we're testing is Dell PowerEdge r610 servers with PERC
> H800 SAS/RAID cards connected to MD1200 shelves full of 12 SAS drives.
> We've recreated the same problem on other configurations, including
> combinations of r510s, MD1220 shelves, PERC H700 cards, etc. We've
> also eliminated any particular piece of hardware as faulty by running
> these on identical hardware configurations in mirrored setups on
> different physical pieces of hardware. We've experienced these issues
> in FreeBSD 7.3, 8.1, and 8.2. We've experienced this issue with either
> RAID10 logical drive configurations formatted with UFS or 6-disk JBOD
> configurations set up in a ZFS raidz volume. We've triggered the
> problem with both bonnie++ and iozone. All machines are running the
> latest firmware on the H700 and H800 cards.
>
> The easiest method to reproduce this problem is with a ZFS
> configuration and using `iozone -a`. We have a 6-disk raidz partition
> with a ZFS filesystem on it. We just run `iozone -a` from within that
> filesystem, and I'd say 3 out of 4 times, it will eventually pause.
> After 45-50 seconds of pausing, you'll start seeing the console and
> /var/log/messages output that looks something like:
> mfi0: COMMAND 0xffffff8000db5fe0 TIMEOUT AFTER 105 SECONDS
>
> If we let it go for a few days, it may actually "finish" and recover,
> but it's essentially just stuck and not recovering. The system is
> responsive and fully operational except for the dead controller at
> this point.
> We cannot kill the iozone process that is hung on these IO
> operations, even with `kill -9`. Like others have reported, we can run
> any of the mfiutil commands and the controller immediately begins to
> respond normally again. Usually, the iozone test will complete, but
> sometimes it will even get stuck again on the same run.
>
> We compiled mfiutil with debugging symbols so we could run it with gdb
> and see exactly what was causing the controller to become responsive
> again. It's the ioctl() call that does it. For example:
>
> `mfiutil show volumes` eventually gets to something like:
> mfi_dcmd_command (fd=7, opcode=50397184, buf=0x7fffffffe4a0,
> bufsize=1032, mbox=0x0, mboxlen=0, statusp=0x0)
> at /usr/src/usr.sbin/mfiutil/mfi_cmd.c:257
> * fd=7 is /dev/mfi0, where the command will be sent with an ioctl
> command
> * opcode=50397184 is the MFI_DCMD_LD_GET_LIST command
>
> `mfiutil show battery` eventually gets to something like:
> mfi_dcmd_command (fd=7, opcode=84017152, buf=0x7fffffffea20,
> bufsize=48, mbox=0x0, mboxlen=0, statusp=0x7fffffffe9cf "")
> at /usr/src/usr.sbin/mfiutil/mfi_cmd.c:257
> * fd=7 is /dev/mfi0, where the command will be sent with an ioctl
> command
> * opcode=84017152 is the MFI_DCMD_BBU_GET_CAPACITY_INFO command
>
> I wrote a small self-contained C program that can easily be modified
> to run any ioctl command you'd like and send it to /dev/mfi0
> (attached). Use it if you'd like at your own risk, but it's
> essentially just running an arbitrary command with ioctl, putting
> nothing into the memory range normally passed by the *buf pointer. I
> did try sending random opcodes, and it didn't work, so it does have to
> be an opcode that the firmware will recognize at least, but it doesn't
> seem to matter which one.
>
> I'm not sure where else we should be looking for a fix. We can
> reliably reproduce the problem, analyze the system during the issue,
> and recover the system to a normal state.
> If there's anyone who can help us troubleshoot this with any
> information we can gather or even a local login remotely accessible,
> we're open to ideas.
>
> --
> Neil Schelly
> Director of Uptime
> Dynamic Network Services, Inc.
> W: 603-296-1581
> M: 508-410-4776
> http://www.dyndns.com