Date:      Tue, 8 Mar 2011 19:33:16 -0500 (EST)
From:      Neil Schelly <nschelly@dyn.com>
To:        freebsd-scsi@freebsd.org
Subject:   Re: Serious Dell Sadness - H200, H700, and H800
Message-ID:  <4139036.97089.1299630796345.JavaMail.root@mail.corp>
In-Reply-To: <28269840.97080.1299630735538.JavaMail.root@mail.corp>

I'm sorry, I left out the attachment in my earlier post.  Please see it here.

--
Neil Schelly
Director of Uptime
Dynamic Network Services, Inc.
W: 603-296-1581
M: 508-410-4776
http://www.dyndns.com

----- Original Message -----
> We've got some more information about the mpt testing we've been doing
> here. The setup we're testing is Dell PowerEdge r610 servers with PERC
> H800 SAS/RAID cards connected to MD1200 shelves full of 12 SAS drives.
> We've recreated the same problem on other configurations, including
> combinations of r510s, MD1220 shelves, PERC H700 cards, etc. We've
> also ruled out any single faulty component by reproducing the problem
> on identically configured but physically separate machines. We've
> seen these issues on FreeBSD 7.3, 8.1, and 8.2. They occur with either
> RAID10 logical drive configurations formatted with UFS or 6-disk JBOD
> configurations set up in a ZFS raidz volume. We've triggered the
> problem with both bonnie++ and iozone. All machines are running the
> latest firmware on the H700 and H800 cards.
> 
> The easiest method to reproduce this problem is with a ZFS
> configuration and using `iozone -a`. We have a 6-disk raidz partition
> with a ZFS filesystem on it. We just run `iozone -a` from within that
> filesystem, and I'd say 3 out of 4 times, it will eventually pause.
> After 45-50 seconds of pausing, you'll start seeing output on the
> console and in /var/log/messages that looks something like:
> mfi0: COMMAND 0xffffff8000db5fe0 TIMEOUT AFTER 105 SECONDS
> 
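For anyone who wants to watch for this condition automatically, a minimal C sketch that recognizes the timeout line quoted above might look like the following. The struct and function names are hypothetical, and the only format assumed is the single sample line shown; the search loop is there so that a syslog timestamp and host prefix in front of "mfi0:" does not prevent a match.

```c
#include <stdio.h>
#include <string.h>

/* Fields of one timeout message (hypothetical helper type). */
struct mfi_timeout {
    int unit;                    /* controller unit, e.g. 0 for mfi0 */
    unsigned long long cmd_addr; /* address printed after COMMAND */
    unsigned int seconds;        /* seconds reported in the message */
};

/*
 * Returns 1 and fills *out if the line contains a message shaped like
 * "mfi0: COMMAND 0x... TIMEOUT AFTER 105 SECONDS", otherwise returns 0
 * and leaves *out untouched. Text before "mfi" is skipped.
 */
int parse_mfi_timeout(const char *line, struct mfi_timeout *out)
{
    const char *p = line;

    while ((p = strstr(p, "mfi")) != NULL) {
        struct mfi_timeout tmp;
        int end = 0;

        /* %n is only stored if the trailing "SECONDS" also matched. */
        if (sscanf(p, "mfi%d: COMMAND %llx TIMEOUT AFTER %u SECONDS%n",
                   &tmp.unit, &tmp.cmd_addr, &tmp.seconds, &end) == 3 &&
            end > 0) {
            *out = tmp;
            return 1;
        }
        p += 3; /* skip this "mfi" and keep searching */
    }
    return 0;
}
```

Feeding each line of /var/log/messages through parse_mfi_timeout() would be one way to notice a hung controller without watching the console.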
> If we let it go for a few days, it may actually "finish" and recover,
> but for practical purposes it is stuck. At this point the system is
> responsive and fully operational apart from the hung controller. We
> cannot kill the iozone process that is blocked on these I/O
> operations, even with `kill -9`. As others have reported, running any
> mfiutil command makes the controller respond normally again
> immediately. Usually the iozone test will then complete, but
> sometimes it gets stuck again during the same run.
> 
> We compiled mfiutil with debugging symbols so we could run it with gdb
> and see exactly what was causing the controller to become responsive
> again. It's the ioctl() call that does it. For example:
> 
> `mfiutil show volumes` eventually gets to something like:
> mfi_dcmd_command (fd=7, opcode=50397184, buf=0x7fffffffe4a0,
> bufsize=1032, mbox=0x0, mboxlen=0, statusp=0x0)
> at /usr/src/usr.sbin/mfiutil/mfi_cmd.c:257
> * fd=7 is /dev/mfi0, where the command will be sent with an ioctl
> command
> * opcode=50397184 is the MFI_DCMD_LD_GET_LIST command
> 
> `mfiutil show battery` eventually gets to something like:
> mfi_dcmd_command (fd=7, opcode=84017152, buf=0x7fffffffea20,
> bufsize=48, mbox=0x0, mboxlen=0, statusp=0x7fffffffe9cf "")
> at /usr/src/usr.sbin/mfiutil/mfi_cmd.c:257
> * fd=7 is /dev/mfi0, where the command will be sent with an ioctl
> command
> * opcode=84017152 is the MFI_DCMD_BBU_GET_CAPACITY_INFO command
> 
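gdb prints the opcode argument in decimal, which hides its structure. The small C sketch below shows the same two values in hexadecimal next to the command names given above. The function names are hypothetical, and the table covers only the two opcodes identified in this message; it is not a copy of any driver header.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Name for the two DCMD opcodes identified in this message, or NULL for
 * anything else. Hypothetical helper, not part of mfiutil.
 */
const char *mfi_dcmd_name(uint32_t opcode)
{
    switch (opcode) {
    case 0x03010000: /* 50397184 in decimal */
        return "MFI_DCMD_LD_GET_LIST";
    case 0x05020000: /* 84017152 in decimal */
        return "MFI_DCMD_BBU_GET_CAPACITY_INFO";
    default:
        return NULL;
    }
}

/*
 * Writes "decimal = 0xhex (name)" into buf and returns what snprintf
 * returns (the length the full string needs).
 */
int format_opcode(uint32_t opcode, char *buf, size_t len)
{
    const char *name = mfi_dcmd_name(opcode);

    return snprintf(buf, len, "%lu = 0x%08lx (%s)",
                    (unsigned long)opcode, (unsigned long)opcode,
                    name != NULL ? name : "unknown");
}
```

Calling format_opcode() on 50397184 and 84017152 gives 0x03010000 and 0x05020000 respectively.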
> I wrote a small self-contained C program that can easily be modified
> to run any ioctl command you'd like and send it to /dev/mfi0
> (attached). Use it at your own risk if you'd like; it essentially
> just issues an arbitrary command with ioctl() and puts nothing into
> the memory range normally passed via the *buf pointer. I did try
> sending random opcodes, and that did not revive the controller, so
> the opcode has to be one the firmware recognizes, but it doesn't seem
> to matter which one.
> 
> I'm not sure where else we should be looking for a fix. We can
> reliably reproduce the problem, analyze the system during the issue,
> and recover the system to a normal state. If anyone can help us
> troubleshoot this, we can gather whatever information is useful or
> even provide remote login access to an affected machine. We're open
> to ideas.
> 
> --
> Neil Schelly
> Director of Uptime
> Dynamic Network Services, Inc.
> W: 603-296-1581
> M: 508-410-4776
> http://www.dyndns.com
