Date: Fri, 19 Feb 2010 01:06:26 +1100
From: Lawrence Stewart <lstewart@freebsd.org>
To: Alexander Motin <mav@FreeBSD.org>
Cc: svn-src-stable@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org, svn-src-stable-8@FreeBSD.org
Subject: Re: svn commit: r203889 - in stable/8/sys: cam cam/ata cam/scsi dev/ahci dev/asr dev/ata dev/ciss dev/hptiop dev/hptrr dev/mly dev/mpt dev/ppbus dev/siis dev/trm dev/twa dev/usb/storage
Message-ID: <4B7D4962.8070706@freebsd.org>
In-Reply-To: <201002141938.o1EJcRpx065470@svn.freebsd.org>
References: <201002141938.o1EJcRpx065470@svn.freebsd.org>

Hi Alexander and all,

On 02/15/10 06:38, Alexander Motin wrote:
> Author: mav
> Date: Sun Feb 14 19:38:27 2010
> New Revision: 203889
> URL: http://svn.freebsd.org/changeset/base/203889
>
> Log:
> MFC r203108:
> Large set of CAM improvements:

[snip]

I've been having issues with the mpt-driven LSI SAS adapter in my SunFire X4100 server running FreeBSD 8-STABLE r202132. Under certain disk workloads, like running an svn update of the src tree or a kernel compile, the disk subsystem becomes extremely unresponsive, in a stall-like state, and /var/log/messages reports a number of these:

mpt0: mpt_cam_event: 0x16

It does eventually come good after a minute or two, even though the svn op or build is still running. It may then repeat the stalled/good behaviour a few times, sometimes with minutes between events.

A couple of times it has gotten even more upset, reporting things like this:

mpt0: mpt_cam_event: 0x16
mpt0: mpt_cam_event: 0x16
mpt0: request 0xffffff80002f1400:54058 timed out for ccb 0xffffff0001c65000 (req->ccb 0xffffff0001c65000)
mpt0: attempting to abort req 0xffffff80002f1400:54058 function 0
mpt0: request 0xffffff80002fd100:54059 timed out for ccb 0xffffff009f3ec800 (req->ccb 0xffffff009f3ec800)
mpt0: request 0xffffff80002efcf0:54060 timed out for ccb 0xffffff0001bd2000 (req->ccb 0xffffff0001bd2000)
mpt0: mpt_recover_commands: IOC Status 0x4a. Resetting controller.
mpt0: mpt_cam_event: 0x0
mpt0: mpt_cam_event: 0x0
mpt0: completing timedout/aborted req 0xffffff80002f1400:54058
mpt0: completing timedout/aborted req 0xffffff80002fd100:54059
mpt0: completing timedout/aborted req 0xffffff80002efcf0:54060
mpt0: mpt_cam_event: 0x16
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x16
mpt0: Volume(0:2): Volume Status Changed
mpt0: request 0xffffff80002f8990:0 timed out for ccb 0xffffff009f3cb800 (req->ccb 0)

No ill effects are observed after such an episode and the array remains in a healthy, as-normal state.
The only observable problem is the stall of all disk IO while these events occur.

The disk configuration is 2 x 320GB WD3200BEKT 7200RPM SATA HDDs in RAID1. The hardware reports itself as:

mpt0: <LSILogic SAS/SATA Adapter> port 0xa800-0xa8ff mem 0xfc4fc000-0xfc4fffff,0xfc4e0000-0xfc4effff irq 28 at device 3.0 on pci2
mpt0: [ITHREAD]
mpt0: MPI Version=1.5.13.0
mpt0: Capabilities: ( RAID-0 RAID-1E RAID-1 )
mpt0: 1 Active Volume (2 Max)
mpt0: 2 Hidden Drive Members (10 Max)

mpt0@pci0:2:3:0: class=0x010000 card=0x30601000 chip=0x00501000 rev=0x02 hdr=0x00
    vendor     = 'LSI Logic (Was: Symbios Logic, NCR)'
    device     = 'SAS 3000 series, 4-port with 1064 -StorPort'
    class      = mass storage
    subclass   = SCSI

As best I can tell, the hardware is ok: both disks report as fine without SMART errors and are only 2 months old, so I wanted to rule out software issues.

On upgrading to recent 8-STABLE, I got a page fault kernel panic on boot in the mpt driver's mpt_raid0 kproc. After some trial and error, r203888 is the most recent revision that boots fine, whilst r203889 exhibits the page fault. I should also note that r203888 still sees the "mpt0: mpt_cam_event: 0x16" messages and associated disk IO stalls.

I compiled DDB into my r203889 kernel. Unfortunately my ILO emulates a USB keyboard, so I can't do anything in DDB, which is a huge pain, but here's the info I did get (hand transcribed):

Fatal trap 12: page fault while in kernel mode
current process: mpt_raid0
Stopped at xpt_rescan+0x1d: movq 0x10(%rsi),%rdx

So there are two separate issues here:

1. Any thoughts on how to resolve the regression in the mpt driver with the r203889 commit?

2. Any thoughts on the behaviour I'm seeing with the mpt_cam_event messages? Is it possible it's just a driver issue? Is the hardware likely bad? I'm really hoping they'll go away once the driver issue is resolved, as the freezes are fairly unacceptable on a production machine and the hardware appears to pass all checks I've done so far.

Cheers,
Lawrence