Message-ID: <4B81C41F.2080601@freebsd.org>
Date: Mon, 22 Feb 2010 10:39:11 +1100
From: Lawrence Stewart
To: Alexander Motin
References: <201002141938.o1EJcRpx065470@svn.freebsd.org> <4B7D4962.8070706@freebsd.org> <4B7EC763.4090507@FreeBSD.org>
In-Reply-To: <4B7EC763.4090507@FreeBSD.org>
Cc: svn-src-stable@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org, svn-src-stable-8@FreeBSD.org
Subject: Re: svn commit: r203889 - in stable/8/sys: cam cam/ata cam/scsi dev/ahci dev/asr dev/ata dev/ciss dev/hptiop dev/hptrr dev/mly dev/mpt dev/ppbus dev/siis dev/trm dev/twa dev/usb/storage

On 02/20/10 04:16, Alexander Motin wrote:
> Lawrence Stewart wrote:
>> A couple of times it has gotten even more upset reporting things like this:
>>
>> mpt0: mpt_cam_event: 0x16
>> mpt0: mpt_cam_event: 0x16
>> mpt0: request 0xffffff80002f1400:54058 timed out for ccb
>> 0xffffff0001c65000 (req->ccb 0xffffff0001c65000)
>> mpt0: attempting to abort req 0xffffff80002f1400:54058 function 0
>> mpt0: request 0xffffff80002fd100:54059 timed out for ccb
>> 0xffffff009f3ec800 (req->ccb 0xffffff009f3ec800)
>> mpt0: request 0xffffff80002efcf0:54060 timed out for ccb
>> 0xffffff0001bd2000 (req->ccb 0xffffff0001bd2000)
>> mpt0: mpt_recover_commands: IOC Status 0x4a. Resetting controller.
>> mpt0: mpt_cam_event: 0x0
>> mpt0: mpt_cam_event: 0x0
>> mpt0: completing timedout/aborted req 0xffffff80002f1400:54058
>> mpt0: completing timedout/aborted req 0xffffff80002fd100:54059
>> mpt0: completing timedout/aborted req 0xffffff80002efcf0:54060
>> mpt0: mpt_cam_event: 0x16
>> mpt0: mpt_cam_event: 0x12
>> mpt0: mpt_cam_event: 0x12
>> mpt0: mpt_cam_event: 0x16
>> mpt0: Volume(0:2): Volume Status Changed
>> mpt0: request 0xffffff80002f8990:0 timed out for ccb 0xffffff009f3cb800
>> (req->ccb 0)
>>
>> No ill effects are observed after such an episode and the array remains
>> in healthy as-normal state. The only observable problem is the stall of
>> all disk IO while these events occur.
>
> I have no idea how mpt driver works, neither I have hardware to play,
> but quick look shows that 0x12 event is MPI_EVENT_SAS_PHY_LINK_STATUS,
> and 0x16 is MPI_EVENT_SAS_DISCOVERY. Both are not handled by mpt driver
> and so logged. I would say something is going on at physical level of
> your SAN. Timeouts are also could be the result of physical issues.

Ok, I'll try and figure out what's possibly going on.

>
>> As best I can tell, the hardware is ok, both disks report as fine
>> without SMART errors and are only 2 months old, so wanted to rule out
>> software issues. On upgrading to recent 8-STABLE, I got a page fault
>> kernel panic on boot in the mpt driver mpt_raid0 kproc. After some trial
>> and error, r203888 is the most recent revision that boots fine, whilst
>> r203889 exhibits the page fault. I should also note that r203888 still
>> sees the "mpt0: mpt_cam_event: 0x16" messages and associated disk IO
>> stalls.
>>
>> I compiled DDB into my r203889 kernel. Unfortunately my ILO emulates a
>> USB keyboard so I can't do anything in DDB which is a huge pain, but
>> here's the info I did get (hand transcribed):
>>
>> Fatal trap 12: page fault while in kernel mode
>> current process: mpt_raid0
>> Stopped at xpt_rescan+0x1d: movq 0x10(%rsi),%rdx
>>
>> 1. Any thoughts on how to resolve the regression in the mpt driver with
>> the r203889 commit?
>
> Any thoughts where to find a good telepath? :)
>
> For the beginning, show at least verbose boot messages up to the crash.
> Full panic message could also be useful, it may show address of the
> fault instruction, which may be resolved to source line with addr2line
> tool. If you could find a good old PS/2 keyboard, backtrace would be
> interesting to see.

2 issues:

- The server is in colocated rack space and not easy to get to
- I'm not even sure that this server has PS2 ports on it

Perhaps this commit should be backed out of 8-STABLE until we get a chance to diagnose a bit more?

Cheers,
Lawrence
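
For anyone following the thread, the two event codes named above can be picked out of a console log with a short script. This is only a rough sketch in Python and not part of the mpt driver: the function name `decode_events` and the sample `log` list are made up for illustration, and the table holds only the two codes identified in this thread (0x12 and 0x16). Any other code, including the 0x0 seen in the log, is simply reported as unknown.

```python
import re

# Event codes named in this thread; any other code is reported as unknown.
EVENT_NAMES = {
    0x12: "MPI_EVENT_SAS_PHY_LINK_STATUS",
    0x16: "MPI_EVENT_SAS_DISCOVERY",
}

# Matches console lines of the form "mpt0: mpt_cam_event: 0x16".
EVENT_LINE = re.compile(r"^(\w+): mpt_cam_event: (0x[0-9a-fA-F]+)\s*$")


def decode_events(lines):
    """Return a (device, code, name) tuple for each mpt_cam_event line."""
    decoded = []
    for line in lines:
        match = EVENT_LINE.match(line.strip())
        if match is None:
            continue  # not an mpt_cam_event line, skip it
        code = int(match.group(2), 16)
        name = EVENT_NAMES.get(code, "unknown")
        decoded.append((match.group(1), code, name))
    return decoded


# Hypothetical sample input, copied from the log lines quoted in this thread.
log = [
    "mpt0: mpt_cam_event: 0x16",
    "mpt0: mpt_cam_event: 0x12",
    "mpt0: Volume(0:2): Volume Status Changed",
    "mpt0: mpt_cam_event: 0x0",
]

for device, code, name in decode_events(log):
    print(f"{device}: event {code:#x} -> {name}")
```

Counting how often each code appears around a stall might help correlate the disk IO pauses with link or discovery events, though that is a guess about what would be useful rather than a tested diagnostic.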