From owner-freebsd-scsi@FreeBSD.ORG  Tue Feb 10 07:14:15 2009
Return-Path: <owner-freebsd-scsi@FreeBSD.ORG>
Delivered-To: freebsd-scsi@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 615F2106566B
	for <freebsd-scsi@freebsd.org>; Tue, 10 Feb 2009 07:14:15 +0000 (UTC)
	(envelope-from spork@bway.net)
Received: from xena.bway.net (xena.bway.net [216.220.96.26])
	by mx1.freebsd.org (Postfix) with ESMTP id 0FF348FC08
	for <freebsd-scsi@freebsd.org>; Tue, 10 Feb 2009 07:14:14 +0000 (UTC)
	(envelope-from spork@bway.net)
Received: (qmail 80459 invoked by uid 0); 10 Feb 2009 07:14:14 -0000
Received: from unknown (HELO toasty.nat.fasttrackmonkey.com)
	(spork@96.57.144.66)
	by smtp.bway.net with (DHE-RSA-AES256-SHA encrypted) SMTP;
	10 Feb 2009 07:14:14 -0000
Date: Tue, 10 Feb 2009 02:14:13 -0500 (EST)
From: Charles Sprickman <spork@bway.net>
X-X-Sender: spork@toasty.nat.fasttrackmonkey.com
To: Scott Long <scottl@samsco.org>
In-Reply-To: <alpine.OSX.2.00.0902100135290.37588@toasty.nat.fasttrackmonkey.com>
Message-ID: <alpine.OSX.2.00.0902100206490.37588@toasty.nat.fasttrackmonkey.com>
References: <alpine.OSX.2.00.0902100104170.37588@toasty.nat.fasttrackmonkey.com>
	<49911C68.6030203@samsco.org>
	<alpine.OSX.2.00.0902100135290.37588@toasty.nat.fasttrackmonkey.com>
User-Agent: Alpine 2.00 (OSX 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-scsi@freebsd.org
Subject: Re: 7.1 Panic on degraded disk w/mpt
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
	<mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
	<mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 10 Feb 2009 07:14:15 -0000

On Tue, 10 Feb 2009, Charles Sprickman wrote:

> On Mon, 9 Feb 2009, Scott Long wrote:
>
>> Charles Sprickman wrote:
>>> (posted on -stable already, no takers - added info: full dmesg, crash info 
>>> from panic when array finished rebuilding, some comments on dmesg)
>>> 
>>> Howdy,
>>> 
>>> I dug around and can't find a PR on this, and the only other report I saw 
>>> was in this mailing list post that has no replies:
>>> 
>>> http://www.nabble.com/7.1-BETA2-panic-on-mpt-degrade-td20183173.html
>>> 
>>> The hardware is a Dell PowerEdge 860 with the Dell/LSI SAS5 controller:
>>> 
>>> mpt0: <LSILogic SAS/SATA Adapter> port 0xec00-0xecff mem 
>>> 0xfe9fc000-0xfe9fffff,0xfe9e0000-0xfe9effff irq 16 at device 8.0 on pci2
>>> mpt0: MPI Version=1.5.13.0
>>> 
>>> The panic is repeatable by forcing the array into a degraded state.  When 
>>> the array finishes rebuilding, the box also panics.
>>> 
>>> Here's my best shot at getting info out of kgdb (panic on array going to 
>>> degraded state):
>> 
>> I wonder if the MPT card is temporarily detaching and then reattaching
>> the logical drive when the rebuild completes.
>
> IIRC, just before the panic there is a bunch of CAM debug splattered across 
> the monitor.  I can run down to the garage and snap a few pics of the monitor 
> after detaching a drive.

OK, some more info here.  To be safe, I brought the machine down to 
single-user mode and unmounted everything but /.  It did not panic when 
the drive was removed.  So perhaps a quiet filesystem = no panic.

Here's what gets spit out on the console:

mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x12
mpt0: Unhandled Event Notify Frame. Event 0x12 (ACK not required).
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
(mpt0:vol0:1): Physical Disk Status Changed
mpt0: mpt_cam_event: 0x15
mpt0: Unhandled Event Notify Frame. Event 0x15 (ACK not required).
mpt0: mpt_cam_event: 0x21
mpt0: Unhandled Event Notify Frame. Event 0x21 (ACK not required).
(mpt0:vol0:1): Physical Disk Status Changed
mpt0:vol0(mpt0:0:0): Volume Status Changed
mpt0: mpt_cam_event: 0x15
mpt0: Unhandled Event Notify Frame. Event 0x15 (ACK not required).
mpt0: mpt_cam_event: 0x21
mpt0: Unhandled Event Notify Frame. Event 0x21 (ACK not required).
mpt0: mpt_cam_event: 0x15
mpt0: Unhandled Event Notify Frame. Event 0x15 (ACK not required).
mpt0: mpt_cam_event: 0x21
mpt0: Unhandled Event Notify Frame. Event 0x21 (ACK not required).
mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
mpt0:vol0(mpt0:0:0): Status ( Enabled )
(mpt0:vol0:1): No longer configured
(probe0:mpt0:1:0:0): error 22
(probe0:mpt0:1:0:0): Unretryable Error
(probe2:mpt0:1:2:0): error 22
(probe2:mpt0:1:2:0): Unretryable Error
(probe3:mpt0:1:3:0): error 22
(repeats with probe # increasing...)
(probe1:mpt0:1:1:0): CAM Status 0x19
(probe1:mpt0:1:1:0): Retrying Command
(probe0:mpt0:1:0:0): error 22
(probe0:mpt0:1:0:0): Unretryable Error
(pass1:mpt0:1:0:0): lost device
(pass1:mpt0:1:0:0): removing device entry

So it does appear that at the very least the mpt driver is removing the 
pass device for that drive, right?
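
For anyone comparing console captures between the panicking and 
non-panicking cases, here is a minimal sketch of a tally script.  It 
counts the mpt_cam_event codes and the probe "error N" lines in a pasted 
console log.  The event names in the table are an assumption taken from 
memory of the LSI MPI header (mpi_ioc.h), so check them against your own 
source tree.  Reading "error 22" as an errno value (EINVAL) is likewise 
an assumption about what CAM prints.  The function names and the SAMPLE 
text are made up for illustration.

```python
import errno
import re
from collections import Counter

# Assumption: names recalled from the LSI MPI header (mpi_ioc.h).
# Verify against the header in your own tree before relying on them.
MPI_EVENT_NAMES = {
    0x12: "MPI_EVENT_SAS_PHY_LINK_STATUS",
    0x15: "MPI_EVENT_IR2",
    0x16: "MPI_EVENT_SAS_DISCOVERY",
    0x21: "MPI_EVENT_LOG_ENTRY_ADDED",
}

# Matches only the "mpt_cam_event" lines, so the paired
# "Unhandled Event Notify Frame" lines are not counted twice.
EVENT_RE = re.compile(r"mpt\d+: mpt_cam_event: (0x[0-9a-fA-F]+)")
PROBE_ERR_RE = re.compile(r"\(probe\d+:[^)]*\): error (\d+)")


def summarize(console_text):
    """Return (event code counts, probe error counts) for a console log."""
    events = Counter()
    probe_errors = Counter()
    for line in console_text.splitlines():
        m = EVENT_RE.search(line)
        if m:
            events[int(m.group(1), 16)] += 1
            continue
        m = PROBE_ERR_RE.search(line)
        if m:
            probe_errors[int(m.group(1))] += 1
    return events, probe_errors


def describe(events, probe_errors):
    """Format the counts as human-readable lines."""
    lines = []
    for code, count in sorted(events.items()):
        name = MPI_EVENT_NAMES.get(code, "unknown")
        lines.append(f"event 0x{code:02x} ({name}): {count}")
    for err, count in sorted(probe_errors.items()):
        # Assumption: the number is an errno value.
        name = errno.errorcode.get(err, "unknown")
        lines.append(f"probe error {err} ({name}): {count}")
    return lines


# A short excerpt in the same shape as the console output above.
SAMPLE = """\
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x16
(mpt0:vol0:1): Physical Disk Status Changed
mpt0: mpt_cam_event: 0x15
mpt0: mpt_cam_event: 0x21
(probe0:mpt0:1:0:0): error 22
(probe0:mpt0:1:0:0): Unretryable Error
"""

if __name__ == "__main__":
    for line in describe(*summarize(SAMPLE)):
        print(line)
```

Running the same script over a capture from the degrade and a capture 
from the reattach makes it easier to see whether the two sequences 
differ in anything other than ordering.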

And on reattach:

mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x12
mpt0: Unhandled Event Notify Frame. Event 0x12 (ACK not required).
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: Volume(0:1:0): Physical Disk Status Changed
mpt0: mpt_cam_event: 0x15
mpt0: Unhandled Event Notify Frame. Event 0x15 (ACK not required).
mpt0: mpt_cam_event: 0x21
mpt0: Unhandled Event Notify Frame. Event 0x21 (ACK not required).
(mpt0:vol0:1): Physical (mpt0:0:1:0), Pass-thru (mpt0:1:0:0)
(mpt0:vol0:1): Online
(mpt0:vol0:1): Status ( Out-Of-Sync )
(probe2:mpt0:1:2:0): error 22
(probe2:mpt0:1:2:0): Unretryable Error
(probe3:mpt0:1:3:0): error 22
(rinse, repeat)

pass1 at mpt0 bus 1 target 0 lun 0
pass1: <ATA ST3750640NS G> Fixed unknown SCSI-5 device
pass1: Serial Number             5QD56ZXC
pass1: 300.000MB/s transfers
pass1: Command Queueing Enabled
mpt0: mpt_cam_event: 0x15
mpt0: Unhandled Event Notify Frame. Event 0x15 (ACK not required).
mpt0: mpt_cam_event: 0x21
mpt0: Unhandled Event Notify Frame. Event 0x21 (ACK not required).
mpt0: mpt_cam_event: 0x21
mpt0: Unhandled Event Notify Frame. Event 0x21 (ACK not required).
mpt0: mpt_cam_event: 0x21
mpt0: Unhandled Event Notify Frame. Event 0x21 (ACK not required).
mpt0:vol0(mpt0:0:0): Volume Status Changed
mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
mpt0:vol0(mpt0:0:0): Status ( Enabled Re-Syncing )
mpt0:vol0(mpt0:0:0): High Priority Re-Sync
mpt0:vol0(mpt0:0:0): 1464842240 of 1464842240 blocks remaining

I'm betting it will panic again in a few hours when the rebuild finishes.
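
As a sanity check on the "blocks remaining" figure, here is a quick 
sketch that converts the block count into bytes.  It assumes 512-byte 
blocks, which the console output does not state.  Under that assumption 
the total is consistent with the ST3750640NS in the pass1 probe output, 
whose model number suggests a 750 GB drive.

```python
# Block count copied from the "blocks remaining" console line.
BLOCKS_REMAINING = 1464842240

# Assumption: 512-byte blocks; the console output does not say.
BLOCK_SIZE = 512


def volume_bytes(blocks, block_size=BLOCK_SIZE):
    """Convert a block count into a size in bytes."""
    return blocks * block_size


size = volume_bytes(BLOCKS_REMAINING)
print(f"{size} bytes")                      # 749999226880 bytes
print(f"{size / 10**9:.2f} GB (decimal)")   # 750.00 GB (decimal)
print(f"{size / 2**30:.2f} GiB (binary)")   # 698.49 GiB (binary)
```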

I'll try the detach again tomorrow with all the filesystems mounted and 
I'll make sure there are some pending writes when I detach.  If I see 
anything interesting before the panic message on screen, I'll grab it.

Thanks,

Charles