From owner-freebsd-scsi@FreeBSD.ORG  Thu Nov  8 00:35:20 2012
Return-Path: <owner-freebsd-scsi@FreeBSD.ORG>
Delivered-To: freebsd-scsi@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
 by hub.freebsd.org (Postfix) with ESMTP id 7B5ECCA3;
 Thu,  8 Nov 2012 00:35:20 +0000 (UTC)
 (envelope-from prvs=1659aa6059=killing@multiplay.co.uk)
Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23])
 by mx1.freebsd.org (Postfix) with ESMTP id 770088FC12;
 Thu,  8 Nov 2012 00:35:18 +0000 (UTC)
Received: from r2d2 ([188.220.16.49])
 by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23])
 (MDaemon PRO v10.0.4) with ESMTP id md50000985388.msg;
 Thu, 08 Nov 2012 00:35:17 +0000
X-Spam-Processed: mail1.multiplay.co.uk, Thu, 08 Nov 2012 00:35:17 +0000
 (not processed: message from valid local sender)
X-MDRemoteIP: 188.220.16.49
X-Return-Path: prvs=1659aa6059=killing@multiplay.co.uk
X-Envelope-From: killing@multiplay.co.uk
Message-ID: <0B4E8AFF9DA04C6EBD2496A8B58F1D67@multiplay.co.uk>
From: "Steven Hartland" <killing@multiplay.co.uk>
To: "Doug Ambrisko" <ambrisko@ambrisko.com>
References: <2DC1C56CFFF24FE0B17C34AD21A7DFAA@multiplay.co.uk>
 <39D16C43C8274CE9B8F23C18459E2FD4@multiplay.co.uk>
 <20121105212911.GA17904@ambrisko.com>
 <27169C7FE704495087A093752D15E7B6@multiplay.co.uk>
 <20121106180152.GA40422@ambrisko.com>
 <6B5B65F4FC854EB8BBC701500096602E@multiplay.co.uk>
Subject: Re: mfi panic on recused on non-recusive mutex MFI I/O lock
Date: Thu, 8 Nov 2012 00:35:22 -0000
MIME-Version: 1.0
Content-Type: multipart/mixed;
 boundary="----=_NextPart_000_0112_01CDBD48.EFF94C40"
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2900.5931
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157
Cc: freebsd-scsi@freebsd.org, freebsd-stable@freebsd.org
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 08 Nov 2012 00:35:20 -0000

This is a multi-part message in MIME format.

------=_NextPart_000_0112_01CDBD48.EFF94C40
Content-Type: text/plain;
	format=flowed;
	charset="iso-8859-1";
	reply-type=response
Content-Transfer-Encoding: 7bit

----- Original Message ----- 
From: "Steven Hartland"
>> On Tue, Nov 06, 2012 at 12:09:42AM -0000, Steven Hartland wrote:
>> | Thanks Doug, actually just finished another test run with some more
>> | debugging in and I believe I've found the reason for the non-recusive
>> | lock and at least some of the queuing issues.
>> | 
>> | The non-recursive lock is due to the mfi_tbolt_reset calling
>> | mfi_process_fw_state_chg_isr with mfi_io_lock held which in turn calls
>> | mfi_tbolt_init_MFI_queue which tries to acquire mfi_io_lock hence
>> | the problem.
>> | 
>> | mfi-lock.txt attached I believe fixes this as well as what appears
>> | to be an invalid call to mtx_unlock(&sc->mfi_io_lock) in mfi_attach
>> | which never acquires the lock as far as can see, possibly a cut and
>> | paste error.
>> 
>> I don't seem to see the attachment.
> 
> Yer seems like some mail fail by me there, but I've had some more locking
> panics during todays tests anyway, requiring additional fixes. Will update
> and post when I'm happy with it.

OK, two patches attached.
== zz-mfi-lock.patch ==
Fixes the mfi panic "recursed on non-recursive mutex MFI I/O lock".

Removes an mtx_unlock call on mfi_io_lock in mfi_attach, where the lock is
never acquired.

== zz-mfi-queue.patch ==
Fixes queuing issues where mfi_release_command blindly sets cm_flags = 0
without first removing the command from the relevant queue.

This was causing panics in the queue functions which check to ensure a command
is not on another queue.

Also fixed some cases where the error from mfi_mapcmd was lost and where the
command was never released / dequeued in error cases.

Ensures that all mfi_mapcmd failures are logged.

Fixed a possible NULL pointer dereference in mfi_aen_setup if
mfi_get_log_state failed.

Fixed mfi_parse_entries & mfi_aen_setup not returning possible errors.

Corrected MFI_DUMP_CMDS calls that passed the invalid variable SC instead of sc.

Commands which have timed out now set cm_error to ETIMEDOUT and call
mfi_complete which prevents them getting stuck in the busy queue forever.

Fixed possible use of NULL pointer in mfi_tbolt_get_cmd

Changed output formats to be more easily recognisable when debugging.

A few style(9) fixes, e.g. braced single-line conditions and double blank
lines.
----------
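
To make the mfi_release_command fix easier to review, here is a rough
user-space sketch of the idea: dequeue the command first, and only then
clear its flags. It is not the driver code. The struct names, the ON_Q_*
flag bits and the helper functions are made up for illustration, and it
assumes a sys/queue.h that provides the usual TAILQ macros.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-ins for the MFI_ON_MFIQ_* flag bits. */
#define ON_Q_READY	0x1
#define ON_Q_BUSY	0x2
#define ON_Q_MASK	(ON_Q_READY | ON_Q_BUSY)

struct cmd {
	TAILQ_ENTRY(cmd) link;
	int flags;
};
TAILQ_HEAD(cmdq, cmd);

struct softc {
	struct cmdq ready;
	struct cmdq busy;
	struct cmdq free;
};

static void
softc_init(struct softc *sc)
{
	TAILQ_INIT(&sc->ready);
	TAILQ_INIT(&sc->busy);
	TAILQ_INIT(&sc->free);
}

/* Put cm on a queue; it must not already be on one. */
static void
enqueue(struct cmdq *q, struct cmd *cm, int flag)
{
	assert((cm->flags & ON_Q_MASK) == 0);
	TAILQ_INSERT_TAIL(q, cm, link);
	cm->flags |= flag;
}

/* Take cm off whichever queue its flags say it is on. */
static void
dequeue(struct softc *sc, struct cmd *cm)
{
	if ((cm->flags & ON_Q_BUSY) != 0) {
		TAILQ_REMOVE(&sc->busy, cm, link);
		cm->flags &= ~ON_Q_BUSY;
	}
	if ((cm->flags & ON_Q_READY) != 0) {
		TAILQ_REMOVE(&sc->ready, cm, link);
		cm->flags &= ~ON_Q_READY;
	}
}

/* Release: dequeue first, only then clear the flags and free the command. */
static void
release_command(struct softc *sc, struct cmd *cm)
{
	dequeue(sc, cm);
	assert((cm->flags & ON_Q_MASK) == 0);
	cm->flags = 0;
	TAILQ_INSERT_TAIL(&sc->free, cm, link);
}
```

If release_command skipped the dequeue step and just zeroed the flags, the
command would sit on the busy queue and the free queue at once, which is
the situation the queue assertions complain about.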

I've just had another panic, trace below, but it doesn't seem to be related
to my changes so I'd appreciate your feedback on them as they are for now.

While the lock patch fixes the problems I've seen, it's not clear to me
why mfi_tbolt_reset is acquiring the lock and hence requiring
mfi_process_fw_state_chg_isr to jump through hoops to ensure locking
around queue manipulation is done correctly. Given what it's doing
(resetting the entire adapter) I wouldn't be surprised if it should
really be acquiring the config lock.
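
For anyone who wants the shape of the recursion without reading the driver,
here is a toy single-threaded model of it. The toy_mtx type and all the
function names are invented for this sketch. A real mtx(9) panics on
recursion rather than returning EDEADLK, so treat this as an illustration
of the call pattern only.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy non-recursive lock: a single-threaded model, not a real mutex. */
struct toy_mtx {
	bool held;
};

static int
toy_lock(struct toy_mtx *m)
{
	if (m->held)
		return (EDEADLK);	/* the kernel panics at this point */
	m->held = true;
	return (0);
}

static void
toy_unlock(struct toy_mtx *m)
{
	assert(m->held);	/* unlocking an unheld lock is also a bug */
	m->held = false;
}

/* Stands in for mfi_tbolt_init_MFI_queue: takes the lock itself. */
static int
init_queue(struct toy_mtx *io_lock)
{
	int error;

	if ((error = toy_lock(io_lock)) != 0)
		return (error);
	/* ... queue setup would happen here ... */
	toy_unlock(io_lock);
	return (0);
}

/* Unpatched shape: the lock is still held when the callee locks again. */
static int
state_change_unpatched(struct toy_mtx *io_lock)
{
	int error;

	if ((error = toy_lock(io_lock)) != 0)
		return (error);
	error = init_queue(io_lock);	/* recursion: EDEADLK */
	toy_unlock(io_lock);
	return (error);
}

/* Patched shape: drop the lock around the callee, then retake it. */
static int
state_change_patched(struct toy_mtx *io_lock)
{
	int error;

	if ((error = toy_lock(io_lock)) != 0)
		return (error);
	toy_unlock(io_lock);
	error = init_queue(io_lock);
	(void)toy_lock(io_lock);	/* retake for the rest of the handler */
	toy_unlock(io_lock);
	return (error);
}
```

Dropping and retaking the lock like this leaves a window where other code
can touch the queues, which is part of why I'm not convinced the I/O lock
is the right one to hold here.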

Other things I've noticed / questions:
* Should mfi_abort sleep even if its call to mfi_mapcmd fails?
* Should mfi_get_controller_info really ignore the error from mfi_mapcmd?
* Do these controllers not support non-512-byte requests? Currently
all syspd requests are done assuming 512 byte sectors, which the disk may
not have. This will either reduce performance or potentially break
completely if the firmware isn't translating it under the surface correctly.

Anyway, the new panic, manually transcribed, is:
panic: Bad link elm 0xffffff0069b0fc0 next->prev != elm
...
mfi_tbolt_get_cmd()
mfi_build_mpt_pass_thru()
mfi_tbolt_build_mpt_cmd()
mfi_tbolt_send_frame()
bus_dmamap_load()
mfi_mapcmd()
mfi_startio()
mfi_syspd_strategy()
g_disk_start()
g_io_schedule_down()
g_down_proc_body()
fork_exit()
fork_trampoline()

Looks like mfi_cmd_tbolt_tqh has become corrupted somehow, but as far as I
can tell all manipulation is done using the TAILQ macros and under
mfi_io_lock, so it's not obvious to me at this time why this is. Any ideas?
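
For reference, this is roughly what that panic text is checking, written
as a user-space sketch. The struct and function names are invented, and
the second removal in the usage note relies on TAILQ_REMOVE leaving the
removed element's link fields untouched, which is an assumption about the
sys/queue.h in use. I'm not claiming a double removal is what happened
here, only that it is one way to trip the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

struct cmd {
	TAILQ_ENTRY(cmd) next;
	int index;
};
TAILQ_HEAD(cmdq, cmd);

/*
 * Mirror of the check behind the panic text: if elm has a successor,
 * the successor's back pointer must point at elm's forward pointer.
 */
static bool
next_link_ok(struct cmd *elm)
{
	struct cmd *n;

	assert(elm != NULL);
	n = TAILQ_NEXT(elm, next);
	return (n == NULL || n->next.tqe_prev == &elm->next.tqe_next);
}

/* Remove elm only if its links look sane; report whether it was removed. */
static bool
checked_remove(struct cmdq *q, struct cmd *elm)
{
	if (!next_link_ok(elm))
		return (false);		/* the kernel would panic here */
	TAILQ_REMOVE(q, elm, next);
	return (true);
}
```

With two commands on the queue, removing the first one succeeds, and a
second checked_remove of the same command is rejected because its
successor no longer points back at it. Note the check only looks at the
next element, so a stale command that was last on the queue slips through.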

There was an obvious error in mfi_tbolt_get_cmd, which is now fixed in the
queue patch, where cmd could be used even if the queue was empty and
TAILQ_FIRST returned NULL, but I can't see this causing this panic.
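
The empty-queue guard, reduced to a stand-alone sketch so the caller side
is visible too. The names tcmd, get_cmd and build_request are made up for
the example, and returning EBUSY from the caller is my assumption about a
sensible error, not what the driver necessarily does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/queue.h>

struct tcmd {
	TAILQ_ENTRY(tcmd) next;
	int index;
};
TAILQ_HEAD(tcmdq, tcmd);

/* Pop the first free command, or NULL when the pool is exhausted. */
static struct tcmd *
get_cmd(struct tcmdq *pool)
{
	struct tcmd *cmd;

	if ((cmd = TAILQ_FIRST(pool)) == NULL)
		return (NULL);
	TAILQ_REMOVE(pool, cmd, next);
	return (cmd);
}

/* Hypothetical caller: turn an empty pool into an error, not a crash. */
static int
build_request(struct tcmdq *pool, struct tcmd **out)
{
	struct tcmd *cmd;

	assert(out != NULL);
	if ((cmd = get_cmd(pool)) == NULL)
		return (EBUSY);
	*out = cmd;
	return (0);
}
```

The guard only helps if every caller checks for NULL as well, which is
worth auditing along the mfi_build_mpt_pass_thru path in the trace.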

This is running with a debug kernel with:-
options         WITNESS
options         INVARIANTS
options         INVARIANT_SUPPORT
options         DDB
options         GDB
options         PRINTF_BUFR_SIZE=2048
options         MFI_DEBUG

Unfortunately I've only got this hardware till Friday, so any ideas would
be most appreciated so I can get testing done before then.

    Regards
    Steve

------=_NextPart_000_0112_01CDBD48.EFF94C40
Content-Type: application/octet-stream;
	name="zz-mfi-lock.patch"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;
	filename="zz-mfi-lock.patch"

Fixes mfi panic: recursed on non-recursive mutex MFI I/O lock=0A=
=0A=
Removes a mtx_unlock call for mfi_io_lock which is never acquired=0A=
--- sys/dev/mfi/mfi.c.orig	2012-11-07 14:40:33.960774577 +0000=0A=
+++ sys/dev/mfi/mfi.c	2012-11-07 14:50:28.267789676 +0000=0A=
@@ -728,10 +728,8 @@=0A=
 		    "hook\n");=0A=
 		return (EINVAL);=0A=
 	}=0A=
-	if ((error =3D mfi_aen_setup(sc, 0), 0) !=3D 0) {=0A=
-		mtx_unlock(&sc->mfi_io_lock);=0A=
+	if ((error =3D mfi_aen_setup(sc, 0)) !=3D 0)=0A=
 		return (error);=0A=
-	}=0A=
 =0A=
 	/*=0A=
 	 * Register a shutdown handler.=0A=
--- sys/dev/mfi/mfi_tbolt.c.orig	2012-11-07 12:21:56.249116533 +0000=0A=
+++ sys/dev/mfi/mfi_tbolt.c	2012-11-07 14:50:28.268789748 +0000=0A=
@@ -1194,6 +1194,7 @@=0A=
 			sc->hw_crit_error=3D 1;=0A=
 			return ;=0A=
 		}=0A=
+		mtx_unlock(&sc->mfi_io_lock);=0A=
 		if ((error =3D mfi_tbolt_init_MFI_queue(sc)) !=3D 0)=0A=
 				return;=0A=
 =0A=
@@ -1225,7 +1226,9 @@=0A=
 			/*=0A=
 			 * Initiate AEN (Asynchronous Event Notification)=0A=
 			 */=0A=
+			mtx_unlock(&sc->mfi_io_lock);=0A=
 			mfi_aen_setup(sc, sc->last_seq_num);=0A=
+			mtx_lock(&sc->mfi_io_lock);=0A=
 			sc->issuepend_done =3D 1;=0A=
 			device_printf(sc->mfi_dev, "second stage of reset "=0A=
 			    "complete, FW is ready now.\n");=0A=
@@ -1237,7 +1240,6 @@=0A=
 		device_printf(sc->mfi_dev, "mfi_process_fw_state_chg_isr "=0A=
 		    "called with unhandled value:%d\n", sc->adpreset);=0A=
 	}=0A=
-	mtx_unlock(&sc->mfi_io_lock);=0A=
 }=0A=
 =0A=
 /*=0A=

------=_NextPart_000_0112_01CDBD48.EFF94C40
Content-Type: application/octet-stream;
	name="zz-mfi-queue.patch"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;
	filename="zz-mfi-queue.patch"

Fixes queuing issues where mfi_release_command blindly sets the cm_flags =
=3D 0=0A=
without first removing the command from the relevant queue.=0A=
=0A=
This was causing panics in the queue functions which check to ensure a =
command=0A=
is not on another queue.=0A=
=0A=
Also fixed some cases where the error from mfi_mapcmd was lost and where =
the=0A=
command was never released / dequeued in error cases.=0A=
=0A=
Ensure that all failures to mfi_mapcmd are logged=0A=
=0A=
Fixed possible null pointer exception in mfi_aen_setup if =
mfi_get_log_state=0A=
failed.=0A=
=0A=
Fixed mfi_parse_entries & mfi_aen_setup not returning possible errors=0A=
=0A=
Corrected MFI_DUMP_CMDS calls with invalid vars SC vs sc=0A=
=0A=
Commands which have timed out now set cm_error to ETIMEDOUT and call=0A=
mfi_complete which prevents them getting stuck in the busy queue forever.=0A=
=0A=
Fixed possible use of NULL pointer in mfi_tbolt_get_cmd=0A=
=0A=
Changed output formats to be more easily recognisable when debugging.=0A=
=0A=
A few style (9) fixes e.g. braced single line conditions and double =
blank lines=0A=
--- sys/dev/mfi/mfi.c.orig	2012-11-07 14:50:28.267789676 +0000=0A=
+++ sys/dev/mfi/mfi.c	2012-11-07 14:55:16.768525352 +0000=0A=
@@ -837,6 +837,23 @@=0A=
 		cm->cm_sg->sg32[0].addr =3D 0;=0A=
 	}=0A=
 =0A=
+	/*=0A=
+	 * Command may be on other queues e.g. busy queue depending on the=0A=
+	 * flow of a previous call to mfi_mapcmd, so ensure its dequeued=0A=
+	 * properly=0A=
+	 */=0A=
+	if ((cm->cm_flags & MFI_ON_MFIQ_BUSY) !=3D 0)=0A=
+		mfi_remove_busy(cm);=0A=
+	if ((cm->cm_flags & MFI_ON_MFIQ_READY) !=3D 0)=0A=
+		mfi_remove_ready(cm);=0A=
+=0A=
+	/* We're not expecting it to be on any other queue but check */=0A=
+	if ((cm->cm_flags & MFI_ON_MFIQ_MASK) !=3D 0) {=0A=
+		printf("command %p is still on another queue, flags =3D %#x\n",=0A=
+		    cm, cm->cm_flags);=0A=
+		panic("command is still on a queue");=0A=
+	}=0A=
+=0A=
 	hdr_data =3D (uint32_t *)cm->cm_frame;=0A=
 	hdr_data[0] =3D 0;	/* cmd, sense_len, cmd_status, scsi_status */=0A=
 	hdr_data[1] =3D 0;	/* target_id, lun_id, cdb_len, sg_count */=0A=
@@ -950,15 +967,12 @@=0A=
 	cm->cm_data =3D NULL;=0A=
 	cm->cm_flags =3D MFI_CMD_POLLED;=0A=
 =0A=
-	if ((error =3D mfi_mapcmd(sc, cm)) !=3D 0) {=0A=
+	if ((error =3D mfi_mapcmd(sc, cm)) !=3D 0)=0A=
 		device_printf(sc->mfi_dev, "failed to send init command\n");=0A=
-		mtx_unlock(&sc->mfi_io_lock);=0A=
-		return (error);=0A=
-	}=0A=
 	mfi_release_command(cm);=0A=
 	mtx_unlock(&sc->mfi_io_lock);=0A=
 =0A=
-	return (0);=0A=
+	return (error);=0A=
 }=0A=
 =0A=
 static int=0A=
@@ -1046,27 +1060,26 @@=0A=
 	class_locale.members.evt_class  =3D mfi_event_class;=0A=
 =0A=
 	if (seq_start =3D=3D 0) {=0A=
-		error =3D mfi_get_log_state(sc, &log_state);=0A=
+		if((error =3D mfi_get_log_state(sc, &log_state)) !=3D 0)=0A=
+			goto out;=0A=
 		sc->mfi_boot_seq_num =3D log_state->boot_seq_num;=0A=
-		if (error) {=0A=
-			if (log_state)=0A=
-				free(log_state, M_MFIBUF);=0A=
-			return (error);=0A=
-		}=0A=
 =0A=
 		/*=0A=
 		 * Walk through any events that fired since the last=0A=
 		 * shutdown.=0A=
 		 */=0A=
-		mfi_parse_entries(sc, log_state->shutdown_seq_num,=0A=
-		    log_state->newest_seq_num);=0A=
+		if((error =3D mfi_parse_entries(sc, log_state->shutdown_seq_num,=0A=
+		    log_state->newest_seq_num)) !=3D 0)=0A=
+			goto out;=0A=
 		seq =3D log_state->newest_seq_num;=0A=
 	} else=0A=
 		seq =3D seq_start;=0A=
-	mfi_aen_register(sc, seq, class_locale.word);=0A=
-	free(log_state, M_MFIBUF);=0A=
+	error =3D mfi_aen_register(sc, seq, class_locale.word);=0A=
+out:=0A=
+	if (log_state)=0A=
+		free(log_state, M_MFIBUF);=0A=
 =0A=
-	return 0;=0A=
+	return (error);=0A=
 }=0A=
 =0A=
 int=0A=
@@ -1076,7 +1089,6 @@=0A=
 	mtx_assert(&sc->mfi_io_lock, MA_OWNED);=0A=
 	cm->cm_complete =3D NULL;=0A=
 =0A=
-=0A=
 	/*=0A=
 	 * MegaCli can issue a DCMD of 0.  In this case do nothing=0A=
 	 * and return 0 to it as status=0A=
@@ -1310,9 +1322,8 @@=0A=
 	cm->cm_flags =3D MFI_CMD_POLLED;=0A=
 	cm->cm_data =3D NULL;=0A=
 =0A=
-	if ((error =3D mfi_mapcmd(sc, cm)) !=3D 0) {=0A=
+	if ((error =3D mfi_mapcmd(sc, cm)) !=3D 0)=0A=
 		device_printf(sc->mfi_dev, "Failed to shutdown controller\n");=0A=
-	}=0A=
 =0A=
 	mfi_release_command(cm);=0A=
 	mtx_unlock(&sc->mfi_io_lock);=0A=
@@ -1796,6 +1807,7 @@=0A=
 			mtx_lock(&sc->mfi_io_lock);=0A=
 			mfi_release_command(cm);=0A=
 			mtx_unlock(&sc->mfi_io_lock);=0A=
+			error =3D EIO;=0A=
 			break;=0A=
 		}=0A=
 		mtx_lock(&sc->mfi_io_lock);=0A=
@@ -1824,7 +1836,7 @@=0A=
 	}=0A=
 =0A=
 	free(el, M_MFIBUF);=0A=
-	return (0);=0A=
+	return (error);=0A=
 }=0A=
 =0A=
 static int=0A=
@@ -1941,11 +1953,12 @@=0A=
 	dcmd->mbox[0]=3Did;=0A=
 	dcmd->header.scsi_status =3D 0;=0A=
 	dcmd->header.pad0 =3D 0;=0A=
-	if (mfi_mapcmd(sc, cm) !=3D 0) {=0A=
+	if ((error =3D mfi_mapcmd(sc, cm)) !=3D 0) {=0A=
 		device_printf(sc->mfi_dev,=0A=
 		    "Failed to get physical drive info %d\n", id);=0A=
 		free(pd_info, M_MFIBUF);=0A=
-		return (0);=0A=
+		mfi_release_command(cm);=0A=
+		return (error);=0A=
 	}=0A=
 	bus_dmamap_sync(sc->mfi_buffer_dmat, cm->cm_dmamap,=0A=
 	    BUS_DMASYNC_POSTREAD);=0A=
@@ -2211,11 +2224,14 @@=0A=
 	if ((hdr->cmd_status !=3D MFI_STAT_OK) || (hdr->scsi_status !=3D 0)) {=0A=
 		bio->bio_flags |=3D BIO_ERROR;=0A=
 		bio->bio_error =3D EIO;=0A=
-		device_printf(sc->mfi_dev, "I/O error, status=3D %d "=0A=
-		    "scsi_status=3D %d\n", hdr->cmd_status, hdr->scsi_status);=0A=
+		device_printf(sc->mfi_dev, "I/O error, status=3D%#x "=0A=
+		    "scsi_status=3D%#x\n", hdr->cmd_status, hdr->scsi_status);=0A=
 		mfi_print_sense(cm->cm_sc, cm->cm_sense);=0A=
 	} else if (cm->cm_error !=3D 0) {=0A=
 		bio->bio_flags |=3D BIO_ERROR;=0A=
+		bio->bio_error =3D cm->cm_error;=0A=
+		device_printf(sc->mfi_dev, "I/O error, error=3D%#x\n",=0A=
+		    cm->cm_error);=0A=
 	}=0A=
 =0A=
 	mfi_release_command(cm);=0A=
@@ -2251,6 +2267,9 @@=0A=
 =0A=
 		/* Send the command to the controller */=0A=
 		if (mfi_mapcmd(sc, cm) !=3D 0) {=0A=
+			device_printf(sc->mfi_dev, "Failed to startio\n");=0A=
+			if ((cm->cm_flags & MFI_ON_MFIQ_BUSY) !=3D 0)=0A=
+				mfi_remove_busy(cm);=0A=
 			mfi_requeue_ready(cm);=0A=
 			break;=0A=
 		}=0A=
@@ -2374,7 +2393,7 @@=0A=
 	cm->cm_extra_frames =3D (cm->cm_total_frame_size - 1) / MFI_FRAME_SIZE;=0A=
 =0A=
 	if (sc->MFA_enabled)=0A=
-			mfi_tbolt_send_frame(sc, cm);=0A=
+		mfi_tbolt_send_frame(sc, cm);=0A=
 	else=0A=
 		mfi_send_frame(sc, cm);=0A=
 =0A=
@@ -2466,7 +2485,7 @@=0A=
 {=0A=
 	struct mfi_command *cm;=0A=
 	struct mfi_abort_frame *abort;=0A=
-	int i =3D 0;=0A=
+	int i =3D 0, error;=0A=
 	uint32_t context =3D 0;=0A=
 =0A=
 	mtx_lock(&sc->mfi_io_lock);=0A=
@@ -2490,7 +2509,8 @@=0A=
 	cm->cm_data =3D NULL;=0A=
 	cm->cm_flags =3D MFI_CMD_POLLED;=0A=
 =0A=
-	mfi_mapcmd(sc, cm);=0A=
+	if ((error =3D mfi_mapcmd(sc, cm)) !=3D 0)=0A=
+		device_printf(sc->mfi_dev, "failed to abort command\n");=0A=
 	mfi_release_command(cm);=0A=
 =0A=
 	mtx_unlock(&sc->mfi_io_lock);=0A=
@@ -2506,7 +2526,7 @@=0A=
 		mtx_unlock(&sc->mfi_io_lock);=0A=
 	}=0A=
 =0A=
-	return (0);=0A=
+	return (error);=0A=
 }=0A=
 =0A=
 int=0A=
@@ -2544,7 +2564,8 @@=0A=
 	cm->cm_total_frame_size =3D MFI_IO_FRAME_SIZE;=0A=
 	cm->cm_flags =3D MFI_CMD_POLLED | MFI_CMD_DATAOUT;=0A=
 =0A=
-	error =3D mfi_mapcmd(sc, cm);=0A=
+	if ((error =3D mfi_mapcmd(sc, cm)) !=3D 0)=0A=
+		device_printf(sc->mfi_dev, "failed dump blocks\n");=0A=
 	bus_dmamap_sync(sc->mfi_buffer_dmat, cm->cm_dmamap,=0A=
 	    BUS_DMASYNC_POSTWRITE);=0A=
 	bus_dmamap_unload(sc->mfi_buffer_dmat, cm->cm_dmamap);=0A=
@@ -2587,7 +2608,8 @@=0A=
 	cm->cm_total_frame_size =3D MFI_PASS_FRAME_SIZE;=0A=
 	cm->cm_flags =3D MFI_CMD_POLLED | MFI_CMD_DATAOUT | MFI_CMD_SCSI;=0A=
 =0A=
-	error =3D mfi_mapcmd(sc, cm);=0A=
+	if ((error =3D mfi_mapcmd(sc, cm)) !=3D 0)=0A=
+		device_printf(sc->mfi_dev, "failed dump blocks\n");=0A=
 	bus_dmamap_sync(sc->mfi_buffer_dmat, cm->cm_dmamap,=0A=
 	    BUS_DMASYNC_POSTWRITE);=0A=
 	bus_dmamap_unload(sc->mfi_buffer_dmat, cm->cm_dmamap);=0A=
@@ -3643,7 +3665,7 @@=0A=
 =0A=
 #if 0=0A=
 		if (timedout)=0A=
-			MFI_DUMP_CMDS(SC);=0A=
+			MFI_DUMP_CMDS(sc);=0A=
 #endif=0A=
 =0A=
 		mtx_unlock(&sc->mfi_io_lock);=0A=
@@ -3656,7 +3678,7 @@=0A=
 mfi_timeout(void *data)=0A=
 {=0A=
 	struct mfi_softc *sc =3D (struct mfi_softc *)data;=0A=
-	struct mfi_command *cm;=0A=
+	struct mfi_command *cm, *tmp;=0A=
 	time_t deadline;=0A=
 	int timedout =3D 0;=0A=
 =0A=
@@ -3669,7 +3691,7 @@=0A=
 		}=0A=
 	}=0A=
 	mtx_lock(&sc->mfi_io_lock);=0A=
-	TAILQ_FOREACH(cm, &sc->mfi_busy, cm_link) {=0A=
+	TAILQ_FOREACH_SAFE(cm, &sc->mfi_busy, cm_link, tmp) {=0A=
 		if (sc->mfi_aen_cm =3D=3D cm || sc->mfi_map_sync_cm =3D=3D cm)=0A=
 			continue;=0A=
 		if (cm->cm_timestamp < deadline) {=0A=
@@ -3682,6 +3704,13 @@=0A=
 				     );=0A=
 				MFI_PRINT_CMD(cm);=0A=
 				MFI_VALIDATE_CMD(sc, cm);=0A=
+				/*=0A=
+				 * Fail the command instead of leaving it on=0A=
+				 * the queue where it could remain stuck forever=0A=
+				 */=0A=
+				mfi_remove_busy(cm);=0A=
+				cm->cm_error =3D ETIMEDOUT;=0A=
+				mfi_complete(sc, cm);=0A=
 				timedout++;=0A=
 			}=0A=
 		}=0A=
@@ -3689,7 +3718,7 @@=0A=
 =0A=
 #if 0=0A=
 	if (timedout)=0A=
-		MFI_DUMP_CMDS(SC);=0A=
+		MFI_DUMP_CMDS(sc);=0A=
 #endif=0A=
 =0A=
 	mtx_unlock(&sc->mfi_io_lock);=0A=
--- sys/dev/mfi/mfi_tbolt.c.orig	2012-11-07 23:00:24.542124476 +0000=0A=
+++ sys/dev/mfi/mfi_tbolt.c	2012-11-07 23:01:46.848207655 +0000=0A=
@@ -162,14 +162,14 @@=0A=
 	while (!( HostDiag & DIAG_WRITE_ENABLE)) {=0A=
 		for (i =3D 0; i < 1000; i++);=0A=
 		HostDiag =3D (uint32_t)MFI_READ4(sc, MFI_HDR);=0A=
-		device_printf(sc->mfi_dev, "ADP_RESET_TBOLT: retry time=3D%x, "=0A=
-		    "hostdiag=3D%x\n", retry, HostDiag);=0A=
+		device_printf(sc->mfi_dev, "ADP_RESET_TBOLT: retry time=3D%d, "=0A=
+		    "hostdiag=3D%#x\n", retry, HostDiag);=0A=
 =0A=
 		if (retry++ >=3D 100)=0A=
 			return 1;=0A=
 	}=0A=
 =0A=
-	device_printf(sc->mfi_dev, "ADP_RESET_TBOLT: HostDiag=3D%x\n", =
HostDiag);=0A=
+	device_printf(sc->mfi_dev, "ADP_RESET_TBOLT: HostDiag=3D%#x\n", =
HostDiag);=0A=
 =0A=
 	MFI_WRITE4(sc, MFI_HDR, (HostDiag | DIAG_RESET_ADAPTER));=0A=
 =0A=
@@ -181,8 +181,8 @@=0A=
 	while (HostDiag & DIAG_RESET_ADAPTER) {=0A=
 		for (i =3D 0; i < 1000; i++) ;=0A=
 		HostDiag =3D (uint32_t)MFI_READ4(sc, MFI_RSR);=0A=
-		device_printf(sc->mfi_dev, "ADP_RESET_TBOLT: retry time=3D%x, "=0A=
-		    "hostdiag=3D%x\n", retry, HostDiag);=0A=
+		device_printf(sc->mfi_dev, "ADP_RESET_TBOLT: retry time=3D%d, "=0A=
+		    "hostdiag=3D%#x\n", retry, HostDiag);=0A=
 =0A=
 		if (retry++ >=3D 1000)=0A=
 			return 1;=0A=
@@ -734,7 +734,8 @@=0A=
 =0A=
 	mtx_assert(&sc->mfi_io_lock, MA_OWNED);=0A=
 =0A=
-	cmd =3D TAILQ_FIRST(&sc->mfi_cmd_tbolt_tqh);=0A=
+	if ((cmd =3D TAILQ_FIRST(&sc->mfi_cmd_tbolt_tqh)) =3D=3D NULL)=0A=
+		return NULL;=0A=
 	TAILQ_REMOVE(&sc->mfi_cmd_tbolt_tqh, cmd, next);=0A=
 	memset((uint8_t *)cmd->sg_frame, 0, MEGASAS_MAX_SZ_CHAIN_FRAME);=0A=
 	memset((uint8_t *)cmd->io_request, 0,=0A=
@@ -1119,9 +1120,9 @@=0A=
 		 * should be performed on the controller=0A=
 		 */=0A=
 		if (cm->retry_for_fw_reset =3D=3D 3) {=0A=
-			device_printf(sc->mfi_dev, "megaraid_sas: command %d "=0A=
-			    "was tried multiple times during adapter reset"=0A=
-			    "Shutting down the HBA\n", cm->cm_index);=0A=
+			device_printf(sc->mfi_dev, "megaraid_sas: command %p "=0A=
+			    "index=3D%d was tried multiple times during adapter "=0A=
+			    "reset - Shutting down the HBA\n", cm, cm->cm_index);=0A=
 			mfi_kill_hba(sc);=0A=
 			sc->hw_crit_error =3D 1;=0A=
 			return;=0A=
@@ -1137,8 +1138,8 @@=0A=
 				if (cm->cm_frame->dcmd.opcode !=3D=0A=
 				    MFI_DCMD_CTRL_EVENT_WAIT) {=0A=
 					device_printf(sc->mfi_dev,=0A=
-					    "APJ ****requeue command %d \n",=0A=
-					    cm->cm_index);=0A=
+					    "APJ ****requeue command %p "=0A=
+					    "index=3D%d\n", cm, cm->cm_index);=0A=
 					mfi_requeue_ready(cm);=0A=
 				}=0A=
 			}=0A=
@@ -1357,6 +1358,8 @@=0A=
 		device_printf(sc->mfi_dev, "failed to send map sync\n");=0A=
 		free(ld_sync, M_MFIBUF);=0A=
 		sc->mfi_map_sync_cm =3D NULL;=0A=
+		if ((cmd->cm_flags & MFI_ON_MFIQ_BUSY) !=3D 0)=0A=
+			mfi_remove_busy(cmd);=0A=
 		mfi_requeue_ready(cmd);=0A=
 		goto out;=0A=
 	}=0A=

------=_NextPart_000_0112_01CDBD48.EFF94C40--