Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 9 Nov 2012 17:06:03 -0000
From:      "Steven Hartland" <killing@multiplay.co.uk>
To:        "Steven Hartland" <killing@multiplay.co.uk>, "Doug Ambrisko" <ambrisko@ambrisko.com>
Cc:        freebsd-scsi@freebsd.org, freebsd-stable@freebsd.org
Subject:   Re: mfi panic on recused on non-recusive mutex MFI I/O lock
Message-ID:  <F46B51033DB84937AEEC8F4A95211DAB@multiplay.co.uk>
References:  <2DC1C56CFFF24FE0B17C34AD21A7DFAA@multiplay.co.uk> <39D16C43C8274CE9B8F23C18459E2FD4@multiplay.co.uk> <20121105212911.GA17904@ambrisko.com> <27169C7FE704495087A093752D15E7B6@multiplay.co.uk> <20121106180152.GA40422@ambrisko.com> <6B5B65F4FC854EB8BBC701500096602E@multiplay.co.uk> <0B4E8AFF9DA04C6EBD2496A8B58F1D67@multiplay.co.uk>

next in thread | previous in thread | raw e-mail | index | archive | help
This is a multi-part message in MIME format.

------=_NextPart_000_0276_01CDBE9C.80113780
Content-Type: text/plain;
	format=flowed;
	charset="iso-8859-1";
	reply-type=original
Content-Transfer-Encoding: 7bit


----- Original Message ----- 
From: "Steven Hartland"
...
> I've just had another panic, trace below, but it doesn't seem to be related
> to my changes so I'd appreciate your feedback on them as they are for now.
> 
> While the lock patch fixes the problems I've seen, its not clear to me
> why mfi_tbolt_reset is acquiring the lock and hence requiring
> mfi_process_fw_state_chg_isr to jump through hoops to ensure locking
> around queue manipulation is done correctly. Given what its doing
> (resetting the entire adapter) I wouldn't be surprised if it should
> really be acquiring the config lock.
> 
> Other things I've noticed / questions
> * Should mfi_abort sleep even if its call to mfi_mapcmd fails?
> * Should mfi_get_controller_info really ignore the error from mfi_mapcmd?
> * Do these controllers not support none 512 byte requests? Currently
> all syspd requests are done assuming 512 byte sectors which the disk may
> not be. This will both reduce performance or potentially break totally
> if the firmware isn't translating it under the surface correctly.
> 
> Anyway the new panic manually transcribed is:-
> panic: Bad linx elm 0xffffff0069b0fc0 next->prev != elm
> ...
> mfi_tbolt_get_cmd()
> mfi_build_mpt_pass_thru()
> mfi_tbolt_build_mpt_cmd()
> mfi_tbolt_send_frame()
> bus_dmamap_load()
> mfi_mapcmd()
> mfi_startio()
> mfi_syspd_strategy()
> g_disk_start()
> g_io_schedule_down()
> g_down_proc_body()
> fork_exit()
> fork_trampoline()
> 
> Looks like mfi_cmd_tbolt_tqh has become corrupt some how, but as far as I
> can tell all manip is done using the TAILQ macros and under mfi_io_lock
> so its not obvious to me at this time why this is, any ideas?

I've gone through looking for the possible cause of this and while there's
nothing directly connected to the manip of this queue I've found and fixed
quite a large number of additional problems which may have been indirectly
causing this problem.

The biggest change is to use mfi_max_cmds to limit the value stored in
sc->mfi_max_fw_cmds as this is used extensively throughout the driver
for allocation and range checks so having this inconsitently set opened up
a large number of possible overrun errors.

The new patch attached documents all the changes in detail.

I've managed to do one test run so far which failed to reproduce any panics,
so definitely moving in the right direction :)

The machine has now been collected for repair by the supplier but I'm going
to try and get them to put it online for more testing over the weekend.

Given the failure rate so far if I can do another 4 runs with no panics I'd
be happy that the majority of error conditions are working as expected.

    Regards
    Steve

================================================
This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. 

In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337
or return the E.mail to postmaster@multiplay.co.uk.
------=_NextPart_000_0276_01CDBE9C.80113780
Content-Type: application/octet-stream;
	name="zz-mfi-queue.patch"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;
	filename="zz-mfi-queue.patch"

Fixes queuing issues where mfi_release_command blindly sets the cm_flags =
=3D 0=0A=
without first removing the command from the relavent queue.=0A=
=0A=
This was causing panics in the queue functions which check to ensure a =
command=0A=
is not on another queue.=0A=
=0A=
Fixed some cases where the error from mfi_mapcmd was lost and where the =
command=0A=
was never released / dequeued in error cases.=0A=
=0A=
Ensure that all failures to mfi_mapcmd are logged.=0A=
=0A=
Fixed possible null pointer exception in mfi_aen_setup if =
mfi_get_log_state=0A=
failed.=0A=
=0A=
Fixed mfi_parse_entries & mfi_aen_setup not returning possible errors.=0A=
=0A=
Corrected MFI_DUMP_CMDS calls with invalid vars SC vs sc.=0A=
=0A=
Commands which have timed out now set cm_error to ETIMEDOUT and call=0A=
mfi_complete which prevents them getting stuck in the busy queue forever.=0A=
=0A=
Fixed possible use of NULL pointer in mfi_tbolt_get_cmd.=0A=
=0A=
Changed output formats to be more easily recognisable when debugging.=0A=
=0A=
Optimised mfi_cmd_pool_tbolt cleanup.=0A=
=0A=
Made information about driver limiting commands always display as for =
modern=0A=
cards this can be severe.=0A=
=0A=
Fixed mfi_tbolt_alloc_cmd out of memory case which previously didnt =
return an=0A=
error.=0A=
=0A=
Added malloc checks for request_desc_pool including free when subsiquent =
errors=0A=
are detected.=0A=
=0A=
Fixed overflow error in SIMD reply descriptor check.=0A=
=0A=
Fixed tbolt_cmd leak in mfi_build_and_issue_cmd if there's an error =
during IO=0A=
build.=0A=
=0A=
Elimintated double checks on sc->mfi_aen_cm & sc->mfi_map_sync_cm in=0A=
mfi_shutdown.=0A=
=0A=
Move local hdr calculation after error check in mfi_aen_complete.=0A=
=0A=
Fixed wakeup on NULL in mfi_aen_complete.=0A=
=0A=
Fixed mfi_aen_cm cleanup in mfi_process_fw_state_chg_isr not checking if =
it was=0A=
NULL.=0A=
=0A=
Changed mfi_alloc_commands to error if bus_dmamap_create fails. =
Previously we=0A=
would try to continue with the number of allocated commands but lots of =
places=0A=
in the driver assume sc->mfi_max_fw_cmds is whats available so its =
unsafe to do=0A=
this without lots of changes.=0A=
=0A=
Removed mfi_total_cmds as its no longer used due the above change.=0A=
=0A=
Corrected mfi_tbolt_alloc_cmd to return ENOMEM where appropriate.=0A=
=0A=
Fixed timeouts actually firing at double what they should.=0A=
=0A=
Setting hw.mfi.max_cmds=3D-1 now configures to use the controller max.=0A=
=0A=
A few style (9) fixes e.g. braced single line conditions and double =
blank lines=0A=
--- sys/dev/mfi/mfi.c.orig	2012-11-07 23:00:24.540112877 +0000=0A=
+++ sys/dev/mfi/mfi.c	2012-11-09 14:10:12.288110349 +0000=0A=
@@ -143,7 +143,7 @@=0A=
 static int	mfi_max_cmds =3D 128;=0A=
 TUNABLE_INT("hw.mfi.max_cmds", &mfi_max_cmds);=0A=
 SYSCTL_INT(_hw_mfi, OID_AUTO, max_cmds, CTLFLAG_RD, &mfi_max_cmds,=0A=
-	   0, "Max commands");=0A=
+	   0, "Max commands limit (-1 =3D controller limit)");=0A=
 =0A=
 static int	mfi_detect_jbod_change =3D 1;=0A=
 TUNABLE_INT("hw.mfi.detect_jbod_change", &mfi_detect_jbod_change);=0A=
@@ -366,7 +366,7 @@=0A=
 {=0A=
 	uint32_t status;=0A=
 	int error, commsz, framessz, sensesz;=0A=
-	int frames, unit, max_fw_sge;=0A=
+	int frames, unit, max_fw_sge, max_fw_cmds;=0A=
 	uint32_t tb_mem_size =3D 0;=0A=
 =0A=
 	if (sc =3D=3D NULL)=0A=
@@ -461,7 +461,13 @@=0A=
 	 * instead of compile time.=0A=
 	 */=0A=
 	status =3D sc->mfi_read_fw_status(sc);=0A=
-	sc->mfi_max_fw_cmds =3D status & MFI_FWSTATE_MAXCMD_MASK;=0A=
+	max_fw_cmds =3D status & MFI_FWSTATE_MAXCMD_MASK;=0A=
+	if (mfi_max_cmds > 0 && mfi_max_cmds < max_fw_cmds) {=0A=
+		device_printf(sc->mfi_dev, "FW MaxCmds =3D %d, limiting to %d\n",=0A=
+		    max_fw_cmds, mfi_max_cmds);=0A=
+		    sc->mfi_max_fw_cmds =3D mfi_max_cmds;=0A=
+	} else=0A=
+		sc->mfi_max_fw_cmds =3D max_fw_cmds;=0A=
 	max_fw_sge =3D (status & MFI_FWSTATE_MAXSGL_MASK) >> 16;=0A=
 	sc->mfi_max_sge =3D min(max_fw_sge, ((MFI_MAXPHYS / PAGE_SIZE) + 1));=0A=
 =0A=
@@ -469,7 +475,8 @@=0A=
 =0A=
 	if (sc->mfi_flags & MFI_FLAGS_TBOLT) {=0A=
 		mfi_tbolt_init_globals(sc);=0A=
-		device_printf(sc->mfi_dev, "MaxCmd =3D %x MaxSgl =3D %x state =3D %x =
\n",=0A=
+		device_printf(sc->mfi_dev, "MaxCmd =3D %d, Drv MaxCmd =3D %d, "=0A=
+		    "MaxSgl =3D %d, state =3D %#x\n", max_fw_cmds,=0A=
 		    sc->mfi_max_fw_cmds, sc->mfi_max_sge, status);=0A=
 		tb_mem_size =3D mfi_tbolt_get_memory_requirement(sc);=0A=
 =0A=
@@ -779,21 +786,16 @@=0A=
 mfi_alloc_commands(struct mfi_softc *sc)=0A=
 {=0A=
 	struct mfi_command *cm;=0A=
-	int i, ncmds;=0A=
+	int i, j;=0A=
 =0A=
 	/*=0A=
 	 * XXX Should we allocate all the commands up front, or allocate on=0A=
 	 * demand later like 'aac' does?=0A=
 	 */=0A=
-	ncmds =3D MIN(mfi_max_cmds, sc->mfi_max_fw_cmds);=0A=
-	if (bootverbose)=0A=
-		device_printf(sc->mfi_dev, "Max fw cmds=3D %d, sizing driver "=0A=
-		   "pool to %d\n", sc->mfi_max_fw_cmds, ncmds);=0A=
-=0A=
-	sc->mfi_commands =3D malloc(sizeof(struct mfi_command) * ncmds, =
M_MFIBUF,=0A=
-	    M_WAITOK | M_ZERO);=0A=
+	sc->mfi_commands =3D malloc(sizeof(struct mfi_command) *=0A=
+	    sc->mfi_max_fw_cmds, M_MFIBUF, M_WAITOK | M_ZERO);=0A=
 =0A=
-	for (i =3D 0; i < ncmds; i++) {=0A=
+	for (i =3D 0; i < sc->mfi_max_fw_cmds; i++) {=0A=
 		cm =3D &sc->mfi_commands[i];=0A=
 		cm->cm_frame =3D (union mfi_frame *)((uintptr_t)sc->mfi_frames +=0A=
 		    sc->mfi_cmd_size * i);=0A=
@@ -809,10 +811,20 @@=0A=
 			mtx_lock(&sc->mfi_io_lock);=0A=
 			mfi_release_command(cm);=0A=
 			mtx_unlock(&sc->mfi_io_lock);=0A=
+		} else {=0A=
+			device_printf(sc->mfi_dev, "Failed to allocate %d "=0A=
+			   "command blocks, only allocated %d\n",=0A=
+			    sc->mfi_max_fw_cmds, i - 1);=0A=
+			for (j =3D 0; j < i; j++) {=0A=
+				cm =3D &sc->mfi_commands[i];=0A=
+				bus_dmamap_destroy(sc->mfi_buffer_dmat,=0A=
+				    cm->cm_dmamap);=0A=
+			}=0A=
+			free(sc->mfi_commands, M_MFIBUF);=0A=
+			sc->mfi_commands =3D NULL;=0A=
+				=0A=
+			return (ENOMEM);=0A=
 		}=0A=
-		else=0A=
-			break;=0A=
-		sc->mfi_total_cmds++;=0A=
 	}=0A=
 =0A=
 	return (0);=0A=
@@ -837,6 +849,23 @@=0A=
 		cm->cm_sg->sg32[0].addr =3D 0;=0A=
 	}=0A=
 =0A=
+	/*=0A=
+	 * Command may be on other queues e.g. busy queue depending on the=0A=
+	 * flow of a previous call to mfi_mapcmd, so ensure its dequeued=0A=
+	 * properly=0A=
+	 */=0A=
+	if ((cm->cm_flags & MFI_ON_MFIQ_BUSY) !=3D 0)=0A=
+		mfi_remove_busy(cm);=0A=
+	if ((cm->cm_flags & MFI_ON_MFIQ_READY) !=3D 0)=0A=
+		mfi_remove_ready(cm);=0A=
+=0A=
+	/* We're not expecting it to be on any other queue but check */=0A=
+	if ((cm->cm_flags & MFI_ON_MFIQ_MASK) !=3D 0) {=0A=
+		printf("command %p is still on another queue, flags =3D %#x\n",=0A=
+		    cm, cm->cm_flags);=0A=
+		panic("command is still on a queue");=0A=
+	}=0A=
+=0A=
 	hdr_data =3D (uint32_t *)cm->cm_frame;=0A=
 	hdr_data[0] =3D 0;	/* cmd, sense_len, cmd_status, scsi_status */=0A=
 	hdr_data[1] =3D 0;	/* target_id, lun_id, cdb_len, sg_count */=0A=
@@ -950,15 +979,12 @@=0A=
 	cm->cm_data =3D NULL;=0A=
 	cm->cm_flags =3D MFI_CMD_POLLED;=0A=
 =0A=
-	if ((error =3D mfi_mapcmd(sc, cm)) !=3D 0) {=0A=
+	if ((error =3D mfi_mapcmd(sc, cm)) !=3D 0)=0A=
 		device_printf(sc->mfi_dev, "failed to send init command\n");=0A=
-		mtx_unlock(&sc->mfi_io_lock);=0A=
-		return (error);=0A=
-	}=0A=
 	mfi_release_command(cm);=0A=
 	mtx_unlock(&sc->mfi_io_lock);=0A=
 =0A=
-	return (0);=0A=
+	return (error);=0A=
 }=0A=
 =0A=
 static int=0A=
@@ -1046,27 +1072,26 @@=0A=
 	class_locale.members.evt_class  =3D mfi_event_class;=0A=
 =0A=
 	if (seq_start =3D=3D 0) {=0A=
-		error =3D mfi_get_log_state(sc, &log_state);=0A=
+		if((error =3D mfi_get_log_state(sc, &log_state)) !=3D 0)=0A=
+			goto out;=0A=
 		sc->mfi_boot_seq_num =3D log_state->boot_seq_num;=0A=
-		if (error) {=0A=
-			if (log_state)=0A=
-				free(log_state, M_MFIBUF);=0A=
-			return (error);=0A=
-		}=0A=
 =0A=
 		/*=0A=
 		 * Walk through any events that fired since the last=0A=
 		 * shutdown.=0A=
 		 */=0A=
-		mfi_parse_entries(sc, log_state->shutdown_seq_num,=0A=
-		    log_state->newest_seq_num);=0A=
+		if((error =3D mfi_parse_entries(sc, log_state->shutdown_seq_num,=0A=
+		    log_state->newest_seq_num)) !=3D 0)=0A=
+			goto out;=0A=
 		seq =3D log_state->newest_seq_num;=0A=
 	} else=0A=
 		seq =3D seq_start;=0A=
-	mfi_aen_register(sc, seq, class_locale.word);=0A=
-	free(log_state, M_MFIBUF);=0A=
+	error =3D mfi_aen_register(sc, seq, class_locale.word);=0A=
+out:=0A=
+	if (log_state)=0A=
+		free(log_state, M_MFIBUF);=0A=
 =0A=
-	return 0;=0A=
+	return (error);=0A=
 }=0A=
 =0A=
 int=0A=
@@ -1076,7 +1101,6 @@=0A=
 	mtx_assert(&sc->mfi_io_lock, MA_OWNED);=0A=
 	cm->cm_complete =3D NULL;=0A=
 =0A=
-=0A=
 	/*=0A=
 	 * MegaCli can issue a DCMD of 0.  In this case do nothing=0A=
 	 * and return 0 to it as status=0A=
@@ -1104,12 +1128,13 @@=0A=
 	if (sc->mfi_cdev !=3D NULL)=0A=
 		destroy_dev(sc->mfi_cdev);=0A=
 =0A=
-	if (sc->mfi_total_cmds !=3D 0) {=0A=
-		for (i =3D 0; i < sc->mfi_total_cmds; i++) {=0A=
+	if (sc->mfi_commands !=3D NULL) {=0A=
+		for (i =3D 0; i < sc->mfi_max_fw_cmds; i++) {=0A=
 			cm =3D &sc->mfi_commands[i];=0A=
 			bus_dmamap_destroy(sc->mfi_buffer_dmat, cm->cm_dmamap);=0A=
 		}=0A=
 		free(sc->mfi_commands, M_MFIBUF);=0A=
+		sc->mfi_commands =3D NULL;=0A=
 	}=0A=
 =0A=
 	if (sc->mfi_intr)=0A=
@@ -1165,7 +1190,8 @@=0A=
 		/* End LSIP200113393 */=0A=
 		/* ThunderBolt INIT packet memory Free */=0A=
 		if (sc->mfi_tb_init_busaddr !=3D 0)=0A=
-			bus_dmamap_unload(sc->mfi_tb_init_dmat, sc->mfi_tb_init_dmamap);=0A=
+			bus_dmamap_unload(sc->mfi_tb_init_dmat,=0A=
+			    sc->mfi_tb_init_dmamap);=0A=
 		if (sc->mfi_tb_init !=3D NULL)=0A=
 			bus_dmamem_free(sc->mfi_tb_init_dmat, sc->mfi_tb_init,=0A=
 			    sc->mfi_tb_init_dmamap);=0A=
@@ -1182,16 +1208,14 @@=0A=
 			    sc->mfi_tb_ioc_init_dmamap);=0A=
 		if (sc->mfi_tb_ioc_init_dmat !=3D NULL)=0A=
 			bus_dma_tag_destroy(sc->mfi_tb_ioc_init_dmat);=0A=
-		for (int i =3D 0; i < sc->mfi_max_fw_cmds; i++) {=0A=
-			if (sc->mfi_cmd_pool_tbolt !=3D NULL) {=0A=
+		if (sc->mfi_cmd_pool_tbolt !=3D NULL) {=0A=
+			for (int i =3D 0; i < sc->mfi_max_fw_cmds; i++) {=0A=
 				if (sc->mfi_cmd_pool_tbolt[i] !=3D NULL) {=0A=
 					free(sc->mfi_cmd_pool_tbolt[i],=0A=
 					    M_MFIBUF);=0A=
 					sc->mfi_cmd_pool_tbolt[i] =3D NULL;=0A=
 				}=0A=
 			}=0A=
-		}=0A=
-		if (sc->mfi_cmd_pool_tbolt !=3D NULL) {=0A=
 			free(sc->mfi_cmd_pool_tbolt, M_MFIBUF);=0A=
 			sc->mfi_cmd_pool_tbolt =3D NULL;=0A=
 		}=0A=
@@ -1256,9 +1280,8 @@=0A=
 			cm->cm_error =3D 0;=0A=
 			mfi_complete(sc, cm);=0A=
 		}=0A=
-		if (++ci =3D=3D (sc->mfi_max_fw_cmds + 1)) {=0A=
+		if (++ci =3D=3D (sc->mfi_max_fw_cmds + 1))=0A=
 			ci =3D 0;=0A=
-		}=0A=
 	}=0A=
 =0A=
 	sc->mfi_comms->hw_ci =3D ci;=0A=
@@ -1288,15 +1311,15 @@=0A=
 	int error;=0A=
 =0A=
 =0A=
-	if (sc->mfi_aen_cm)=0A=
+	if (sc->mfi_aen_cm !=3D NULL) {=0A=
 		sc->cm_aen_abort =3D 1;=0A=
-	if (sc->mfi_aen_cm !=3D NULL)=0A=
 		mfi_abort(sc, &sc->mfi_aen_cm);=0A=
+	}=0A=
 =0A=
-	if (sc->mfi_map_sync_cm)=0A=
+	if (sc->mfi_map_sync_cm !=3D NULL) {=0A=
 		sc->cm_map_abort =3D 1;=0A=
-	if (sc->mfi_map_sync_cm !=3D NULL)=0A=
 		mfi_abort(sc, &sc->mfi_map_sync_cm);=0A=
+	}=0A=
 =0A=
 	mtx_lock(&sc->mfi_io_lock);=0A=
 	error =3D mfi_dcmd_command(sc, &cm, MFI_DCMD_CTRL_SHUTDOWN, NULL, 0);=0A=
@@ -1310,9 +1333,8 @@=0A=
 	cm->cm_flags =3D MFI_CMD_POLLED;=0A=
 	cm->cm_data =3D NULL;=0A=
 =0A=
-	if ((error =3D mfi_mapcmd(sc, cm)) !=3D 0) {=0A=
+	if ((error =3D mfi_mapcmd(sc, cm)) !=3D 0)=0A=
 		device_printf(sc->mfi_dev, "Failed to shutdown controller\n");=0A=
-	}=0A=
 =0A=
 	mfi_release_command(cm);=0A=
 	mtx_unlock(&sc->mfi_io_lock);=0A=
@@ -1683,11 +1705,11 @@=0A=
 	sc =3D cm->cm_sc;=0A=
 	mtx_assert(&sc->mfi_io_lock, MA_OWNED);=0A=
 =0A=
-	hdr =3D &cm->cm_frame->header;=0A=
-=0A=
 	if (sc->mfi_aen_cm =3D=3D NULL)=0A=
 		return;=0A=
 =0A=
+	hdr =3D &cm->cm_frame->header;=0A=
+=0A=
 	if (sc->cm_aen_abort ||=0A=
 	    hdr->cmd_status =3D=3D MFI_STAT_INVALID_STATUS) {=0A=
 		sc->cm_aen_abort =3D 0;=0A=
@@ -1713,8 +1735,8 @@=0A=
 	}=0A=
 =0A=
 	free(cm->cm_data, M_MFIBUF);=0A=
-	sc->mfi_aen_cm =3D NULL;=0A=
 	wakeup(&sc->mfi_aen_cm);=0A=
+	sc->mfi_aen_cm =3D NULL;=0A=
 	mfi_release_command(cm);=0A=
 =0A=
 	/* set it up again so the driver can catch more events */=0A=
@@ -1796,6 +1818,7 @@=0A=
 			mtx_lock(&sc->mfi_io_lock);=0A=
 			mfi_release_command(cm);=0A=
 			mtx_unlock(&sc->mfi_io_lock);=0A=
+			error =3D EIO;=0A=
 			break;=0A=
 		}=0A=
 		mtx_lock(&sc->mfi_io_lock);=0A=
@@ -1824,7 +1847,7 @@=0A=
 	}=0A=
 =0A=
 	free(el, M_MFIBUF);=0A=
-	return (0);=0A=
+	return (error);=0A=
 }=0A=
 =0A=
 static int=0A=
@@ -1941,11 +1964,12 @@=0A=
 	dcmd->mbox[0]=3Did;=0A=
 	dcmd->header.scsi_status =3D 0;=0A=
 	dcmd->header.pad0 =3D 0;=0A=
-	if (mfi_mapcmd(sc, cm) !=3D 0) {=0A=
+	if ((error =3D mfi_mapcmd(sc, cm)) !=3D 0) {=0A=
 		device_printf(sc->mfi_dev,=0A=
 		    "Failed to get physical drive info %d\n", id);=0A=
 		free(pd_info, M_MFIBUF);=0A=
-		return (0);=0A=
+		mfi_release_command(cm);=0A=
+		return (error);=0A=
 	}=0A=
 	bus_dmamap_sync(sc->mfi_buffer_dmat, cm->cm_dmamap,=0A=
 	    BUS_DMASYNC_POSTREAD);=0A=
@@ -2211,11 +2235,14 @@=0A=
 	if ((hdr->cmd_status !=3D MFI_STAT_OK) || (hdr->scsi_status !=3D 0)) {=0A=
 		bio->bio_flags |=3D BIO_ERROR;=0A=
 		bio->bio_error =3D EIO;=0A=
-		device_printf(sc->mfi_dev, "I/O error, status=3D %d "=0A=
-		    "scsi_status=3D %d\n", hdr->cmd_status, hdr->scsi_status);=0A=
+		device_printf(sc->mfi_dev, "I/O error, status=3D%#x "=0A=
+		    "scsi_status=3D%#x\n", hdr->cmd_status, hdr->scsi_status);=0A=
 		mfi_print_sense(cm->cm_sc, cm->cm_sense);=0A=
 	} else if (cm->cm_error !=3D 0) {=0A=
 		bio->bio_flags |=3D BIO_ERROR;=0A=
+		bio->bio_error =3D cm->cm_error;=0A=
+		device_printf(sc->mfi_dev, "I/O error, error=3D%#x\n",=0A=
+		    cm->cm_error);=0A=
 	}=0A=
 =0A=
 	mfi_release_command(cm);=0A=
@@ -2251,6 +2278,9 @@=0A=
 =0A=
 		/* Send the command to the controller */=0A=
 		if (mfi_mapcmd(sc, cm) !=3D 0) {=0A=
+			device_printf(sc->mfi_dev, "Failed to startio\n");=0A=
+			if ((cm->cm_flags & MFI_ON_MFIQ_BUSY) !=3D 0)=0A=
+				mfi_remove_busy(cm);=0A=
 			mfi_requeue_ready(cm);=0A=
 			break;=0A=
 		}=0A=
@@ -2374,7 +2404,7 @@=0A=
 	cm->cm_extra_frames =3D (cm->cm_total_frame_size - 1) / MFI_FRAME_SIZE;=0A=
 =0A=
 	if (sc->MFA_enabled)=0A=
-			mfi_tbolt_send_frame(sc, cm);=0A=
+		mfi_tbolt_send_frame(sc, cm);=0A=
 	else=0A=
 		mfi_send_frame(sc, cm);=0A=
 =0A=
@@ -2466,7 +2496,7 @@=0A=
 {=0A=
 	struct mfi_command *cm;=0A=
 	struct mfi_abort_frame *abort;=0A=
-	int i =3D 0;=0A=
+	int i =3D 0, error;=0A=
 	uint32_t context =3D 0;=0A=
 =0A=
 	mtx_lock(&sc->mfi_io_lock);=0A=
@@ -2490,7 +2520,8 @@=0A=
 	cm->cm_data =3D NULL;=0A=
 	cm->cm_flags =3D MFI_CMD_POLLED;=0A=
 =0A=
-	mfi_mapcmd(sc, cm);=0A=
+	if ((error =3D mfi_mapcmd(sc, cm)) !=3D 0)=0A=
+		device_printf(sc->mfi_dev, "failed to abort command\n");=0A=
 	mfi_release_command(cm);=0A=
 =0A=
 	mtx_unlock(&sc->mfi_io_lock);=0A=
@@ -2506,7 +2537,7 @@=0A=
 		mtx_unlock(&sc->mfi_io_lock);=0A=
 	}=0A=
 =0A=
-	return (0);=0A=
+	return (error);=0A=
 }=0A=
 =0A=
 int=0A=
@@ -2544,7 +2575,8 @@=0A=
 	cm->cm_total_frame_size =3D MFI_IO_FRAME_SIZE;=0A=
 	cm->cm_flags =3D MFI_CMD_POLLED | MFI_CMD_DATAOUT;=0A=
 =0A=
-	error =3D mfi_mapcmd(sc, cm);=0A=
+	if ((error =3D mfi_mapcmd(sc, cm)) !=3D 0)=0A=
+		device_printf(sc->mfi_dev, "failed dump blocks\n");=0A=
 	bus_dmamap_sync(sc->mfi_buffer_dmat, cm->cm_dmamap,=0A=
 	    BUS_DMASYNC_POSTWRITE);=0A=
 	bus_dmamap_unload(sc->mfi_buffer_dmat, cm->cm_dmamap);=0A=
@@ -2587,7 +2619,8 @@=0A=
 	cm->cm_total_frame_size =3D MFI_PASS_FRAME_SIZE;=0A=
 	cm->cm_flags =3D MFI_CMD_POLLED | MFI_CMD_DATAOUT | MFI_CMD_SCSI;=0A=
 =0A=
-	error =3D mfi_mapcmd(sc, cm);=0A=
+	if ((error =3D mfi_mapcmd(sc, cm)) !=3D 0)=0A=
+		device_printf(sc->mfi_dev, "failed dump blocks\n");=0A=
 	bus_dmamap_sync(sc->mfi_buffer_dmat, cm->cm_dmamap,=0A=
 	    BUS_DMASYNC_POSTWRITE);=0A=
 	bus_dmamap_unload(sc->mfi_buffer_dmat, cm->cm_dmamap);=0A=
@@ -3632,7 +3665,7 @@=0A=
 		deadline =3D time_uptime - mfi_cmd_timeout;=0A=
 		mtx_lock(&sc->mfi_io_lock);=0A=
 		TAILQ_FOREACH(cm, &sc->mfi_busy, cm_link) {=0A=
-			if (cm->cm_timestamp < deadline) {=0A=
+			if (cm->cm_timestamp <=3D deadline) {=0A=
 				device_printf(sc->mfi_dev,=0A=
 				    "COMMAND %p TIMEOUT AFTER %d SECONDS\n",=0A=
 				    cm, (int)(time_uptime - cm->cm_timestamp));=0A=
@@ -3643,7 +3676,7 @@=0A=
 =0A=
 #if 0=0A=
 		if (timedout)=0A=
-			MFI_DUMP_CMDS(SC);=0A=
+			MFI_DUMP_CMDS(sc);=0A=
 #endif=0A=
 =0A=
 		mtx_unlock(&sc->mfi_io_lock);=0A=
@@ -3656,7 +3689,7 @@=0A=
 mfi_timeout(void *data)=0A=
 {=0A=
 	struct mfi_softc *sc =3D (struct mfi_softc *)data;=0A=
-	struct mfi_command *cm;=0A=
+	struct mfi_command *cm, *tmp;=0A=
 	time_t deadline;=0A=
 	int timedout =3D 0;=0A=
 =0A=
@@ -3669,10 +3702,10 @@=0A=
 		}=0A=
 	}=0A=
 	mtx_lock(&sc->mfi_io_lock);=0A=
-	TAILQ_FOREACH(cm, &sc->mfi_busy, cm_link) {=0A=
+	TAILQ_FOREACH_SAFE(cm, &sc->mfi_busy, cm_link, tmp) {=0A=
 		if (sc->mfi_aen_cm =3D=3D cm || sc->mfi_map_sync_cm =3D=3D cm)=0A=
 			continue;=0A=
-		if (cm->cm_timestamp < deadline) {=0A=
+		if (cm->cm_timestamp <=3D deadline) {=0A=
 			if (sc->adpreset !=3D 0 && sc->issuepend_done =3D=3D 0) {=0A=
 				cm->cm_timestamp =3D time_uptime;=0A=
 			} else {=0A=
@@ -3682,6 +3715,13 @@=0A=
 				     );=0A=
 				MFI_PRINT_CMD(cm);=0A=
 				MFI_VALIDATE_CMD(sc, cm);=0A=
+				/*=0A=
+				 * Fail the command instead of leaving it on=0A=
+				 * the queue where it could remain stuck forever=0A=
+				 */=0A=
+				mfi_remove_busy(cm);=0A=
+				cm->cm_error =3D ETIMEDOUT;=0A=
+				mfi_complete(sc, cm);=0A=
 				timedout++;=0A=
 			}=0A=
 		}=0A=
@@ -3689,7 +3729,7 @@=0A=
 =0A=
 #if 0=0A=
 	if (timedout)=0A=
-		MFI_DUMP_CMDS(SC);=0A=
+		MFI_DUMP_CMDS(sc);=0A=
 #endif=0A=
 =0A=
 	mtx_unlock(&sc->mfi_io_lock);=0A=
--- sys/dev/mfi/mfi_tbolt.c.orig	2012-11-07 23:00:24.542124476 +0000=0A=
+++ sys/dev/mfi/mfi_tbolt.c	2012-11-09 14:16:13.036767564 +0000=0A=
@@ -162,14 +162,14 @@=0A=
 	while (!( HostDiag & DIAG_WRITE_ENABLE)) {=0A=
 		for (i =3D 0; i < 1000; i++);=0A=
 		HostDiag =3D (uint32_t)MFI_READ4(sc, MFI_HDR);=0A=
-		device_printf(sc->mfi_dev, "ADP_RESET_TBOLT: retry time=3D%x, "=0A=
-		    "hostdiag=3D%x\n", retry, HostDiag);=0A=
+		device_printf(sc->mfi_dev, "ADP_RESET_TBOLT: retry time=3D%d, "=0A=
+		    "hostdiag=3D%#x\n", retry, HostDiag);=0A=
 =0A=
 		if (retry++ >=3D 100)=0A=
 			return 1;=0A=
 	}=0A=
 =0A=
-	device_printf(sc->mfi_dev, "ADP_RESET_TBOLT: HostDiag=3D%x\n", =
HostDiag);=0A=
+	device_printf(sc->mfi_dev, "ADP_RESET_TBOLT: HostDiag=3D%#x\n", =
HostDiag);=0A=
 =0A=
 	MFI_WRITE4(sc, MFI_HDR, (HostDiag | DIAG_RESET_ADAPTER));=0A=
 =0A=
@@ -181,8 +181,8 @@=0A=
 	while (HostDiag & DIAG_RESET_ADAPTER) {=0A=
 		for (i =3D 0; i < 1000; i++) ;=0A=
 		HostDiag =3D (uint32_t)MFI_READ4(sc, MFI_RSR);=0A=
-		device_printf(sc->mfi_dev, "ADP_RESET_TBOLT: retry time=3D%x, "=0A=
-		    "hostdiag=3D%x\n", retry, HostDiag);=0A=
+		device_printf(sc->mfi_dev, "ADP_RESET_TBOLT: retry time=3D%d, "=0A=
+		    "hostdiag=3D%#x\n", retry, HostDiag);=0A=
 =0A=
 		if (retry++ >=3D 1000)=0A=
 			return 1;=0A=
@@ -447,13 +447,21 @@=0A=
 	sc->request_desc_pool =3D malloc(sizeof(=0A=
 	    union mfi_mpi2_request_descriptor) * sc->mfi_max_fw_cmds,=0A=
 	    M_MFIBUF, M_NOWAIT|M_ZERO);=0A=
+=0A=
+	if (sc->request_desc_pool =3D=3D NULL) {=0A=
+		device_printf(sc->mfi_dev, "Could not alloc "=0A=
+		    "memory for request_desc_pool\n");=0A=
+		return (ENOMEM);=0A=
+	}=0A=
+=0A=
 	sc->mfi_cmd_pool_tbolt =3D malloc(sizeof(struct mfi_cmd_tbolt*)=0A=
 	    * sc->mfi_max_fw_cmds, M_MFIBUF, M_NOWAIT|M_ZERO);=0A=
 =0A=
-	if (!sc->mfi_cmd_pool_tbolt) {=0A=
-		device_printf(sc->mfi_dev, "out of memory. Could not alloc "=0A=
-		    "memory for cmd_list_fusion\n");=0A=
-		return 1;=0A=
+	if (sc->mfi_cmd_pool_tbolt =3D=3D NULL) {=0A=
+		free(sc->request_desc_pool, M_MFIBUF);=0A=
+		device_printf(sc->mfi_dev, "Could not alloc "=0A=
+		    "memory for cmd_pool_tbolt\n");=0A=
+		return (ENOMEM);=0A=
 	}=0A=
 =0A=
 	for (i =3D 0; i < sc->mfi_max_fw_cmds; i++) {=0A=
@@ -461,20 +469,24 @@=0A=
 		    struct mfi_cmd_tbolt),M_MFIBUF, M_NOWAIT|M_ZERO);=0A=
 =0A=
 		if (!sc->mfi_cmd_pool_tbolt[i]) {=0A=
-			device_printf(sc->mfi_dev, "Could not alloc cmd list "=0A=
-			    "fusion\n");=0A=
+			device_printf(sc->mfi_dev, "Could not alloc "=0A=
+			    "cmd_pool_tbolt entry\n");=0A=
 =0A=
 			for (j =3D 0; j < i; j++)=0A=
 				free(sc->mfi_cmd_pool_tbolt[j], M_MFIBUF);=0A=
 =0A=
+			free(sc->request_desc_pool, M_MFIBUF);=0A=
+			sc->request_desc_pool =3D NULL;=0A=
 			free(sc->mfi_cmd_pool_tbolt, M_MFIBUF);=0A=
 			sc->mfi_cmd_pool_tbolt =3D NULL;=0A=
+=0A=
+			return (ENOMEM);=0A=
 		}=0A=
 	}=0A=
 =0A=
 	/*=0A=
 	 * The first 256 bytes (SMID 0) is not used. Don't add to the cmd=0A=
-	 *list=0A=
+	 * list=0A=
 	 */=0A=
 	io_req_base =3D sc->request_message_pool_align=0A=
 		+ MEGASAS_THUNDERBOLT_NEW_MSG_SIZE;=0A=
@@ -614,9 +626,16 @@=0A=
 static inline void=0A=
 mfi_tbolt_return_cmd(struct mfi_softc *sc, struct mfi_cmd_tbolt *cmd)=0A=
 {=0A=
+/* TODO: remove this debugging */=0A=
+struct mfi_cmd_tbolt *tbolt_cmd;=0A=
 	mtx_assert(&sc->mfi_io_lock, MA_OWNED);=0A=
 =0A=
 	cmd->sync_cmd_idx =3D sc->mfi_max_fw_cmds;=0A=
+/* TODO: remove this debugging */=0A=
+TAILQ_FOREACH(tbolt_cmd, &sc->mfi_cmd_tbolt_tqh, next) {=0A=
+	if (tbolt_cmd =3D=3D cmd)=0A=
+		panic("returning tbolt cmd=3D%p thats already present", cmd);=0A=
+}=0A=
 	TAILQ_INSERT_TAIL(&sc->mfi_cmd_tbolt_tqh, cmd, next);=0A=
 }=0A=
 =0A=
@@ -652,13 +671,19 @@=0A=
 	/* Read Reply descriptor */=0A=
 	while ((val.u.low !=3D 0xFFFFFFFF) && (val.u.high !=3D 0xFFFFFFFF)) {=0A=
 		smid =3D reply_desc->SMID;=0A=
-		if (!smid || smid > sc->mfi_max_fw_cmds + 1) {=0A=
-			device_printf(sc->mfi_dev, "smid is %x. Cannot "=0A=
+		if (!smid || smid > sc->mfi_max_fw_cmds) {=0A=
+			device_printf(sc->mfi_dev, "smid is %d. Cannot "=0A=
 			    "proceed. Returning \n", smid);=0A=
 			return;=0A=
 		}=0A=
 =0A=
 		cmd_tbolt =3D sc->mfi_cmd_pool_tbolt[smid - 1];=0A=
+		if (cmd_tbolt->sync_cmd_idx =3D=3D sc->mfi_max_fw_cmds) {=0A=
+			device_printf(sc->mfi_dev, "cmd_tbolt %p "=0A=
+			    "has invalid sync_cmd_idx=3D%d - returning\n",=0A=
+			    cmd_tbolt, cmd_tbolt->sync_cmd_idx);=0A=
+			return;=0A=
+		}=0A=
 		cmd_mfi =3D &sc->mfi_commands[cmd_tbolt->sync_cmd_idx];=0A=
 		scsi_io_req =3D cmd_tbolt->io_request;=0A=
 =0A=
@@ -666,10 +691,9 @@=0A=
 		extStatus =3D cmd_mfi->cm_frame->dcmd.header.scsi_status;=0A=
 		map_tbolt_cmd_status(cmd_mfi, status, extStatus);=0A=
 =0A=
-		if (cmd_mfi->cm_flags & MFI_CMD_SCSI &&=0A=
+		if ((cmd_mfi->cm_flags & MFI_CMD_SCSI) !=3D 0 &&=0A=
 		    (cmd_mfi->cm_flags & MFI_CMD_POLLED) !=3D 0) {=0A=
 			/* polled LD/SYSPD IO command */=0A=
-			mfi_tbolt_return_cmd(sc, cmd_tbolt);=0A=
 			/* XXX mark okay for now DJA */=0A=
 			cmd_mfi->cm_frame->header.cmd_status =3D MFI_STAT_OK;=0A=
 		} else {=0A=
@@ -684,8 +708,8 @@=0A=
 =0A=
 			/* complete the command */=0A=
 			mfi_complete(sc, cmd_mfi);=0A=
-			mfi_tbolt_return_cmd(sc, cmd_tbolt);=0A=
 		}=0A=
+		mfi_tbolt_return_cmd(sc, cmd_tbolt);=0A=
 =0A=
 		sc->last_reply_idx++;=0A=
 		if (sc->last_reply_idx >=3D sc->mfi_max_fw_cmds) {=0A=
@@ -734,7 +758,8 @@=0A=
 =0A=
 	mtx_assert(&sc->mfi_io_lock, MA_OWNED);=0A=
 =0A=
-	cmd =3D TAILQ_FIRST(&sc->mfi_cmd_tbolt_tqh);=0A=
+	if ((cmd =3D TAILQ_FIRST(&sc->mfi_cmd_tbolt_tqh)) =3D=3D NULL)=0A=
+		return NULL;=0A=
 	TAILQ_REMOVE(&sc->mfi_cmd_tbolt_tqh, cmd, next);=0A=
 	memset((uint8_t *)cmd->sg_frame, 0, MEGASAS_MAX_SZ_CHAIN_FRAME);=0A=
 	memset((uint8_t *)cmd->io_request, 0,=0A=
@@ -988,8 +1013,10 @@=0A=
 =0A=
 	index =3D cmd->index;=0A=
 	req_desc =3D mfi_tbolt_get_request_descriptor(sc, index-1);=0A=
-	if (mfi_tbolt_build_io(sc, mfi_cmd, cmd))=0A=
+	if (mfi_tbolt_build_io(sc, mfi_cmd, cmd) !=3D 0) {=0A=
+		mfi_tbolt_return_cmd(sc, cmd);=0A=
 		return NULL;=0A=
+	}=0A=
 	req_desc->header.SMID =3D index;=0A=
 	return req_desc;=0A=
 }=0A=
@@ -1119,9 +1146,9 @@=0A=
 		 * should be performed on the controller=0A=
 		 */=0A=
 		if (cm->retry_for_fw_reset =3D=3D 3) {=0A=
-			device_printf(sc->mfi_dev, "megaraid_sas: command %d "=0A=
-			    "was tried multiple times during adapter reset"=0A=
-			    "Shutting down the HBA\n", cm->cm_index);=0A=
+			device_printf(sc->mfi_dev, "megaraid_sas: command %p "=0A=
+			    "index=3D%d was tried multiple times during adapter "=0A=
+			    "reset - Shutting down the HBA\n", cm, cm->cm_index);=0A=
 			mfi_kill_hba(sc);=0A=
 			sc->hw_crit_error =3D 1;=0A=
 			return;=0A=
@@ -1130,15 +1157,14 @@=0A=
 		if ((cm->cm_flags & MFI_ON_MFIQ_BUSY) !=3D 0) {=0A=
 			struct mfi_cmd_tbolt *cmd;=0A=
 			mfi_remove_busy(cm);=0A=
-			cmd =3D sc->mfi_cmd_pool_tbolt[cm->cm_extra_frames -=0A=
-			    1 ];=0A=
+			cmd =3D sc->mfi_cmd_pool_tbolt[cm->cm_extra_frames - 1];=0A=
 			mfi_tbolt_return_cmd(sc, cmd);=0A=
 			if ((cm->cm_flags & MFI_ON_MFIQ_MASK) =3D=3D 0) {=0A=
 				if (cm->cm_frame->dcmd.opcode !=3D=0A=
 				    MFI_DCMD_CTRL_EVENT_WAIT) {=0A=
 					device_printf(sc->mfi_dev,=0A=
-					    "APJ ****requeue command %d \n",=0A=
-					    cm->cm_index);=0A=
+					    "APJ ****requeue command %p "=0A=
+					    "index=3D%d\n", cm, cm->cm_index);=0A=
 					mfi_requeue_ready(cm);=0A=
 				}=0A=
 			}=0A=
@@ -1196,22 +1222,22 @@=0A=
 		}=0A=
 		mtx_unlock(&sc->mfi_io_lock);=0A=
 		if ((error =3D mfi_tbolt_init_MFI_queue(sc)) !=3D 0)=0A=
-				return;=0A=
+			return;=0A=
 =0A=
 		mtx_lock(&sc->mfi_io_lock);=0A=
 =0A=
 		sc->mfi_enable_intr(sc);=0A=
 		sc->adpreset =3D 0;=0A=
-		free(sc->mfi_aen_cm->cm_data, M_MFIBUF);=0A=
-		mfi_remove_busy(sc->mfi_aen_cm);=0A=
-		cmd =3D sc->mfi_cmd_pool_tbolt[sc->mfi_aen_cm->cm_extra_frames=0A=
-		    - 1];=0A=
-		mfi_tbolt_return_cmd(sc, cmd);=0A=
-		if (sc->mfi_aen_cm) {=0A=
+		if (sc->mfi_aen_cm !=3D NULL) {=0A=
+			free(sc->mfi_aen_cm->cm_data, M_MFIBUF);=0A=
+			mfi_remove_busy(sc->mfi_aen_cm);=0A=
+			cmd =3D sc->mfi_cmd_pool_tbolt[=0A=
+			    sc->mfi_aen_cm->cm_extra_frames - 1];=0A=
+			mfi_tbolt_return_cmd(sc, cmd);=0A=
 			mfi_release_command(sc->mfi_aen_cm);=0A=
 			sc->mfi_aen_cm =3D NULL;=0A=
 		}=0A=
-		if (sc->mfi_map_sync_cm) {=0A=
+		if (sc->mfi_map_sync_cm !=3D NULL) {=0A=
 			mfi_release_command(sc->mfi_map_sync_cm);=0A=
 			sc->mfi_map_sync_cm =3D NULL;=0A=
 		}=0A=
@@ -1357,6 +1383,8 @@=0A=
 		device_printf(sc->mfi_dev, "failed to send map sync\n");=0A=
 		free(ld_sync, M_MFIBUF);=0A=
 		sc->mfi_map_sync_cm =3D NULL;=0A=
+		if ((cmd->cm_flags & MFI_ON_MFIQ_BUSY) !=3D 0)=0A=
+			mfi_remove_busy(cmd);=0A=
 		mfi_requeue_ready(cmd);=0A=
 		goto out;=0A=
 	}=0A=
--- sys/dev/mfi/mfi_debug.c.orig	2012-11-09 00:56:43.485019300 +0000=0A=
+++ sys/dev/mfi/mfi_debug.c	2012-11-09 02:24:35.569778892 +0000=0A=
@@ -271,7 +271,7 @@=0A=
 {=0A=
 	int i;=0A=
 =0A=
-	for (i =3D 0; i < sc->mfi_total_cmds; i++)=0A=
+	for (i =3D 0; i < sc->mfi_max_fw_cmds; i++)=0A=
 		mfi_print_generic_frame(sc, &sc->mfi_commands[i]);=0A=
 }=0A=
 =0A=
--- sys/dev/mfi/mfivar.h.orig	2012-11-09 00:55:55.000000000 +0000=0A=
+++ sys/dev/mfi/mfivar.h	2012-11-09 14:36:01.203796033 +0000=0A=
@@ -267,10 +267,6 @@=0A=
 	 */=0A=
 	struct mfi_command		*mfi_commands;=0A=
 	/*=0A=
-	 * How many commands were actually allocated=0A=
-	 */=0A=
-	int				mfi_total_cmds;=0A=
-	/*=0A=
 	 * How many commands the firmware can handle.  Also how big the reply=0A=
 	 * queue is, minus 1.=0A=
 	 */=0A=

------=_NextPart_000_0276_01CDBE9C.80113780--




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?F46B51033DB84937AEEC8F4A95211DAB>