Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 9 Nov 2012 09:25:08 -0800
From:      Doug Ambrisko <ambrisko@ambrisko.com>
To:        Steven Hartland <killing@multiplay.co.uk>
Cc:        freebsd-scsi@freebsd.org, freebsd-stable@freebsd.org
Subject:   Re: mfi panic on recused on non-recusive mutex MFI I/O lock
Message-ID:  <20121109172508.GA13333@ambrisko.com>
In-Reply-To: <F46B51033DB84937AEEC8F4A95211DAB@multiplay.co.uk>
References:  <2DC1C56CFFF24FE0B17C34AD21A7DFAA@multiplay.co.uk> <39D16C43C8274CE9B8F23C18459E2FD4@multiplay.co.uk> <20121105212911.GA17904@ambrisko.com> <27169C7FE704495087A093752D15E7B6@multiplay.co.uk> <20121106180152.GA40422@ambrisko.com> <6B5B65F4FC854EB8BBC701500096602E@multiplay.co.uk> <0B4E8AFF9DA04C6EBD2496A8B58F1D67@multiplay.co.uk> <F46B51033DB84937AEEC8F4A95211DAB@multiplay.co.uk>

next in thread | previous in thread | raw e-mail | index | archive | help
On Fri, Nov 09, 2012 at 05:06:03PM -0000, Steven Hartland wrote:
| 
| ----- Original Message ----- 
| From: "Steven Hartland"
| ...
| >I've just had another panic, trace below, but it doesn't seem to be related
| >to my changes so I'd appreciate your feedback on them as they are for now.
| >
| >While the lock patch fixes the problems I've seen, its not clear to me
| >why mfi_tbolt_reset is acquiring the lock and hence requiring
| >mfi_process_fw_state_chg_isr to jump through hoops to ensure locking
| >around queue manipulation is done correctly. Given what its doing
| >(resetting the entire adapter) I wouldn't be surprised if it should
| >really be acquiring the config lock.
| >
| >Other things I've noticed / questions
| >* Should mfi_abort sleep even if its call to mfi_mapcmd fails?
| >* Should mfi_get_controller_info really ignore the error from mfi_mapcmd?
| >* Do these controllers not support none 512 byte requests? Currently
| >all syspd requests are done assuming 512 byte sectors which the disk may
| >not be. This will both reduce performance or potentially break totally
| >if the firmware isn't translating it under the surface correctly.
| >
| >Anyway the new panic manually transcribed is:-
| >panic: Bad linx elm 0xffffff0069b0fc0 next->prev != elm
| >...
| >mfi_tbolt_get_cmd()
| >mfi_build_mpt_pass_thru()
| >mfi_tbolt_build_mpt_cmd()
| >mfi_tbolt_send_frame()
| >bus_dmamap_load()
| >mfi_mapcmd()
| >mfi_startio()
| >mfi_syspd_strategy()
| >g_disk_start()
| >g_io_schedule_down()
| >g_down_proc_body()
| >fork_exit()
| >fork_trampoline()
| >
| >Looks like mfi_cmd_tbolt_tqh has become corrupt some how, but as far as I
| >can tell all manip is done using the TAILQ macros and under mfi_io_lock
| >so its not obvious to me at this time why this is, any ideas?
| 
| I've gone through looking for the possible cause of this and while there's
| nothing directly connected to the manip of this queue I've found and fixed
| quite a large number of additional problems which may have been indirectly
| causing this problem.
| 
| The biggest change is to use mfi_max_cmds to limit the value stored in
| sc->mfi_max_fw_cmds as this is used extensively throughout the driver
| for allocation and range checks so having this inconsitently set opened up
| a large number of possible overrun errors.
| 
| The new patch attached documents all the changes in detail.
| 
| I've managed to do one test run so far which failed to reproduce any panics,
| so definitely moving in the right direction :)
| 
| The machine has now been collected for repair by the supplier but I'm going
| to try and get them to put it online for more testing over the weekend.
| 
| Given the failure rate so far if I can do another 4 runs with no panics I'd
| be happy that the majority of error conditions are working as expected.

Sounds like you have made some good progress.  I looked at your prior locking
change and they good.  Haven't had time to go through the queue changes
yet.

Thanks,

Doug A.



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20121109172508.GA13333>