From owner-freebsd-scsi@FreeBSD.ORG Tue Nov 6 18:01:54 2012 Return-Path: Delivered-To: freebsd-scsi@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 8550D753; Tue, 6 Nov 2012 18:01:54 +0000 (UTC) (envelope-from ambrisko@ambrisko.com) Received: from mail.ambrisko.com (mail.ambrisko.com [70.91.206.90]) by mx1.freebsd.org (Postfix) with ESMTP id 559B48FC0C; Tue, 6 Nov 2012 18:01:53 +0000 (UTC) X-Ambrisko-Me: Yes Received: from server2.ambrisko.com (HELO internal.ambrisko.com) ([192.168.1.2]) by ironport.ambrisko.com with ESMTP; 06 Nov 2012 10:03:11 -0800 Received: from ambrisko.com (localhost [127.0.0.1]) by internal.ambrisko.com (8.14.4/8.14.4) with ESMTP id qA6I1rHx042710; Tue, 6 Nov 2012 10:01:53 -0800 (PST) (envelope-from ambrisko@ambrisko.com) Received: (from ambrisko@localhost) by ambrisko.com (8.14.4/8.14.4/Submit) id qA6I1qaB042708; Tue, 6 Nov 2012 10:01:52 -0800 (PST) (envelope-from ambrisko) Date: Tue, 6 Nov 2012 10:01:52 -0800 From: Doug Ambrisko To: Steven Hartland Subject: Re: mfi panic on recused on non-recusive mutex MFI I/O lock Message-ID: <20121106180152.GA40422@ambrisko.com> References: <2DC1C56CFFF24FE0B17C34AD21A7DFAA@multiplay.co.uk> <39D16C43C8274CE9B8F23C18459E2FD4@multiplay.co.uk> <20121105212911.GA17904@ambrisko.com> <27169C7FE704495087A093752D15E7B6@multiplay.co.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <27169C7FE704495087A093752D15E7B6@multiplay.co.uk> User-Agent: Mutt/1.4.2.3i Cc: freebsd-scsi@freebsd.org, freebsd-stable@freebsd.org X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 06 Nov 2012 18:01:54 -0000 On Tue, Nov 06, 2012 at 12:09:42AM -0000, Steven Hartland wrote: | Thanks Doug, actually just finished another test run with some more | debugging in and I believe I've found the reason for the non-recusive | lock and at least some of the queuing issues. | | The non-recursive lock is due to the mfi_tbolt_reset calling | mfi_process_fw_state_chg_isr with mfi_io_lock held which in turn calls | mfi_tbolt_init_MFI_queue which tries to acquire mfi_io_lock hence | the problem. | | mfi-lock.txt attached I believe fixes this as well as what appears | to be an invalid call to mtx_unlock(&sc->mfi_io_lock) in mfi_attach | which never acquires the lock as far as can see, possibly a cut and | paste error. I don't seem to see the attachment. | The invalid queue problems seem to stem from the error cases of | the calls to mfi_mapcmd, some of which call mfi_release_command which | blindly sets cm_flags = 0 and then enqueues it on the free queue. Now | depending on the flow of mfi_mapcmd and where the error occurs the | command may or may not have been put on the busy queue which is going | to cause problems. | | Going to investigate this further but that's what my current theory is. | | Your patch seems quite extensive, so if could you give me brief run | down on the changes that would be most appreciated. I'll being doing that in the commit message which should happen today. | FYI, I'm aware that the cause of my underlying issues are some | hardware issues (likely cable or backplane related) but it does mean | I'm in the position to test these usually rare error cases, so wanting | the make the most of it before we get the hardware swapped out. That would be good. It makes it easier to debug things when it shows the problem. Thanks, Doug A.