Date: Fri, 9 Nov 2012 09:25:08 -0800
From: Doug Ambrisko <ambrisko@ambrisko.com>
To: Steven Hartland <killing@multiplay.co.uk>
Subject: Re: mfi panic on recursed on non-recursive mutex MFI I/O lock
Message-ID: <20121109172508.GA13333@ambrisko.com>
References: <2DC1C56CFFF24FE0B17C34AD21A7DFAA@multiplay.co.uk>
 <39D16C43C8274CE9B8F23C18459E2FD4@multiplay.co.uk>
 <20121105212911.GA17904@ambrisko.com>
 <27169C7FE704495087A093752D15E7B6@multiplay.co.uk>
 <20121106180152.GA40422@ambrisko.com>
 <6B5B65F4FC854EB8BBC701500096602E@multiplay.co.uk>
 <0B4E8AFF9DA04C6EBD2496A8B58F1D67@multiplay.co.uk>
 <F46B51033DB84937AEEC8F4A95211DAB@multiplay.co.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <F46B51033DB84937AEEC8F4A95211DAB@multiplay.co.uk>
User-Agent: Mutt/1.4.2.3i
Cc: freebsd-scsi@freebsd.org, freebsd-stable@freebsd.org

On Fri, Nov 09, 2012 at 05:06:03PM -0000, Steven Hartland wrote:
| 
| ----- Original Message ----- 
| From: "Steven Hartland"
| ...
| >I've just had another panic, trace below, but it doesn't seem to be related
| >to my changes so I'd appreciate your feedback on them as they are for now.
| >
| >While the lock patch fixes the problems I've seen, it's not clear to me
| >why mfi_tbolt_reset is acquiring the lock and hence requiring
| >mfi_process_fw_state_chg_isr to jump through hoops to ensure locking
| >around queue manipulation is done correctly. Given what it's doing
| >(resetting the entire adapter) I wouldn't be surprised if it should
| >really be acquiring the config lock.
| >
| >Other things I've noticed / questions
| >* Should mfi_abort sleep even if its call to mfi_mapcmd fails?
| >* Should mfi_get_controller_info really ignore the error from mfi_mapcmd?
| >* Do these controllers not support non-512-byte requests? Currently
| >all syspd requests are done assuming 512-byte sectors, which the disk may
| >not have. This will either reduce performance or potentially break totally
| >if the firmware isn't translating it under the surface correctly.
| >
| >Anyway the new panic manually transcribed is:-
| >panic: Bad link elm 0xffffff0069b0fc0 next->prev != elm
| >...
| >mfi_tbolt_get_cmd()
| >mfi_build_mpt_pass_thru()
| >mfi_tbolt_build_mpt_cmd()
| >mfi_tbolt_send_frame()
| >bus_dmamap_load()
| >mfi_mapcmd()
| >mfi_startio()
| >mfi_syspd_strategy()
| >g_disk_start()
| >g_io_schedule_down()
| >g_down_proc_body()
| >fork_exit()
| >fork_trampoline()
| >
| >Looks like mfi_cmd_tbolt_tqh has become corrupt somehow, but as far as I
| >can tell all manipulation is done using the TAILQ macros and under
| >mfi_io_lock, so it's not obvious to me at this time why this is. Any ideas?
| 
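
For context, here is a minimal sketch of the invariant that the quoted
panic message ("next->prev != elm") describes. It assumes the userland
<sys/queue.h> TAILQ macros; struct cmd, its link field, and
cmd_next_link_ok() are hypothetical names for illustration only and are
not taken from the mfi driver.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical command structure; only the list linkage matters here. */
struct cmd {
	int index;
	TAILQ_ENTRY(cmd) link;
};
TAILQ_HEAD(cmd_list, cmd);

/*
 * Return 1 if the element after elm points back at elm, 0 otherwise.
 * In a TAILQ, tqe_prev holds the address of the previous element's
 * tqe_next pointer, so a healthy list satisfies
 * next->link.tqe_prev == &elm->link.tqe_next.
 */
static int
cmd_next_link_ok(struct cmd *elm)
{
	struct cmd *next = TAILQ_NEXT(elm, link);

	return (next == NULL || next->link.tqe_prev == &elm->link.tqe_next);
}
```

If this check fails for an element, something has modified the linkage
outside the normal insert/remove path, for example by removing the same
element twice or by touching the list without the lock held.
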
| I've gone through looking for the possible cause of this and while there's
| nothing directly connected to the manip of this queue I've found and fixed
| quite a large number of additional problems which may have been indirectly
| causing this problem.
| 
| The biggest change is to use mfi_max_cmds to limit the value stored in
| sc->mfi_max_fw_cmds as this is used extensively throughout the driver
| for allocation and range checks so having this inconsistently set opened up
| a large number of possible overrun errors.
| 
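
A minimal sketch of the clamping described above, assuming the cap is
applied once when the firmware-reported value is stored. The struct,
the function name, and the cap value of 128 are hypothetical; only the
names mfi_max_cmds and mfi_max_fw_cmds come from the mail, and this is
not the actual patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for the driver softc; one field only. */
struct mfi_softc_sketch {
	uint32_t mfi_max_fw_cmds;
};

/* Assumed cap, named after mfi_max_cmds; the value is illustrative. */
static uint32_t mfi_max_cmds = 128;

/*
 * Store the smaller of the firmware-reported command count and the cap,
 * so every later allocation and range check sees the same limit.
 */
static void
mfi_set_max_fw_cmds(struct mfi_softc_sketch *sc, uint32_t fw_reported)
{
	sc->mfi_max_fw_cmds =
	    (fw_reported < mfi_max_cmds) ? fw_reported : mfi_max_cmds;
}
```

Doing the clamp in one place means arrays sized from the stored value
and indexes checked against it cannot disagree.
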
| The new patch attached documents all the changes in detail.
| 
| I've managed to do one test run so far which failed to reproduce any panics,
| so definitely moving in the right direction :)
| 
| The machine has now been collected for repair by the supplier but I'm going
| to try and get them to put it online for more testing over the weekend.
| 
| Given the failure rate so far if I can do another 4 runs with no panics I'd
| be happy that the majority of error conditions are working as expected.

Sounds like you have made some good progress.  I looked at your prior locking
changes and they look good.  I haven't had time to go through the queue
changes yet.

Thanks,

Doug A.