From: "Steven Hartland"
Subject: Re: mfi panic on recused on non-recusive mutex MFI I/O lock
Date: Mon, 5 Nov 2012 16:55:11 -0000
Message-ID: <39D16C43C8274CE9B8F23C18459E2FD4@multiplay.co.uk>
References: <2DC1C56CFFF24FE0B17C34AD21A7DFAA@multiplay.co.uk>

I've managed to get the machine to reproduce this fairly regularly now.

Without a debug kernel it still results in a panic, just at a later
stage, or so I believe; the non-debug panic message is
"command not in queue".
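To make the "command not in queue" check concrete, here is a minimal
userspace sketch of a queue that records membership in a per-command
flags word and refuses to dequeue an entry whose flags disagree with
the queue it sits on. It is an illustration only, not the driver's
code: struct cmd, the ON_* bits, enqueue() and dequeue() are all
hypothetical names, and the sketch returns an error where a kernel
driver would panic.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/queue.h>

/* Hypothetical flag bits, one per queue (illustrative names only). */
#define ON_FREE  0x1
#define ON_READY 0x2
#define ON_BUSY  0x4
#define ON_MASK  (ON_FREE | ON_READY | ON_BUSY)

struct cmd {
	TAILQ_ENTRY(cmd) link;
	unsigned flags;
};
TAILQ_HEAD(cmdq, cmd);

/* Insert at the tail and record membership in the flags word. */
static int
enqueue(struct cmdq *q, struct cmd *c, unsigned bit)
{
	assert(q != NULL && c != NULL);
	if (c->flags & ON_MASK)
		return (-1);	/* already marked as being on some queue */
	TAILQ_INSERT_TAIL(q, c, link);
	c->flags |= bit;
	return (0);
}

/* Remove the head entry; fail if its flags disagree with this queue. */
static struct cmd *
dequeue(struct cmdq *q, unsigned bit)
{
	struct cmd *c;

	assert(q != NULL);
	c = TAILQ_FIRST(q);
	if (c == NULL)
		return (NULL);
	if ((c->flags & bit) == 0) {
		/* A kernel driver would panic at this point. */
		fprintf(stderr, "command not in queue, flags = %#x\n",
		    c->flags);
		return (NULL);
	}
	TAILQ_REMOVE(q, c, link);
	c->flags &= ~bit;
	return (c);
}
```

With this layout, a head entry whose flags say ON_BUSY while it is
linked on the free list makes dequeue(&freeq, ON_FREE) fail, because
the list linkage and the flags word are updated in two separate steps
and nothing in the sketch makes the pair atomic.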
In each non-debug panic I've seen, cm_flags indicates that the command
being dequeued is on the busy queue and not on the expected free or
ready queue, which is the one being processed at the time.

The triggering issue seems to be the adapter reset code run from
mfi_timeout.

I've had a good look but can't see how a cm could be in a queue yet
have its cm_flags set to that of a different queue, as all manipulation
seems to be done via the "mfi_ ## name" macros, which all correctly
maintain the queue / cm_flags relationship.

At this point I believe it could be a thread being interrupted by a
timeout part way through the processing of a queue request, hence the
queue and cm_flags being out of sync.

Any pointers on how to debug this issue further / fix it would be most
appreciated.

Regards
Steve

----- Original Message -----
From: "Steven Hartland"

> Testing a new machine which is based on 8.3-RELEASE with the mfi
> driver from 8-STABLE and just got a panic.
>
>
> The below is a transcription, hand copied from the console:-
> mfi0: sense error 0, sense_key 0, asc 0, ascq 0
> mfisyspd5: hard error cmd=write 90827650-90827905
> mfi0: I/O error, status= 46 scsi_status= 240
> mfi0: sense error 0, sense_key 0, asc 0, ascq 0
> mfisyspd5: hard error cmd=write 90827394-90827649
> mfi0: I/O error, status= 46 scsi_status= 240
> mfi0: sense error 0, sense_key 0, asc 0, ascq 0
> mfisyspd5: hard error cmd=write 90827138-90827393
> mfi0: I/O error, status= 46 scsi_status= 240
> mfi0: sense error 0, sense_key 0, asc 0, ascq 0
> mfisyspd5: hard error cmd=write 90826882-90827137
> mfi0: I/O error, status= 2 scsi_status= 2
> mfi0: sense error 112, sense_key 6, asc 41, ascq 0
> mfisyspd4: hard error cmd=write 90830466-90830721
> mfi0: I/O error, status= 2 scsi_status= 2
> mfi0: sense error 112, sense_key 6, asc 41, ascq 0
> mfisyspd5: hard error cmd=write 90830722-90830977
> mfi0: Adapter RESET condition detected
> mfi0: First state FW reset initiated...
> mfi0: ADP_RESET_TBOLT: HostDiag=a0
> mfi0: first state of reset complete, second state initiated...
> mfi0: Second state FW reset initiated...
> panic: _mtx_lock_sleep: recursed on non-recursive mutex MFI I/O lock @ /usr/src/sys/dev/mfi/mfi_tbolt:346
>
> cpuid = 6
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
> kdb_backtrace() at kdb_backtrace+0x37
> panic() at panic+0x178
> _mtx_lock_sleep() at _mtx_lock_sleep+0x152
> _mtx_lock_flags() at _mtx_lock_flags+0x80
> mfi_tbolt_init_MFI_queue() at mfi_tbolt_init_MFI_queue+0x72
> mfi_timeout() at mfi_timeout+0x27
> softclock() at softclock+0x2aa
> intr_event_execute_handlers() at intr_event_execute_handlers+0x66
> ithread_loop() at ithread_loop+0xb2
> fork_exit() at fork_exit+0x135
> fork_trampoline() at fork_trampoline+0xe
> --- trap 0, rip = 0, rsp = 0xffffff80005ccd00, rbp = 0 ---
> KDB: enter panic
> [thread pid 12 tid 100020 ]
> Stopped at kdb_enter+0x3b: movq $0,0x51cb32(%rip)
> db>
>
> So questions:-
> 1. What are the "hard error" errors? The machine was testing IO
> with dd but due to the panic I can't tell if that was the cause.
> 2. Looking at the code this seems like the reset was tripped by
> a firmware bug, is that the case?
> 3. Is the fix for the panic a simple one we can test?
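
A "recursed on non-recursive mutex" panic means the thread that tried
to take the lock already owned it. The backtrace above shows the
attempt being made in mfi_tbolt_init_MFI_queue, reached from
mfi_timeout; where exactly the lock was first taken is not visible
from the trace, so the following is only a toy model of the pattern
and of one common way around it. Everything in it (struct toy_mtx,
toy_lock, the two init_queue_* helpers, timeout_handler) is a
hypothetical name, and the single-threaded owner check stands in for
the kernel's real mutex implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Toy lock that records its owner; not the kernel's struct mtx. */
struct toy_mtx {
	const void *owner;	/* NULL when unlocked */
};

enum { LOCK_OK = 0, LOCK_RECURSED = -1 };

static int
toy_lock(struct toy_mtx *m, const void *self)
{
	if (m->owner == self)
		return (LOCK_RECURSED);	/* a kernel would panic here */
	assert(m->owner == NULL);	/* single-threaded model only */
	m->owner = self;
	return (LOCK_OK);
}

static void
toy_unlock(struct toy_mtx *m, const void *self)
{
	assert(m->owner == self);
	m->owner = NULL;
}

/* Helper that takes the lock itself: fails if the caller holds it. */
static int
init_queue_locking(struct toy_mtx *m, const void *self)
{
	int error = toy_lock(m, self);

	if (error != LOCK_OK)
		return (error);
	/* ... queue setup would go here ... */
	toy_unlock(m, self);
	return (LOCK_OK);
}

/* Helper that instead requires the caller to hold the lock. */
static int
init_queue_locked(struct toy_mtx *m, const void *self)
{
	assert(m->owner == self);
	/* ... queue setup would go here ... */
	return (LOCK_OK);
}

/* Timeout-style handler: takes the lock, then calls a helper. */
static int
timeout_handler(struct toy_mtx *m, const void *self,
    int (*helper)(struct toy_mtx *, const void *))
{
	int error;

	if (toy_lock(m, self) != LOCK_OK)
		return (LOCK_RECURSED);
	error = helper(m, self);
	toy_unlock(m, self);
	return (error);
}
```

Calling timeout_handler with init_queue_locking reports the recursion,
while the init_queue_locked variant succeeds because it only asserts
ownership. Whether the real driver should drop the lock before the
call or have the helper assume it is held depends on its other
callers, which this sketch does not model.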