From owner-freebsd-scsi@FreeBSD.ORG Mon Nov 12 11:06:50 2012
Return-Path:
Delivered-To: freebsd-scsi@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
 by hub.freebsd.org (Postfix) with ESMTP id D9DE1A55
 for ; Mon, 12 Nov 2012 11:06:50 +0000 (UTC)
 (envelope-from owner-bugmaster@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87])
 by mx1.freebsd.org (Postfix) with ESMTP id B57CA8FC0C
 for ; Mon, 12 Nov 2012 11:06:50 +0000 (UTC)
Received: from freefall.freebsd.org (localhost [127.0.0.1])
 by freefall.freebsd.org (8.14.5/8.14.5) with ESMTP id qACB6ocW000488
 for ; Mon, 12 Nov 2012 11:06:50 GMT
 (envelope-from owner-bugmaster@FreeBSD.org)
Received: (from gnats@localhost)
 by freefall.freebsd.org (8.14.5/8.14.5/Submit) id qACB6ojM000486
 for freebsd-scsi@FreeBSD.org; Mon, 12 Nov 2012 11:06:50 GMT
 (envelope-from owner-bugmaster@FreeBSD.org)
Date: Mon, 12 Nov 2012 11:06:50 GMT
Message-Id: <201211121106.qACB6ojM000486@freefall.freebsd.org>
X-Authentication-Warning: freefall.freebsd.org: gnats set sender to owner-bugmaster@FreeBSD.org using -f
From: FreeBSD bugmaster
To: freebsd-scsi@FreeBSD.org
Subject: Current problem reports assigned to freebsd-scsi@FreeBSD.org
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: SCSI subsystem
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 12 Nov 2012 11:06:50 -0000

Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.

S Tracker       Resp.  Description
--------------------------------------------------------------------------------
o kern/172575   scsi   [mfi] ioctl CAMGETPASSTHRU fails with mfi driver
o kern/171650   scsi   [da] da(4) driver does not recognize end of cciss (Sma
o kern/169976   scsi   [cam] [patch] make scsi_da use sysctl values where app
o kern/169835   scsi   [patch] remove some unused variables from scsi_da prob
o kern/169801   scsi   [cam] [patc] make changes to delete_method in scsi_da
o kern/169403   scsi   [cam] [patch] CAM layer, I/O starvation, no fairness
o kern/165982   scsi   [mpt] mpt instability, drive resets, and losses on Fre
o kern/165740   scsi   [cam] SCSI code must drain callbacks before free
o kern/163713   scsi   [aic7xxx] [patch] Add Adaptec29329LPE to aic79xx_pci.c
o kern/162256   scsi   [mpt] QUEUE FULL EVENT and 'mpt_cam_event: 0x0'
o kern/161809   scsi   [cam] [patch] set kern.cam.boot_delay via build option
o kern/159412   scsi   [ciss] 7.3 RELEASE: ciss0 ADAPTER HEARTBEAT FAILED err
o kern/157770   scsi   [iscsi] [panic] iscsi_initiator panic
o kern/154432   scsi   [xpt] run_interrupt_driven_hooks: still waiting after
o kern/153514   scsi   [cam] [panic] CAM related panic
o kern/153361   scsi   [ciss] Smart Array 5300 boot/detect drive problem
o kern/152250   scsi   [ciss] [patch] Kernel panic when hw.ciss.expose_hidden
o kern/151564   scsi   [ciss] ciss(4) should increase CISS_MAX_LOGICAL to 10
o docs/151336   scsi   Missing documentation of scsi_ and ata_ functions in c
s kern/149927   scsi   [cam] hard drive not stopped before removing power dur
o kern/148083   scsi   [aac] Strange device reporting
o kern/147704   scsi   [mpt] sys/dev/mpt: new chip revision, partially unsupp
o kern/146287   scsi   [ciss] ciss(4) cannot see more than one SmartArray con
o kern/145768   scsi   [mpt] can't perform I/O on SAS based SAN disk in freeb
o kern/144648   scsi   [aac] Strange values of speed and bus width in dmesg
o kern/144301   scsi   [ciss] [hang] HP proliant server locks when using ciss
o kern/142351   scsi   [mpt] LSILogic driver performance problems
o kern/134488   scsi   [mpt] MPT SCSI driver probes max. 8 LUNs per device
o kern/132250   scsi   [ciss] ciss driver does not support more then 15 drive
o kern/132206   scsi   [mpt] system panics on boot when mirroring and 2nd dri
o kern/130621   scsi   [mpt] tranfer rate is inscrutable slow when use lsi213
o kern/129602   scsi   [ahd] ahd(4) gets confused and wedges SCSI bus
o kern/128452   scsi   [sa] [panic] Accessing SCSI tape drive randomly crashe
o kern/128245   scsi   [scsi] "inquiry data fails comparison at DV1 step" [re
o kern/127927   scsi   [isp] isp(4) target driver crashes kernel when set up
o kern/127717   scsi   [ata] [patch] [request] - support write cache toggling
o kern/123674   scsi   [ahc] ahc driver dumping
o kern/123520   scsi   [ahd] unable to boot from net while using ahd
o sparc/121676  scsi   [iscsi] iscontrol do not connect iscsi-target on sparc
o kern/120487   scsi   [sg] scsi_sg incompatible with scanners
o kern/120247   scsi   [mpt] FreeBSD 6.3 and LSI Logic 1030 = only 3.300MB/s
o kern/114597   scsi   [sym] System hangs at SCSI bus reset with dual HBAs
o kern/110847   scsi   [ahd] Tyan U320 onboard problem with more than 3 disks
o kern/99954    scsi   [ahc] reading from DVD failes on 6.x [regression]
o kern/92798    scsi   [ahc] SCSI problem with timeouts
o kern/90282    scsi   [sym] SCSI bus resets cause loss of ch device
o kern/76178    scsi   [ahd] Problem with ahd and large SCSI Raid system
o kern/74627    scsi   [ahc] [hang] Adaptec 2940U2W Can't boot 5.3
s kern/61165    scsi   [panic] kernel page fault after calling cam_send_ccb
o kern/60641    scsi   [sym] Sporadic SCSI bus resets with 53C810 under load
o kern/60598    scsi   wire down of scsi devices conflicts with config
s kern/57398    scsi   [mly] Current fails to install on mly(4) based RAID di
o kern/52638    scsi   [panic] SCSI U320 on SMP server won't run faster than
o kern/44587    scsi   dev/dpt/dpt.h is missing defines required for DPT_HAND
o kern/39388    scsi   ncr/sym drivers fail with 53c810 and more than 256MB m
o kern/35234    scsi   World access to /dev/pass? (for scanner) requires acce

56 problems total.

From owner-freebsd-scsi@FreeBSD.ORG Wed Nov 14 17:59:37 2012
Return-Path:
Delivered-To: freebsd-scsi@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
 by hub.freebsd.org (Postfix) with ESMTP id 4CFA4C56;
 Wed, 14 Nov 2012 17:59:37 +0000 (UTC)
 (envelope-from prvs=166515a72b=killing@multiplay.co.uk)
Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23])
 by mx1.freebsd.org (Postfix) with ESMTP id A266A8FC13;
 Wed, 14 Nov 2012 17:59:36 +0000 (UTC)
Received: from r2d2 ([188.220.16.49])
 by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23])
 (MDaemon PRO v10.0.4) with ESMTP id md50001058451.msg;
 Wed, 14 Nov 2012 17:59:34 +0000
X-Spam-Processed: mail1.multiplay.co.uk, Wed, 14 Nov 2012 17:59:34 +0000
 (not processed: message from valid local sender)
X-MDRemoteIP: 188.220.16.49
X-Return-Path: prvs=166515a72b=killing@multiplay.co.uk
X-Envelope-From: killing@multiplay.co.uk
Message-ID: <6A9D5119A4774E8C8C8E0427035FC05B@multiplay.co.uk>
From: "Steven Hartland"
To: "Doug Ambrisko"
References: <2DC1C56CFFF24FE0B17C34AD21A7DFAA@multiplay.co.uk>
 <39D16C43C8274CE9B8F23C18459E2FD4@multiplay.co.uk>
 <20121105212911.GA17904@ambrisko.com>
 <27169C7FE704495087A093752D15E7B6@multiplay.co.uk>
 <20121106180152.GA40422@ambrisko.com>
 <6B5B65F4FC854EB8BBC701500096602E@multiplay.co.uk>
 <0B4E8AFF9DA04C6EBD2496A8B58F1D67@multiplay.co.uk>
 <20121109172508.GA13333@ambrisko.com>
Subject: Re: mfi panic on recused on non-recusive mutex MFI I/O lock
Date: Wed, 14 Nov 2012 17:59:36 -0000
MIME-Version: 1.0
Content-Type: text/plain; format=flowed; charset="iso-8859-1";
 reply-type=original
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2900.5931
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157
Cc: freebsd-scsi@freebsd.org, freebsd-stable@freebsd.org
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id:
 SCSI subsystem
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Wed, 14 Nov 2012 17:59:37 -0000

----- Original Message -----
From: "Doug Ambrisko"
To: "Steven Hartland"
Cc: ;
Sent: Friday, November 09, 2012 5:25 PM
Subject: Re: mfi panic on recused on non-recusive mutex MFI I/O lock

> On Fri, Nov 09, 2012 at 05:06:03PM -0000, Steven Hartland wrote:
> |
> | ----- Original Message -----
> | From: "Steven Hartland"
> | ...
> | >I've just had another panic, trace below, but it doesn't seem to be related
> | >to my changes so I'd appreciate your feedback on them as they are for now.
> | >
> | >While the lock patch fixes the problems I've seen, it's not clear to me
> | >why mfi_tbolt_reset is acquiring the lock and hence requiring
> | >mfi_process_fw_state_chg_isr to jump through hoops to ensure locking
> | >around queue manipulation is done correctly. Given what it's doing
> | >(resetting the entire adapter) I wouldn't be surprised if it should
> | >really be acquiring the config lock.
> | >
> | >Other things I've noticed / questions
> | >* Should mfi_abort sleep even if its call to mfi_mapcmd fails?
> | >* Should mfi_get_controller_info really ignore the error from mfi_mapcmd?
> | >* Do these controllers not support non-512-byte requests? Currently
> | >all syspd requests are done assuming 512 byte sectors which the disk may
> | >not be. This will either reduce performance or potentially break totally
> | >if the firmware isn't translating it under the surface correctly.
> | >
> | >Anyway the new panic manually transcribed is:-
> | >panic: Bad link elm 0xffffff0069b0fc0 next->prev != elm
> | >...
> | >mfi_tbolt_get_cmd()
> | >mfi_build_mpt_pass_thru()
> | >mfi_tbolt_build_mpt_cmd()
> | >mfi_tbolt_send_frame()
> | >bus_dmamap_load()
> | >mfi_mapcmd()
> | >mfi_startio()
> | >mfi_syspd_strategy()
> | >g_disk_start()
> | >g_io_schedule_down()
> | >g_down_proc_body()
> | >fork_exit()
> | >fork_trampoline()
> | >
> | >Looks like mfi_cmd_tbolt_tqh has become corrupt somehow, but as far as I
> | >can tell all manip is done using the TAILQ macros and under mfi_io_lock
> | >so it's not obvious to me at this time why this is, any ideas?
> |
> | I've gone through looking for the possible cause of this and while there's
> | nothing directly connected to the manip of this queue I've found and fixed
> | quite a large number of additional problems which may have been indirectly
> | causing this problem.
> |
> | The biggest change is to use mfi_max_cmds to limit the value stored in
> | sc->mfi_max_fw_cmds as this is used extensively throughout the driver
> | for allocation and range checks so having this inconsistently set opened up
> | a large number of possible overrun errors.
> |
> | The new patch attached documents all the changes in detail.
> |
> | I've managed to do one test run so far which failed to reproduce any panics,
> | so definitely moving in the right direction :)
> |
> | The machine has now been collected for repair by the supplier but I'm going
> | to try and get them to put it online for more testing over the weekend.
> |
> | Given the failure rate so far if I can do another 4 runs with no panics I'd
> | be happy that the majority of error conditions are working as expected.
>
> Sounds like you have made some good progress. I looked at your prior locking
> changes and they look good. Haven't had time to go through the queue changes
> yet.

Just to update people on this: it's taken quite some time to track down the
random issues causing panics, but I believe I made a breakthrough last night.

It seems that the cleanup interaction between mfi_cmd's and tbolt_cmd's is
flawed, meaning it's possible that tbolt commands are processed after the
caller has already received a response, cleaned up and returned the mfi_cmd
to the free queue.

This means that it's anyone's guess what the result of the tbolt cleanup is,
as it could well be operating on an mfi_cmd that's either now in the free
queue or, even worse, has already been reused.

It's also possible this was the underlying issue you may well have been
seeing, which caused you to add the mfi_tbolt_complete_cmd calls to
mfi_tbolt_send_frame in r242681.

If this is correct then I believe the correct fix is to ensure that
mfi_tbolt_return_cmd is only ever called from mfi_release_command, thus
ensuring completion ordering is always correct. I'm testing fixes for this
theory now but initial debugging has had good results.

The patch of fixes is really growing, so I'm definitely going to need someone
to review it in detail when I'm done.

What do you think of the above, does it make sense? Would you be willing to
review the patch when I'm done, before I commit it, Doug?

    Regards
    Steve

================================================
This e.mail is private and confidential between Multiplay (UK) Ltd. and the
person or entity to whom it is addressed. In the event of misdirection, the
recipient is prohibited from using, copying, printing or otherwise
disseminating it or any information contained in it.

In the event of misdirection, illegible or incomplete transmission please
telephone +44 845 868 1337 or return the E.mail to
postmaster@multiplay.co.uk.

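As a rough illustration of the single release path proposed in the message above, here is a small userspace sketch. It is not the driver code: the structures, field names and functions (struct softc, release_mfi_cmd, return_tbolt_cmd, demo_release_ordering) are simplified, hypothetical stand-ins, and locking is left out entirely. It only shows the idea that the paired tbolt command goes back to its free list from exactly one place, the mfi release function, so it can never outlive the mfi_cmd that owns it.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical, simplified stand-ins for the driver's command structures. */
struct tbolt_cmd {
    TAILQ_ENTRY(tbolt_cmd) link;
    int on_free_list;              /* 1 while the command sits on the free list */
};

struct mfi_cmd {
    TAILQ_ENTRY(mfi_cmd) link;
    struct tbolt_cmd *tbolt;       /* paired tbolt command, or NULL */
    int on_free_list;
};

struct softc {
    TAILQ_HEAD(, mfi_cmd) mfi_free;
    TAILQ_HEAD(, tbolt_cmd) tbolt_free;
};

static void softc_init(struct softc *sc)
{
    TAILQ_INIT(&sc->mfi_free);
    TAILQ_INIT(&sc->tbolt_free);
}

/* Assumption for this sketch: only release_mfi_cmd() ever calls this. */
static void return_tbolt_cmd(struct softc *sc, struct tbolt_cmd *t)
{
    assert(!t->on_free_list);      /* catches a double return */
    t->on_free_list = 1;
    TAILQ_INSERT_TAIL(&sc->tbolt_free, t, link);
}

/* The single release path: detach and return the tbolt command first,
 * then put the mfi command itself back on its free list. */
static void release_mfi_cmd(struct softc *sc, struct mfi_cmd *m)
{
    assert(!m->on_free_list);      /* catches a double release */
    if (m->tbolt != NULL) {
        return_tbolt_cmd(sc, m->tbolt);
        m->tbolt = NULL;
    }
    m->on_free_list = 1;
    TAILQ_INSERT_TAIL(&sc->mfi_free, m, link);
}

/* Returns 0 when both commands end up on their free lists and the
 * pairing has been broken; a non-zero value names the failed check. */
int demo_release_ordering(void)
{
    struct softc sc;
    struct tbolt_cmd t = { .on_free_list = 0 };
    struct mfi_cmd m = { .tbolt = &t, .on_free_list = 0 };

    softc_init(&sc);
    release_mfi_cmd(&sc, &m);

    if (m.tbolt != NULL)
        return 1;
    if (TAILQ_FIRST(&sc.mfi_free) != &m)
        return 2;
    if (TAILQ_FIRST(&sc.tbolt_free) != &t)
        return 3;
    if (!t.on_free_list || !m.on_free_list)
        return 4;
    return 0;
}
```

Calling demo_release_ordering() from a small test program exercises the path once. In the real driver the equivalent of both functions would presumably run with mfi_io_lock held, which this sketch does not model.
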
From owner-freebsd-scsi@FreeBSD.ORG Fri Nov 16 23:39:26 2012
Return-Path:
Delivered-To: freebsd-scsi@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
 by hub.freebsd.org (Postfix) with ESMTP id 7BC99A63;
 Fri, 16 Nov 2012 23:39:26 +0000 (UTC)
 (envelope-from prvs=1667f8fa28=killing@multiplay.co.uk)
Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23])
 by mx1.freebsd.org (Postfix) with ESMTP id B98BA8FC08;
 Fri, 16 Nov 2012 23:39:25 +0000 (UTC)
Received: from r2d2 ([188.220.16.49])
 by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23])
 (MDaemon PRO v10.0.4) with ESMTP id md50001083906.msg;
 Fri, 16 Nov 2012 23:39:17 +0000
X-Spam-Processed: mail1.multiplay.co.uk, Fri, 16 Nov 2012 23:39:17 +0000
 (not processed: message from valid local sender)
X-MDRemoteIP: 188.220.16.49
X-Return-Path: prvs=1667f8fa28=killing@multiplay.co.uk
X-Envelope-From: killing@multiplay.co.uk
Message-ID: <5AD5215CCD2D4693B211A4F6D076987D@multiplay.co.uk>
From: "Steven Hartland"
To: "Doug Ambrisko"
References: <2DC1C56CFFF24FE0B17C34AD21A7DFAA@multiplay.co.uk>
 <39D16C43C8274CE9B8F23C18459E2FD4@multiplay.co.uk>
 <20121105212911.GA17904@ambrisko.com>
 <27169C7FE704495087A093752D15E7B6@multiplay.co.uk>
 <20121106180152.GA40422@ambrisko.com>
 <6B5B65F4FC854EB8BBC701500096602E@multiplay.co.uk>
 <0B4E8AFF9DA04C6EBD2496A8B58F1D67@multiplay.co.uk>
 <20121109172508.GA13333@ambrisko.com>
 <6A9D5119A4774E8C8C8E0427035FC05B@multiplay.co.uk>
Subject: Re: mfi panic on recused on non-recusive mutex MFI I/O lock
Date: Fri, 16 Nov 2012 23:39:18 -0000
MIME-Version: 1.0
Content-Type: text/plain; format=flowed; charset="iso-8859-1";
 reply-type=response
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2900.5931
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157
Cc: freebsd-scsi@freebsd.org, freebsd-stable@freebsd.org
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: SCSI subsystem
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Fri, 16 Nov 2012 23:39:26 -0000

----- Original Message -----
From: "Steven Hartland"

>> Sounds like you have made some good progress. I looked at your prior locking
>> changes and they look good. Haven't had time to go through the queue changes
>> yet.
>
> Just to update people on this: it's taken quite some time to track down the
> random issues causing panics, but I believe I made a breakthrough last night.
>
> It seems that the cleanup interaction between mfi_cmd's and tbolt_cmd's is
> flawed, meaning it's possible that tbolt commands are processed after the
> caller has already received a response, cleaned up and returned the mfi_cmd
> to the free queue.
>
> This means that it's anyone's guess what the result of the tbolt cleanup is,
> as it could well be operating on an mfi_cmd that's either now in the free
> queue or, even worse, has already been reused.
>
> It's also possible this was the underlying issue you may well have been
> seeing, which caused you to add the mfi_tbolt_complete_cmd calls to
> mfi_tbolt_send_frame in r242681.
>
> If this is correct then I believe the correct fix is to ensure that
> mfi_tbolt_return_cmd is only ever called from mfi_release_command, thus
> ensuring completion ordering is always correct. I'm testing fixes for this
> theory now but initial debugging has had good results.
>
> The patch of fixes is really growing, so I'm definitely going to need someone
> to review it in detail when I'm done.
>
> What do you think of the above, does it make sense? Would you be willing to
> review the patch when I'm done, before I commit it, Doug?

Ok I think I'm done.

The good news is I've managed to fix all panics and cases of commands being
processed incorrectly that we've seen here.

The bad news is the patch is now really quite large as there were a lot of
issues found during debugging of the core problems.

The main fixes are:-
1. Ensure that the IO lock is not dropped during tbolt ISR processing, as
this can cause some very nasty issues when two threads end up processing
the same tbolt cmd.
2. Ensure that the interaction between mfi_cmd's and tbolt_cmd's is
consistent, specifically in their cleanup, total number and range checks,
as if this isn't done then again some very nasty issues can occur.
3. Ensure that tbolt init doesn't break MFI indexing by assuming it always
gets the first mfi command structure.

The rest of the fixes are for things like potential NULL pointer
dereferences, locks not being dropped during error cases etc.

Full details of all the fixes are in the patch which can be found here:-
http://blog.multiplay.co.uk/dropzone/freebsd/zz-mfi-queue.patch

It should be noted that while the changes now make the driver functionally
correct, the promotion of the IO lock to the upper layers isn't ideal and
could do with optimising.

    Regards
    Steve
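
The "total number and range checks" in fix 2, together with the earlier note in this thread about using mfi_max_cmds to limit sc->mfi_max_fw_cmds, can be sketched roughly as below. This is an illustration under assumptions rather than the patch itself: struct cmd_pool, POOL_CAPACITY and the three functions are hypothetical names, and the sketch simply takes the smaller of the firmware-reported count and the tunable, then uses that one value for every later bounds check.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_CAPACITY 256          /* hypothetical compile-time upper bound */

/* Hypothetical stand-in for the per-adapter command bookkeeping. */
struct cmd_pool {
    unsigned int max_fw_cmds;      /* the single count used for allocation and checks */
    int slots[POOL_CAPACITY];      /* stand-in for per-command state */
};

/* Take the smaller of what the firmware reports and what the tunable allows. */
unsigned int clamp_max_fw_cmds(unsigned int fw_reported, unsigned int tunable_max)
{
    return fw_reported < tunable_max ? fw_reported : tunable_max;
}

/* Store the clamped count once, so sizing and range checks cannot disagree. */
void pool_init(struct cmd_pool *p, unsigned int fw_reported, unsigned int tunable_max)
{
    p->max_fw_cmds = clamp_max_fw_cmds(fw_reported, tunable_max);
    assert(p->max_fw_cmds <= POOL_CAPACITY);
}

/* Range-checked lookup: an index at or beyond the clamped count is rejected
 * instead of reading past the commands that were actually set up. */
int *pool_lookup(struct cmd_pool *p, unsigned int index)
{
    if (index >= p->max_fw_cmds)
        return NULL;
    return &p->slots[index];
}
```

With a firmware-reported count of 1008 and a tunable of 128, pool_init() stores 128, so pool_lookup() accepts index 127 and returns NULL for index 128.
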