From owner-freebsd-scsi@FreeBSD.ORG Mon Oct 1 11:07:27 2012
Date: Mon, 1 Oct 2012 11:07:26 GMT
Message-Id: <201210011107.q91B7QMb025089@freefall.freebsd.org>
From: FreeBSD bugmaster
To: freebsd-scsi@FreeBSD.org
Subject: Current problem reports assigned to freebsd-scsi@FreeBSD.org

Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.

S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/169976  scsi       [cam] [patch] make scsi_da use sysctl values where app
o kern/169974  scsi       [cam] [patch] add Quirks for SSD that are 4k optimised
o kern/169835  scsi       [patch] remove some unused variables from scsi_da prob
o kern/169801  scsi       [cam] [patc] make changes to delete_method in scsi_da
o kern/169403  scsi       [cam] [patch] CAM layer, I/O starvation, no fairness
o kern/165982  scsi       [mpt] mpt instability, drive resets, and losses on Fre
o kern/165740  scsi       [cam] SCSI code must drain callbacks before free
o kern/163713  scsi       [aic7xxx] [patch] Add Adaptec29329LPE to aic79xx_pci.c
o kern/162256  scsi       [mpt] QUEUE FULL EVENT and 'mpt_cam_event: 0x0'
o kern/161809  scsi       [cam] [patch] set kern.cam.boot_delay via build option
o kern/159412  scsi       [ciss] 7.3 RELEASE: ciss0 ADAPTER HEARTBEAT FAILED err
o kern/157770  scsi       [iscsi] [panic] iscsi_initiator panic
o kern/154432  scsi       [xpt] run_interrupt_driven_hooks: still waiting after
o kern/153514  scsi       [cam] [panic] CAM related panic
o kern/153361  scsi       [ciss] Smart Array 5300 boot/detect drive problem
o kern/152250  scsi       [ciss] [patch] Kernel panic when hw.ciss.expose_hidden
o kern/151564  scsi       [ciss] ciss(4) should increase CISS_MAX_LOGICAL to 10
o docs/151336  scsi       Missing documentation of scsi_ and ata_ functions in c
s kern/149927  scsi       [cam] hard drive not stopped before removing power dur
o kern/148083  scsi       [aac] Strange device reporting
o kern/147704  scsi       [mpt] sys/dev/mpt: new chip revision, partially unsupp
o kern/146287  scsi       [ciss] ciss(4) cannot see more than one SmartArray con
o kern/145768  scsi       [mpt] can't perform I/O on SAS based SAN disk in freeb
o kern/144648  scsi       [aac] Strange values of speed and bus width in dmesg
o kern/144301  scsi       [ciss] [hang] HP proliant server locks when using ciss
o kern/142351  scsi       [mpt] LSILogic driver performance problems
o kern/134488  scsi       [mpt] MPT SCSI driver probes max. 8 LUNs per device
o kern/132250  scsi       [ciss] ciss driver does not support more then 15 drive
o kern/132206  scsi       [mpt] system panics on boot when mirroring and 2nd dri
o kern/130621  scsi       [mpt] tranfer rate is inscrutable slow when use lsi213
o kern/129602  scsi       [ahd] ahd(4) gets confused and wedges SCSI bus
o kern/128452  scsi       [sa] [panic] Accessing SCSI tape drive randomly crashe
o kern/128245  scsi       [scsi] "inquiry data fails comparison at DV1 step" [re
o kern/127927  scsi       [isp] isp(4) target driver crashes kernel when set up
o kern/127717  scsi       [ata] [patch] [request] - support write cache toggling
o kern/123674  scsi       [ahc] ahc driver dumping
o kern/123520  scsi       [ahd] unable to boot from net while using ahd
o sparc/121676 scsi       [iscsi] iscontrol do not connect iscsi-target on sparc
o kern/120487  scsi       [sg] scsi_sg incompatible with scanners
o kern/120247  scsi       [mpt] FreeBSD 6.3 and LSI Logic 1030 = only 3.300MB/s
o kern/114597  scsi       [sym] System hangs at SCSI bus reset with dual HBAs
o kern/110847  scsi       [ahd] Tyan U320 onboard problem with more than 3 disks
o kern/99954   scsi       [ahc] reading from DVD failes on 6.x [regression]
o kern/92798   scsi       [ahc] SCSI problem with timeouts
o kern/90282   scsi       [sym] SCSI bus resets cause loss of ch device
o kern/76178   scsi       [ahd] Problem with ahd and large SCSI Raid system
o kern/74627   scsi       [ahc] [hang] Adaptec 2940U2W Can't boot 5.3
s kern/61165   scsi       [panic] kernel page fault after calling cam_send_ccb
o kern/60641   scsi       [sym] Sporadic SCSI bus resets with 53C810 under load
o kern/60598   scsi       wire down of scsi devices conflicts with config
s kern/57398   scsi       [mly] Current fails to install on mly(4) based RAID di
o kern/52638   scsi       [panic] SCSI U320 on SMP server won't run faster than
o kern/44587   scsi       dev/dpt/dpt.h is missing defines required for DPT_HAND
o kern/39388   scsi       ncr/sym drivers fail with 53c810 and more than 256MB m
o kern/35234   scsi       World access to /dev/pass? (for scanner) requires acce

55 problems total.

From owner-freebsd-scsi@FreeBSD.ORG Fri Oct 5 03:53:01 2012
Date: Fri, 5 Oct 2012 03:53:00 GMT
Message-Id: <201210050353.q953r0rA073170@freefall.freebsd.org>
To: linimon@FreeBSD.org, freebsd-bugs@FreeBSD.org, freebsd-scsi@FreeBSD.org
From: linimon@FreeBSD.org
Subject: Re: kern/171650: [da] da(4) driver does not recognize end of cciss (SmartArray) >volume reconstruction

Old Synopsis: ``da'' driver does not recognize end of cciss (SmartArray) >volume reconstruction
New Synopsis: [da] da(4) driver does not recognize end of cciss (SmartArray) >volume reconstruction

Responsible-Changed-From-To: freebsd-bugs->freebsd-scsi
Responsible-Changed-By: linimon
Responsible-Changed-When: Fri Oct 5 03:52:35 UTC 2012
Responsible-Changed-Why:
Over to maintainer(s).

http://www.freebsd.org/cgi/query-pr.cgi?pr=171650

From owner-freebsd-scsi@FreeBSD.ORG Fri Oct 5 13:13:56 2012
Date: Fri, 5 Oct 2012 13:13:56 +0000
From: John
To: FreeBSD-FS
Cc: FreeBSD-SCSI
Message-ID: <20121005131356.GA13888@FreeBSD.org>
References: <20121003032738.GA42140@FreeBSD.org>
In-Reply-To: <20121003032738.GA42140@FreeBSD.org>
Subject: Re: ZFS/istgt lockup

Copying this reply to -scsi. Not sure if it's more of a zfs issue
or istgt... more below...

----- John's Original Message -----
> Hi Folks,
>
>    I've been chasing a problem that I'm not quite sure originates
> on the BSD side, but the system shouldn't lock up and require a power
> cycle to reboot.
>
>    The config:  I have a bsd system running 9.1RC handing out a
> 36TB volume to a Linux RHEL 6.1 system. The RHEL 6.1 system is
> doing heavy I/O & number crunching. Many hours into the job stream
> the kernel becomes quite unhappy:
>
> kernel: __ratelimit: 27665 callbacks suppressed
> kernel: swapper: page allocation failure. order:1, mode:0x4020
> kernel: Pid: 0, comm: swapper Tainted: G ---------------- T 2.6.32-131.0.15.el6.x86_64 #1
> kernel: Call Trace:
> kernel: [] ? __alloc_pages_nodemask+0x716/0x8b0
> kernel: [] ? alloc_pages_current+0xaa/0x110
> kernel: [] ? refill_fl+0x3d5/0x4a0 [cxgb3]
> kernel: [] ? napi_frags_finish+0x6d/0xb0
> kernel: [] ? process_responses+0x653/0x1450 [cxgb3]
> kernel: [] ? ring_buffer_lock_reserve+0xa2/0x160
> kernel: [] ? napi_rx_handler+0x3c/0x90 [cxgb3]
> kernel: [] ? net_rx_action+0x103/0x2f0
> kernel: [] ? __do_softirq+0xb7/0x1e0
> kernel: [] ? handle_IRQ_event+0xf6/0x170
> kernel: [] ? call_softirq+0x1c/0x30
> kernel: [] ? do_softirq+0x65/0xa0
> kernel: [] ? irq_exit+0x85/0x90
> kernel: [] ? do_IRQ+0x75/0xf0
> kernel: [] ? ret_from_intr+0x0/0x11
> kernel: [] ? native_safe_halt+0xb/0x10
> kernel: [] ? ftrace_raw_event_power_start+0x16/0x20
> kernel: [] ? default_idle+0x4d/0xb0
> kernel: [] ? cpu_idle+0xb6/0x110
> kernel: [] ? start_secondary+0x202/0x245
>
> On the bsd side, the istgt daemon appears to see that one of the
> connection threads is down and attempts to restart it. At this point,
> the istgt process size starts to grow.
>
> USER   PID  %CPU %MEM     VSZ    RSS  TT  STAT STARTED     TIME COMMAND
> root  1224   0.0  0.4 8041092 405472  v0- DL    4:59PM 15:28.72 /usr/local/bin/istgt
> root  1224   0.0  0.4 8041092 405472  v0- IL    4:59PM 63:18.34 /usr/local/bin/istgt
> root  1224   0.0  0.4 8041092 405472  v0- IL    4:59PM 61:13.80 /usr/local/bin/istgt
> root  1224   0.0  0.4 8041092 405472  v0- IL    4:59PM  0:00.00 /usr/local/bin/istgt
>
> There are more than 1400 threads reported.
>
> Also of interest, netstat shows:
>
> tcp4       0      0 10.59.6.12.5010        10.59.25.113.54076     CLOSE_WAIT
> tcp4       0      0 10.60.6.12.5010        10.60.25.113.33345     CLOSED
> tcp4       0      0 10.59.6.12.5010        10.59.25.113.54074     CLOSE_WAIT
> tcp4       0      0 10.60.6.12.5010        10.60.25.113.33343     CLOSED
> tcp4       0      0 10.59.6.12.5010        10.59.25.113.54072     CLOSE_WAIT
> tcp4       0      0 10.60.6.12.5010        10.60.25.113.33341     CLOSED
> tcp4       0      0 10.60.6.12.5010        10.60.25.113.33339     CLOSED
> tcp4       0      0 10.59.6.12.5010        10.59.25.113.54070     CLOSE_WAIT
> tcp4       0      0 10.60.6.12.5010        10.60.25.113.53806     CLOSE_WAIT
>
> There are more than 1400 sockets in the CLOSE* state. What would
> prevent these sockets from cleaning up in a reasonable timeframe?
> Both sides of the mpio connection appear to be attempting reconnects.
>
> An attempt to gracefully kill istgt fails. A kill -9 does not clean
> things up either.
>
> A procstat -kk 1224 after the kill -9 shows:
>
>   PID    TID COMM   TDNAME           KSTACK
>  1224 100959 istgt  sigthread        mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x121 zio_wait+0x61 dbuf_read+0x5e5 dmu_buf_hold+0xe0 zap_lockdir+0x58 zap_lookup_norm+0x45 zap_lookup+0x2e zfs_dirent_lock+0x4ff zfs_dirlook+0x69 zfs_lookup+0x26b zfs_freebsd_lookup+0x81 vfs_cache_lookup+0xf8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 vn_open_cred+0x3cb
>  1224 100960 istgt  luthread #1      mi_switch+0x186 sleepq_wait+0x42 _sleep+0x376 bwait+0x64 physio+0x246 devfs_write_f+0x8d dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 Xfast_syscall+0xf7
>  1224 103533 istgt  sendthread #1493 mi_switch+0x186 thread_suspend_switch+0xc9 thread_single+0x1b2 exit1+0x72 sigexit+0x7c postsig+0x3a4 ast+0x26c doreti_ast+0x1f
>
>
> An attempt to forcefully export the pool hangs also. A procstat
> shows:
>
>   PID    TID COMM   TDNAME           KSTACK
>  4427 100991 zpool  -                mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x121 dbuf_read+0x30b dmu_buf_hold+0xe0 zap_lockdir+0x58 zap_lookup_norm+0x45 zap_lookup+0x2e dsl_dir_open_spa+0x121 dsl_dataset_hold+0x3b dmu_objset_hold+0x23 zfs_ioc_objset_stats+0x2b zfsdev_ioctl+0xe6 devfs_ioctl_f+0x7b kern_ioctl+0x115 sys_ioctl+0xfd amd64_syscall+0x546 Xfast_syscall+0xf7
>
>
>
> If anyone has any ideas, please let me know. I know I've left a lot
> of config information out in an attempt to keep the email shorter.
>
> Random comments:
>
> This happens with or without multipathd enabled on the linux client.
>
> If I catch the istgt daemon while it's creating threads and kill it
> the system will not lock up.
>
> I see no errors in the istgt log file. One of my next things to try
> is to enable all debugging... The amount of debugging data captured
> is quite large :-(
>
> I am using chelsio 10G cards on both client/server which have been
> rock solid in all other cases.
>
> Thoughts welcome!
>
> Thanks,
> John

Hi Folks,

   I've managed to replicate this problem once. Basically, it appears
the linux client sends an abort which is processed here:

istgt_iscsi_op_task:

	switch (function) {
	case ISCSI_TASK_FUNC_ABORT_TASK:
		ISTGT_LOG("ABORT_TASK\n");
		SESS_MTX_LOCK(conn);
		rc = istgt_lu_clear_task_ITLQ(conn, conn->sess->lu, lun,
		    ref_CmdSN);
		SESS_MTX_UNLOCK(conn);
		if (rc < 0) {
			ISTGT_ERRLOG("LU reset failed\n");
		}
		istgt_clear_transfer_task(conn, ref_CmdSN);
		break;

   At this point, the queue depth is 62. There appears to be one thread
in the zfs code performing a read.

   No other processing occurs after this point. A zfs list hangs. The
pool cannot be exported. The istgt daemon cannot be fully killed. A
reboot requires a power reset (i.e., reboot hangs after flushing buffers).

   The only thing that does appear to be happening is a growing list
of connections:

tcp4       0      0 10.60.6.12.5010        10.60.25.113.56577     CLOSE_WAIT
tcp4       0      0 10.60.6.12.5010        10.60.25.113.56576     CLOSE_WAIT
tcp4       0      0 10.60.6.12.5010        10.60.25.113.56575     CLOSE_WAIT
tcp4       0      0 10.60.6.12.5010        10.60.25.113.56574     CLOSE_WAIT
tcp4       0      0 10.60.6.12.5010        10.60.25.113.56573     CLOSE_WAIT
tcp4       0      0 10.60.6.12.5010        10.60.25.113.56572     CLOSE_WAIT
tcp4       0      0 10.60.6.12.5010        10.60.25.113.56571     CLOSE_WAIT
tcp4       0      0 10.60.6.12.5010        10.60.25.113.56570     CLOSE_WAIT
tcp4       0      0 10.60.6.12.5010        10.60.25.113.56569     CLOSE_WAIT

   Currently, about 390 and slowly going up. This implies to me that
there is some sort of reconnect occurring that is failing.

   On the client side, I think the problem is related to a Chelsio N320
10G nic which is showing RX overflows. After showing about 40000 overflows
the ABORT was received on the server side. I've never seen a chelsio card
have overflow problems. The server is using the same model chelsio card
with no issues.

   Again, any thoughts/comments are welcome!

Thanks,
John
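
For anyone who wants to watch the CLOSE_WAIT count grow over time, the
netstat listing above can be tallied by state. This is only a minimal
sketch, not part of istgt or netstat: the function name count_tcp_states
is made up for illustration, and it assumes input lines in the six-column
form quoted in this thread (proto, Recv-Q, Send-Q, local, foreign, state).

```python
from collections import Counter


def count_tcp_states(netstat_lines):
    """Tally the state column of netstat-style tcp lines.

    Hypothetical helper: assumes six whitespace-separated columns
    (proto, Recv-Q, Send-Q, local address, foreign address, state),
    as in the output quoted in this thread. Other lines are skipped.
    """
    counts = Counter()
    for line in netstat_lines:
        fields = line.split()
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            counts[fields[5]] += 1
    return counts


# A few lines copied from the netstat output in this thread.
sample = [
    "tcp4 0 0 10.59.6.12.5010 10.59.25.113.54076 CLOSE_WAIT",
    "tcp4 0 0 10.60.6.12.5010 10.60.25.113.33345 CLOSED",
    "tcp4 0 0 10.59.6.12.5010 10.59.25.113.54074 CLOSE_WAIT",
]

for state, n in sorted(count_tcp_states(sample).items()):
    print(state, n)
```

Feeding it the saved output of netstat at intervals would show whether
the number of sockets stuck in CLOSE_WAIT keeps climbing.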