From owner-freebsd-scsi@FreeBSD.ORG  Fri Oct  5 13:13:56 2012
Delivered-To: freebsd-scsi@freebsd.org
Received: by hub.freebsd.org (Postfix, from userid 821)
	id C387E106564A; Fri, 5 Oct 2012 13:13:56 +0000 (UTC)
Date: Fri, 5 Oct 2012 13:13:56 +0000
From: John
To: FreeBSD-FS
Message-ID: <20121005131356.GA13888@FreeBSD.org>
References: <20121003032738.GA42140@FreeBSD.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20121003032738.GA42140@FreeBSD.org>
User-Agent: Mutt/1.4.2.1i
Cc: FreeBSD-SCSI
Subject: Re: ZFS/istgt lockup
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: SCSI subsystem
X-List-Received-Date: Fri, 05 Oct 2012 13:13:56 -0000

Copying this reply to -scsi. Not sure if it's more of a zfs issue or
istgt... more below...

----- John's Original Message -----
> Hi Folks,
>
> I've been chasing a problem that I'm not quite sure originates
> on the BSD side, but the system shouldn't lock up and require a power
> cycle to reboot.
>
> The config: I have a BSD system running 9.1RC handing out a
> 36TB volume to a Linux RHEL 6.1 system. The RHEL 6.1 system is
> doing heavy I/O & number crunching. Many hours into the job stream
> the kernel becomes quite unhappy:
>
> kernel: __ratelimit: 27665 callbacks suppressed
> kernel: swapper: page allocation failure. order:1, mode:0x4020
> kernel: Pid: 0, comm: swapper Tainted: G ---------------- T 2.6.32-131.0.15.el6.x86_64 #1
> kernel: Call Trace:
> kernel: [] ? __alloc_pages_nodemask+0x716/0x8b0
> kernel: [] ? alloc_pages_current+0xaa/0x110
> kernel: [] ? refill_fl+0x3d5/0x4a0 [cxgb3]
> kernel: [] ? napi_frags_finish+0x6d/0xb0
> kernel: [] ? process_responses+0x653/0x1450 [cxgb3]
> kernel: [] ? ring_buffer_lock_reserve+0xa2/0x160
> kernel: [] ? napi_rx_handler+0x3c/0x90 [cxgb3]
> kernel: [] ? net_rx_action+0x103/0x2f0
> kernel: [] ? __do_softirq+0xb7/0x1e0
> kernel: [] ? handle_IRQ_event+0xf6/0x170
> kernel: [] ? call_softirq+0x1c/0x30
> kernel: [] ? do_softirq+0x65/0xa0
> kernel: [] ? irq_exit+0x85/0x90
> kernel: [] ? do_IRQ+0x75/0xf0
> kernel: [] ? ret_from_intr+0x0/0x11
> kernel: [] ? native_safe_halt+0xb/0x10
> kernel: [] ? ftrace_raw_event_power_start+0x16/0x20
> kernel: [] ? default_idle+0x4d/0xb0
> kernel: [] ? cpu_idle+0xb6/0x110
> kernel: [] ? start_secondary+0x202/0x245
>
> On the BSD side, the istgt daemon appears to see that one of the
> connection threads is down and attempts to restart it. At this point,
> the istgt process size starts to grow.
>
> USER   PID %CPU %MEM     VSZ    RSS  TT  STAT STARTED      TIME COMMAND
> root  1224  0.0  0.4 8041092 405472  v0- DL    4:59PM  15:28.72 /usr/local/bin/istgt
> root  1224  0.0  0.4 8041092 405472  v0- IL    4:59PM  63:18.34 /usr/local/bin/istgt
> root  1224  0.0  0.4 8041092 405472  v0- IL    4:59PM  61:13.80 /usr/local/bin/istgt
> root  1224  0.0  0.4 8041092 405472  v0- IL    4:59PM   0:00.00 /usr/local/bin/istgt
>
> There are more than 1400 threads reported.
>
> Also of interest, netstat shows:
>
> tcp4       0      0 10.59.6.12.5010       10.59.25.113.54076    CLOSE_WAIT
> tcp4       0      0 10.60.6.12.5010       10.60.25.113.33345    CLOSED
> tcp4       0      0 10.59.6.12.5010       10.59.25.113.54074    CLOSE_WAIT
> tcp4       0      0 10.60.6.12.5010       10.60.25.113.33343    CLOSED
> tcp4       0      0 10.59.6.12.5010       10.59.25.113.54072    CLOSE_WAIT
> tcp4       0      0 10.60.6.12.5010       10.60.25.113.33341    CLOSED
> tcp4       0      0 10.60.6.12.5010       10.60.25.113.33339    CLOSED
> tcp4       0      0 10.59.6.12.5010       10.59.25.113.54070    CLOSE_WAIT
> tcp4       0      0 10.60.6.12.5010       10.60.25.113.53806    CLOSE_WAIT
>
> There are more than 1400 sockets in the CLOSE* state. What would
> prevent these sockets from cleaning up in a reasonable timeframe?
> Both sides of the mpio connection appear to be attempting reconnects.
>
> An attempt to gracefully kill istgt fails. A kill -9 does not clean
> things up either.
>
> A procstat -kk 1224 after the kill -9 shows:
>
>   PID    TID COMM             TDNAME           KSTACK
>  1224 100959 istgt            sigthread        mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x121 zio_wait+0x61 dbuf_read+0x5e5 dmu_buf_hold+0xe0 zap_lockdir+0x58 zap_lookup_norm+0x45 zap_lookup+0x2e zfs_dirent_lock+0x4ff zfs_dirlook+0x69 zfs_lookup+0x26b zfs_freebsd_lookup+0x81 vfs_cache_lookup+0xf8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 vn_open_cred+0x3cb
>  1224 100960 istgt            luthread #1      mi_switch+0x186 sleepq_wait+0x42 _sleep+0x376 bwait+0x64 physio+0x246 devfs_write_f+0x8d dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 Xfast_syscall+0xf7
>  1224 103533 istgt            sendthread #1493 mi_switch+0x186 thread_suspend_switch+0xc9 thread_single+0x1b2 exit1+0x72 sigexit+0x7c postsig+0x3a4 ast+0x26c doreti_ast+0x1f
>
> An attempt to forcefully export the pool hangs also. A procstat
> shows:
>
>   PID    TID COMM             TDNAME           KSTACK
>  4427 100991 zpool            -                mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x121 dbuf_read+0x30b dmu_buf_hold+0xe0 zap_lockdir+0x58 zap_lookup_norm+0x45 zap_lookup+0x2e dsl_dir_open_spa+0x121 dsl_dataset_hold+0x3b dmu_objset_hold+0x23 zfs_ioc_objset_stats+0x2b zfsdev_ioctl+0xe6 devfs_ioctl_f+0x7b kern_ioctl+0x115 sys_ioctl+0xfd amd64_syscall+0x546 Xfast_syscall+0xf7
>
> If anyone has any ideas, please let me know. I know I've left a lot
> of config information out in an attempt to keep the email shorter.
>
> Random comments:
>
> This happens with or without multipathd enabled on the Linux client.
>
> If I catch the istgt daemon while it's creating threads and kill it,
> the system will not lock up.
>
> I see no errors in the istgt log file. One of my next things to try
> is to enable all debugging... The amount of debugging data captured
> is quite large :-(
>
> I am using Chelsio 10G cards on both client/server which have been
> rock solid in all other cases.
>
> Thoughts welcome!
>
> Thanks,
> John

Hi Folks,

I've managed to replicate this problem once. Basically, it appears the
Linux client sends an abort which is processed here:

istgt_iscsi_op_task:

	switch (function) {
	case ISCSI_TASK_FUNC_ABORT_TASK:
		ISTGT_LOG("ABORT_TASK\n");
		SESS_MTX_LOCK(conn);
		rc = istgt_lu_clear_task_ITLQ(conn, conn->sess->lu, lun,
		    ref_CmdSN);
		SESS_MTX_UNLOCK(conn);
		if (rc < 0) {
			ISTGT_ERRLOG("LU reset failed\n");
		}
		istgt_clear_transfer_task(conn, ref_CmdSN);
		break;

At this point, the queue depth is 62. There appears to be one thread in
the zfs code performing a read. No other processing occurs after this
point. A zfs list hangs. The pool cannot be exported. The istgt daemon
cannot be fully killed. A reboot requires a power reset (i.e., the
reboot hangs after flushing buffers).
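For what it's worth, here is a toy program that shows the kind of wedge
I think procstat is describing. This is NOT istgt source -- the names
and the waiting logic are just my guesses about the pattern: a worker
thread stuck in a synchronous I/O that never completes (the real traces
show physio/zio_wait; here it's faked with a read() on a pipe nobody
writes to), and an abort/cleanup path that cannot finish until that
task is marked done, so the connection is never torn down.

/*
 * Toy illustration only -- not istgt code, names are made up.
 * Build: cc -pthread -o wedge wedge.c
 * Run it and both threads just sit there; procstat -kk <pid> shows
 * one thread in read() and the other in a condition-variable wait.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t task_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  task_done = PTHREAD_COND_INITIALIZER;
static int task_outstanding = 1;
static int io_pipe[2];

/* Stands in for an LU worker stuck in I/O that never completes. */
static void *
worker(void *arg)
{
	char c;

	(void)arg;
	(void)read(io_pipe[0], &c, 1);	/* blocks forever */

	pthread_mutex_lock(&task_mtx);
	task_outstanding = 0;
	pthread_cond_signal(&task_done);
	pthread_mutex_unlock(&task_mtx);
	return (NULL);
}

/* Stands in for the abort path waiting on the outstanding task. */
static void *
abort_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&task_mtx);
	while (task_outstanding)
		pthread_cond_wait(&task_done, &task_mtx);
	pthread_mutex_unlock(&task_mtx);
	printf("abort completed\n");	/* never reached */
	return (NULL);
}

int
main(void)
{
	pthread_t w, a;

	if (pipe(io_pipe) == -1)
		return (1);
	pthread_create(&w, NULL, worker, NULL);
	sleep(1);		/* let the worker block in read() */
	pthread_create(&a, NULL, abort_path, NULL);
	pthread_join(a, NULL);	/* hangs here for good */
	return (0);
}

If that is roughly the shape of it, the real question is why the ZFS
I/O underneath (the zio_wait in the traces) never completes.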
The only thing that does appear to be happening is a growing list of
connections:

tcp4       0      0 10.60.6.12.5010       10.60.25.113.56577    CLOSE_WAIT
tcp4       0      0 10.60.6.12.5010       10.60.25.113.56576    CLOSE_WAIT
tcp4       0      0 10.60.6.12.5010       10.60.25.113.56575    CLOSE_WAIT
tcp4       0      0 10.60.6.12.5010       10.60.25.113.56574    CLOSE_WAIT
tcp4       0      0 10.60.6.12.5010       10.60.25.113.56573    CLOSE_WAIT
tcp4       0      0 10.60.6.12.5010       10.60.25.113.56572    CLOSE_WAIT
tcp4       0      0 10.60.6.12.5010       10.60.25.113.56571    CLOSE_WAIT
tcp4       0      0 10.60.6.12.5010       10.60.25.113.56570    CLOSE_WAIT
tcp4       0      0 10.60.6.12.5010       10.60.25.113.56569    CLOSE_WAIT

There are currently about 390 of them, and the count is slowly going
up. This implies to me that there is some sort of reconnect occurring
that is failing.

On the client side, I think the problem is related to a Chelsio N320
10G NIC which is showing RX overflows. After about 40000 overflows had
been logged, the ABORT was received on the server side. I've never seen
a Chelsio card have overflow problems. The server is using the same
model Chelsio card with no issues.

Again, any thoughts/comments are welcome!

Thanks,
John
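
P.S. To partly answer my own question above about what keeps those
sockets from cleaning up: CLOSE_WAIT just means the peer has sent its
FIN and the local process has not yet close()d its end of the socket;
the kernel keeps the socket in that state for as long as the descriptor
stays open. Since the istgt threads are wedged in the kernel and the
process cannot even exit, none of the descriptors ever get released, so
every abandoned reconnect leaves another socket behind. A throwaway
demonstration of that behaviour (nothing to do with istgt; the port
number is arbitrary):

/*
 * Accept one connection and never close() it.  Connect with
 * "nc <host> 4444", quit nc, and "netstat -an | grep 4444" on this
 * side will show the socket stuck in CLOSE_WAIT until this process
 * closes the descriptor or exits.
 */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <err.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	struct sockaddr_in sin;
	int s, c, one = 1;

	if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1)
		err(1, "socket");
	setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(4444);
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1)
		err(1, "bind");
	if (listen(s, 1) == -1)
		err(1, "listen");

	if ((c = accept(s, NULL, NULL)) == -1)
		err(1, "accept");

	/*
	 * Pretend this thread is wedged: the peer disconnects, the
	 * kernel moves the socket to CLOSE_WAIT, and it stays there
	 * because we never call close(c).
	 */
	for (;;)
		pause();
	/* NOTREACHED */
}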