From: Bengt Ahlgren
To: stable@freebsd.org
Subject: Re: ZFS deadlock?
In-Reply-To: (Bengt Ahlgren's message of "Fri, 15 Aug 2014 16:34:11 +0200")
Date: Mon, 18 Aug 2014 10:20:47 +0200

Bengt Ahlgren writes:

> During a copy (zfs send/recv) of a ~1TB dataset from one zpool to
> another, my system seems to run into some issues.  A simultaneous "find"
> on the source data set deadlocks.  This is the kernel stack:
>
> $ procstat -kk 1786
>   PID    TID COMM             TDNAME           KSTACK
>  1786 101344 find             -                mi_switch+0x194 sleepq_wait+0x42 _cv_wait+0x112 zio_wait+0x61 dbuf_read+0x619 dmu_buf_hold+0xe0 zap_get_leaf_byblk+0x4a zap_deref_leaf+0x68 fzap_cursor_retrieve+0xe7 zap_cursor_retrieve+0x155 zfs_freebsd_readdir+0x2d8 VOP_READDIR_APV+0x78 kern_getdirentries+0x212 sys_getdirentries+0x23 amd64_syscall+0x5ea Xfast_syscall+0xf7
>
> The zfs send/recv has gotten very slow, albeit seems to make very slow
> progress (copy is, as obvious, from p0 to p2):
>
> p0          15.9T  2.20T    318      0  10.2M      0
> p1          11.1T  7.00T      0      0      0      0
> p2          2.55T  41.0T      0      0      0      0
> ----------  -----  -----  -----  -----  -----  -----
> p0          15.9T  2.20T    294      0  9.29M      0
> p1          11.1T  7.00T      0      0      0      0
> p2          2.55T  41.0T      0      0      0      0
> ----------  -----  -----  -----  -----  -----  -----
> p0          15.9T  2.20T    307      0  9.12M      0
> p1          11.1T  7.00T      0      0      0      0
> p2          2.55T  41.0T      0      0      0      0
> ----------  -----  -----  -----  -----  -----  -----
> p0          15.9T  2.20T    293      0  8.69M      0
> p1          11.1T  7.00T      0      0      0      0
> p2          2.55T  41.0T      0     58      0  1.61M
> ----------  -----  -----  -----  -----  -----  -----
> p0          15.9T  2.20T    301      0  10.9M      0
> p1          11.1T  7.00T      0      0      0      0
> p2          2.55T  41.0T      0  1.62K      0  49.6M
> ----------  -----  -----  -----  -----  -----  -----
>
> The machine is otherwise quite idle.
> When the copy started, I got
> around 200MB/s, now it's around 10MB/s.
>
> The ARC has gotten large, but that is likely normal:
>
> last pid:  1863;  load averages:  0.20,  0.33,  0.63    up 0+02:27:44  16:31:52
> 50 processes:  1 running, 49 sleeping
> CPU:  0.0% user,  0.0% nice,  0.2% system,  0.0% interrupt, 99.8% idle
> Mem: 1688M Active, 61M Inact, 107G Wired, 3288K Cache, 126M Buf, 15G Free
> ARC: 99G Total, 2483M MFU, 89G MRU, 33M Anon, 888M Header, 7427M Other
> Swap: 128G Total, 128G Free
>
>   PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
>  1229 root          1  20    0 39700K  3292K piperd  7  24:27   1.07% zfs
>  1228 root          2  20    0 39832K  3420K nanslp  5  17:02   0.39% zfs
> ...
>
> The source pool is pretty filled up, can that be an issue?
>
> $ zpool list
> NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
> p0    18.1T  15.9T  2.20T    87%  1.00x  ONLINE  -
> p1    18.1T  11.1T  7.00T    61%  1.00x  ONLINE  -
> p2    43.5T  2.53T  41.0T     5%  1.00x  ONLINE  -
>
> The machine is running 9.3-REL and has two mps controllers.
>
> Any ideas?

Just for the record: there was no deadlock after all.  It turned out to
be caused by a directory with ~4.5M entries.

Bengt
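
For anyone who sees the same symptom (a process that appears stuck in
readdir), here is a minimal sketch for finding oversized directories,
assuming Python 3 is available on the machine.  The function name
largest_directories and its "top" parameter are illustrative only; they
are not part of any ZFS or FreeBSD tooling.

```python
import heapq
import os


def largest_directories(root, top=5):
    """Return up to `top` (entry_count, path) pairs under `root`, largest first."""
    counts = []
    stack = [root]
    while stack:
        path = stack.pop()
        n = 0
        try:
            with os.scandir(path) as it:
                for entry in it:
                    n += 1
                    # Do not follow symlinks, to avoid loops and to stay
                    # inside the tree being examined.
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
        except OSError:
            # Unreadable or vanished directory: skip it.
            continue
        counts.append((n, path))
    # Tuples compare by entry count first, so this yields the biggest ones.
    return heapq.nlargest(top, counts)
```

Calling largest_directories("/some/mountpoint") (a placeholder path)
returns the directories with the most entries.  Note that the scan has
to read every directory itself, so on a directory with millions of
entries it will be slow for the same reason "find" was.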