From: Bengt Ahlgren
To: fs@freebsd.org
Subject: ZFS hang in zfs_freebsd_rename
Date: Tue, 15 Dec 2015 13:51:59 +0100

We have a server running 9.3-REL which currently has two quite large
zfs pools:

NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
p1    18.1T  10.7T  7.38T    59%  1.00x  ONLINE  -
p2    43.5T  29.1T  14.4T    66%  1.00x  ONLINE  -

It has been running without any issues for some time now. Just now,
for the first time, processes are getting stuck, and cannot be killed,
when they access a particular directory in the p2 pool. That pool is a
2x6 disk raidz2.

One process is stuck in zfs_freebsd_rename, and other processes
accessing that particular directory also get stuck. The system is now
almost completely idle.

Output from kgdb on the running system for that first process:

Thread 651 (Thread 102157):
#0  sched_switch (td=0xfffffe0b14059920, newtd=0xfffffe001633e920, flags=)
    at /usr/src/sys/kern/sched_ule.c:1904
#1  0xffffffff808f4604 in mi_switch (flags=260, newtd=0x0)
    at /usr/src/sys/kern/kern_synch.c:485
#2  0xffffffff809308e2 in sleepq_wait (wchan=0xfffffe0135b60488, pri=96)
    at /usr/src/sys/kern/subr_sleepqueue.c:618
#3  0xffffffff808cf922 in __lockmgr_args (lk=0xfffffe0135b60488, flags=524544,
    ilk=0xfffffe0135b604b8, wmesg=, pri=, timo=,
    file=0xffffffff80f0d782 "/usr/src/sys/kern/vfs_subr.c", line=2337)
    at /usr/src/sys/kern/kern_lock.c:221
#4  0xffffffff80977369 in vop_stdlock (ap=) at lockmgr.h:97
#5  0xffffffff80dd4a04 in VOP_LOCK1_APV (vop=0xffffffff813e8160,
    a=0xffffffa07f935520) at vnode_if.c:2052
#6  0xffffffff80998c17 in _vn_lock (vp=0xfffffe0135b603f0, flags=524288,
    file=0xffffffff80f0d782 "/usr/src/sys/kern/vfs_subr.c", line=2337)
    at vnode_if.h:859
#7  0xffffffff8098b621 in vputx (vp=0xfffffe0135b603f0, func=1)
    at /usr/src/sys/kern/vfs_subr.c:2337
#8  0xffffffff81ac7955 in zfs_rename_unlock (zlpp=0xffffffa07f9356b8)
    at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:3609
#9  0xffffffff81ac8c72 in zfs_freebsd_rename (ap=)
    at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:4039
#10 0xffffffff80dd4f04 in VOP_RENAME_APV (vop=0xffffffff81b47d40,
    a=0xffffffa07f9358e0) at vnode_if.c:1522
#11 0xffffffff80996bbd in kern_renameat (td=, oldfd=, old=, newfd=-100,
    new=0x1826a9af00 , pathseg=) at vnode_if.h:636
#12 0xffffffff80cd228a in amd64_syscall (td=0xfffffe0b14059920, traced=0)
    at subr_syscall.c:135
#13 0xffffffff80cbc907 in Xfast_syscall ()
    at /usr/src/sys/amd64/amd64/exception.S:396
#14 0x0000000800cc1acc in ?? ()
Previous frame inner to this frame (corrupt stack?)

Full procstat -kk -a and kgdb "thread apply all bt" output can be found
here:

https://www.sics.se/~bengta/ZFS-hang/

I don't know how to produce "alltrace in ddb" as the instructions in
the wiki say. The machine runs the GENERIC kernel, so perhaps it isn't
possible?

I checked "camcontrol tags" for all the disks in the pool; all of them
show zeroes for dev_active, devq_queued and held.

Is there anything else I can check while the machine is up? I need to
restart it pretty soon, however.

Bengt
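
A note for reading the trace: the flags values in frames #3 (__lockmgr_args) and #6 (_vn_lock) are bit masks that can be decoded by hand. The following is only a minimal sketch. The LK_* bit values are an assumption, written down from memory of sys/sys/lockmgr.h in the 9.x tree; they are not part of the report and should be checked against /usr/src/sys/sys/lockmgr.h on the affected machine. The helper name decode_lk_flags is hypothetical.

```python
# Assumed LK_* bit values (from memory of sys/sys/lockmgr.h, FreeBSD 9.x).
# Verify against the running source tree before relying on them.
LK_FLAGS = {
    0x000100: "LK_INTERLOCK",
    0x000200: "LK_NOWAIT",
    0x000400: "LK_RETRY",
    0x080000: "LK_EXCLUSIVE",
    0x200000: "LK_SHARED",
}


def decode_lk_flags(value):
    """Return names of the known bits set in value; unknown bits are kept as hex."""
    names = []
    remainder = value
    for bit, name in sorted(LK_FLAGS.items()):
        if value & bit:
            names.append(name)
            remainder &= ~bit
    if remainder:
        names.append(hex(remainder))
    return names


# Flag values copied from frames #3 and #6 of the backtrace.
for frame, flags in (("#3 __lockmgr_args", 524544), ("#6 _vn_lock", 524288)):
    print(frame, hex(flags), " | ".join(decode_lk_flags(flags)))
```

If the assumed values hold, 524544 is 0x80100 and 524288 is 0x80000, so both frames are asking for an exclusive lock on the vnode. That would be consistent with the thread sleeping in sleepq_wait until whoever currently holds that vnode lock releases it.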