From: bugzilla-noreply@freebsd.org
To: freebsd-bugs@FreeBSD.org
Subject: [Bug 225337] z_teardown_inactive_lock held inordinately long
Date: Sat, 20 Jan 2018 07:03:56 +0000
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225337

            Bug ID: 225337
           Summary: z_teardown_inactive_lock held inordinately long
           Product: Base System
           Version: 11.1-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: freebsd-bugs@FreeBSD.org
          Reporter: wollman@FreeBSD.org

On one of our large NFS servers, it seems that some process holds
zfsvfs->z_teardown_inactive_lock far too long -- on the order of ten minutes
or more -- causing all filesystem activity to hang. The exact same
configuration and activity patterns did not have such a hang under 10.3. I
believe from web searches that this lock is implicated in ZFS dataset
rollback and consequently in zfs recv -F, but the hang only seems to take
place when we have both pull replication (zfs recv) *and* active
(through-the-filesystem) backups running at the same time, which usually
only happens late at night. There are no console messages or other
indications of faults in the underlying storage system. The system as a
whole becomes completely unusable and our monitoring system raises alarms,
but it doesn't actually crash, and whatever it was eventually does complete
without visible errors.
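For context, the pull replication mentioned above follows the usual
incremental zfs send/recv pattern; a minimal sketch (the host, pool, and
snapshot names here are hypothetical, not taken from the actual setup):

```shell
# On the pulling (backup) side: fetch an incremental stream from the
# source host and apply it locally.  The -F flag forces a rollback of
# the local dataset to the most recent common snapshot before the
# stream is applied -- the rollback path is what is believed to take
# zfsvfs->z_teardown_inactive_lock.
ssh source-host zfs send -i tank/data@yesterday tank/data@today | \
    zfs recv -F backup/data
```

If the receiving dataset is simultaneously being walked by a
through-the-filesystem backup, that rollback would contend with the
in-flight vnode activity, which matches the observed timing.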
I'm temporarily disabling the replication job to see if that truly is the
smoking gun. Or rather, I'm going to do that once I get access to the
filesystem again. Example, taken from my ssh session over the past hour
(these are all waiting for the same shell script to *begin executing*):

load: 0.82 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 7.42r 0.00u 0.00s 0% 3624k
load: 0.71 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 23.00r 0.00u 0.00s 0% 3624k
load: 0.59 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 38.85r 0.00u 0.00s 0% 3624k
load: 1.02 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 88.32r 0.00u 0.00s 0% 3624k
load: 0.81 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 149.97r 0.00u 0.00s 0% 3624k
load: 0.76 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 181.17r 0.00u 0.00s 0% 3624k
load: 1.51 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 243.76r 0.00u 0.00s 0% 3624k
load: 0.96 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 282.39r 0.00u 0.00s 0% 3624k
load: 1.50 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 333.94r 0.00u 0.00s 0% 3624k
load: 0.93 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 392.77r 0.00u 0.00s 0% 3624k
load: 0.84 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 457.04r 0.00u 0.00s 0% 3624k
load: 0.85 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 526.06r 0.00u 0.00s 0% 3624k
load: 0.40 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 588.82r 0.00u 0.00s 0% 3624k

My suspicion is that the primary vector is zfs recv on a dataset that is
currently being backed up, but why this causes all other filesystem
activity to become blocked is a bit unclear to me. (Race to the root? I
think the backup software uses openat(2) and shouldn't cause that sort of
problem, but maybe random NFS clients can.)

-- 
You are receiving this mail because:
You are the assignee for the bug.