From: bugzilla-noreply@freebsd.org
To: bugs@FreeBSD.org
Subject: [Bug 243814] ZFS deadlock when adding cache partition
Date: Sun, 02 Feb 2020 18:05:52 +0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=243814

            Bug ID: 243814
           Summary: ZFS deadlock when adding cache partition
           Product: Base System
           Version: 12.1-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: jfc@mit.edu

My system hung in ZFS, probably a deadlock. I doubt this is reproducible, but there have been scattered other reports, so I'll add more data.
The machine is an HPE ProLiant with an AMD CPU, 96 GB of RAM, a UFS root on NVMe, a ZFS mirrored pool on two spinning disks, and a ZFS raidz2 pool on five spinning disks encrypted with full-disk geli. Both pools have failmode=continue. The mirrored pool was idle; the raidz2 pool caused the hang.

zpool add ... cache ... was hung waiting on tx->tx_sync_done_cv. Meanwhile, a zpool iostat was hung waiting on spa_namespace_lock. shutdown -r failed to reboot, probably because of the same deadlock. After the 90-second timeout in rc.shutdown, init attempted to go into single-user mode but never started a shell. I had to power-cycle the system.

Now back in time to the setup. I was moving a lot of data into the pool and from filesystem to filesystem within the pool. I noticed the transfer tended to stop for minutes at a time, and ZFS administrative actions and the sync command also took several minutes, probably waiting for many gigabytes of dirty buffers to be written. I thought adding a cache drive might help, so I created a partition on the NVMe drive and ran

# zpool add private cache nvd0p4

("private" is the name of my pool.) And that's where everything went wrong. Instead of a 1-10 minute wait for buffers to drain, I/O to the pool was totally hung.
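As a rough illustration of why multi-minute stalls are plausible while dirty data drains (all numbers here are assumptions for the sake of arithmetic, not measurements from my system):

```python
# Back-of-the-envelope estimate of how long a sync-type operation could
# block while ZFS flushes dirty data. Assumed, illustrative numbers only.
dirty_bytes = 4 * 2**30      # assume ~4 GiB of accumulated dirty data
data_disks = 3               # a 5-disk raidz2 has roughly 3 data disks of bandwidth
disk_bytes_s = 20 * 10**6    # assume ~20 MB/s effective per geli-encrypted spinning disk

drain_seconds = dirty_bytes / (data_disks * disk_bytes_s)
print(f"~{drain_seconds:.0f} seconds to drain")
```

Under those assumptions a flush takes on the order of a minute; lower effective throughput (scattered writes, encryption overhead) stretches it further, which matches the minutes-long pauses I saw.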
Control-T reported:

load: 3.47 cmd: zpool 79137 [scl->scl_cv] 1.09r 0.00u 0.00s 0% 4284k
load: 3.47 cmd: zpool 79137 [scl->scl_cv] 2.32r 0.00u 0.00s 0% 4284k
load: 3.19 cmd: zpool 79137 [scl->scl_cv] 6.86r 0.00u 0.00s 0% 4284k
load: 1.61 cmd: zpool 79137 [tx->tx_sync_done_cv] 500.21r 0.00u 0.00s 0% 4284k
load: 0.96 cmd: zpool 79137 [tx->tx_sync_done_cv] 566.71r 0.00u 0.00s 0% 4284k
load: 0.75 cmd: zpool 79137 [tx->tx_sync_done_cv] 577.43r 0.00u 0.00s 0% 4284k
load: 0.61 cmd: zpool 79137 [tx->tx_sync_done_cv] 611.66r 0.00u 0.00s 0% 4284k
load: 0.14 cmd: zpool 79137 [tx->tx_sync_done_cv] 736.87r 0.00u 0.00s 0% 4284k
load: 0.05 cmd: zpool 79137 [tx->tx_sync_done_cv] 792.23r 0.00u 0.00s 0% 4284k
load: 0.18 cmd: zpool 79137 [tx->tx_sync_done_cv] 997.17r 0.00u 0.00s 0% 4284k
load: 0.37 cmd: zpool 79137 [tx->tx_sync_done_cv] 1198.80r 0.00u 0.00s 0% 4284k
load: 0.49 cmd: zpool 79137 [tx->tx_sync_done_cv] 1339.23r 0.00u 0.00s 0% 4284k

Meanwhile, in another shell, ^T to a zpool iostat reported:

load: 0.24 cmd: zpool 50732 [spa_namespace_lock] 179170.71r 0.01u 0.08s 0% 1068k

/var/log/messages has these lines that seem relevant. I created a partition intended for cache and then resized it before using it. That seems to create the error 6 condition; I've seen it elsewhere but without any side effects.

kernel: g_access(958): provider gptid/af44cd24-36ef-11ea-a744-48df37a69238 has error 6 set
syslogd: last message repeated 2 times
kernel: g_dev_taste: make_dev_p() failed (gp->name=gptid/af44cd24-36ef-11ea-a744-48df37a69238, error=17)
ZFS[79140]: vdev state changed, pool_guid=$706653905921838876 vdev_guid=$2416291949121178716
ZFS[79141]: vdev state changed, pool_guid=$706653905921838876 vdev_guid=$2416291949121178716

The partition UUID is for the cache partition and the pool UUID is for the pool I was using. (There is also a second pool that was mounted but idle.) But I have failmode=continue, so an I/O error should not hang the system.
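For anyone skimming the ^T output: each line names the wait channel in brackets and the elapsed real time before the "r". A small sketch (purely illustrative; the regex is an assumption based on the lines shown here) that pulls out those two fields, making it easy to see the process move from the spa config lock (scl->scl_cv) to waiting forever on txg sync (tx->tx_sync_done_cv):

```python
import re

# Match FreeBSD ^T (SIGINFO) status lines like the ones above and extract
# the wait channel and elapsed real time. Pattern is an assumption derived
# from the sample lines in this report, not an official format spec.
LINE = re.compile(
    r"load: (?P<load>[\d.]+) cmd: (?P<cmd>\S+) (?P<pid>\d+) "
    r"\[(?P<wchan>[^\]]+)\] (?P<real>[\d.]+)r"
)

def parse(line):
    """Return (wait_channel, elapsed_real_seconds), or None if no match."""
    m = LINE.match(line)
    return (m.group("wchan"), float(m.group("real"))) if m else None

sample = "load: 1.61 cmd: zpool 79137 [tx->tx_sync_done_cv] 500.21r 0.00u 0.00s 0% 4284k"
print(parse(sample))  # -> ('tx->tx_sync_done_cv', 500.21)
```

A steadily growing elapsed time on the same wait channel, as in the series above, is what distinguishes a true hang from a merely slow operation.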
-- 
You are receiving this mail because:
You are the assignee for the bug.