Date: Sun, 02 Feb 2020 18:05:52 +0000
From: bugzilla-noreply@freebsd.org
To: bugs@FreeBSD.org
Subject: [Bug 243814] ZFS deadlock when adding cache partition
Message-ID: <bug-243814-227@https.bugs.freebsd.org/bugzilla/>
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=243814

            Bug ID: 243814
           Summary: ZFS deadlock when adding cache partition
           Product: Base System
           Version: 12.1-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: jfc@mit.edu

My system hung in ZFS, probably a deadlock. I doubt this is reproducible, but
there have been scattered other reports, so I'll add more data.

The machine is an HPE ProLiant with an AMD CPU, 96 GB of RAM, a UFS root on
NVMe, a ZFS mirrored pool on two spinning disks, and a ZFS raidz2 pool on five
spinning disks encrypted with full-disk geli. Both pools have
failmode=continue. The mirrored pool was idle; the raidz2 pool is the one that
caused the hang.

zpool add ... cache ... hung waiting on tx->tx_sync_done_cv. Meanwhile, a
zpool iostat hung waiting on spa_namespace_lock. shutdown -r failed to reboot,
probably because of the deadlock. After the 90 second timeout in rc.shutdown,
init attempted to go to single-user mode but never started a shell. I had to
power-cycle the system.

Now back in time to the setup. I was moving a lot of data into the pool and
from filesystem to filesystem within the pool. I noticed the transfer tended
to stop for minutes at a time, and ZFS administrative actions and the sync
command also took several minutes, probably waiting for many gigabytes of
dirty buffers to be written. I thought adding a cache drive might help, so I
created a partition on the NVMe drive and ran

# zpool add private cache nvd0p4

("private" is the name of my pool.) And that's where everything went wrong.
Instead of a 1-10 minute wait for buffers to drain, I/O to the pool was
totally hung. Control-T reported:

load: 3.47  cmd: zpool 79137 [scl->scl_cv] 1.09r 0.00u 0.00s 0% 4284k
load: 3.47  cmd: zpool 79137 [scl->scl_cv] 2.32r 0.00u 0.00s 0% 4284k
load: 3.19  cmd: zpool 79137 [scl->scl_cv] 6.86r 0.00u 0.00s 0% 4284k
load: 1.61  cmd: zpool 79137 [tx->tx_sync_done_cv] 500.21r 0.00u 0.00s 0% 4284k
load: 0.96  cmd: zpool 79137 [tx->tx_sync_done_cv] 566.71r 0.00u 0.00s 0% 4284k
load: 0.75  cmd: zpool 79137 [tx->tx_sync_done_cv] 577.43r 0.00u 0.00s 0% 4284k
load: 0.61  cmd: zpool 79137 [tx->tx_sync_done_cv] 611.66r 0.00u 0.00s 0% 4284k
load: 0.14  cmd: zpool 79137 [tx->tx_sync_done_cv] 736.87r 0.00u 0.00s 0% 4284k
load: 0.05  cmd: zpool 79137 [tx->tx_sync_done_cv] 792.23r 0.00u 0.00s 0% 4284k
load: 0.18  cmd: zpool 79137 [tx->tx_sync_done_cv] 997.17r 0.00u 0.00s 0% 4284k
load: 0.37  cmd: zpool 79137 [tx->tx_sync_done_cv] 1198.80r 0.00u 0.00s 0% 4284k
load: 0.49  cmd: zpool 79137 [tx->tx_sync_done_cv] 1339.23r 0.00u 0.00s 0% 4284k

Meanwhile, in another shell, ^T on a zpool iostat reported:

load: 0.24  cmd: zpool 50732 [spa_namespace_lock] 179170.71r 0.01u 0.08s 0% 1068k

/var/log/messages has these lines that seem relevant. I created a partition
intended for the cache and then resized it before using it. That seems to
create the error 6 condition; I've seen it elsewhere but without any side
effects.

kernel: g_access(958): provider gptid/af44cd24-36ef-11ea-a744-48df37a69238 has error 6 set
syslogd: last message repeated 2 times
kernel: g_dev_taste: make_dev_p() failed (gp->name=gptid/af44cd24-36ef-11ea-a744-48df37a69238, error=17)
ZFS[79140]: vdev state changed, pool_guid=$706653905921838876 vdev_guid=$2416291949121178716
ZFS[79141]: vdev state changed, pool_guid=$706653905921838876 vdev_guid=$2416291949121178716

The partition UUID is for the cache partition, and the pool UUID is for the
pool I was using. (There is also a second pool that was mounted but idle.)
But I have failmode=continue, so an I/O error should not hang the system.
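For reference, the sequence of commands that set up the cache device was
roughly the following. The gpart partition type, index, and sizes here are
reconstructed from memory rather than copied from a transcript, so treat them
as approximate; only the final zpool add line is exactly what I ran.

# gpart add -t freebsd-zfs -s 100G nvd0     (created nvd0p4; type and size approximate)
# gpart resize -i 4 -s 200G nvd0            (resized it before use; this appears to set the error 6 condition)
# zpool add private cache nvd0p4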
-- 
You are receiving this mail because:
You are the assignee for the bug.
