Date:        Sat, 19 Jan 2013 19:00:53 +0200
From:        Andriy Gapon <avg@FreeBSD.org>
To:          Gavin Atkinson <gavin@FreeBSD.org>
Cc:          freebsd-fs@FreeBSD.org
Subject:     Re: ZFS lock up 9-stable r244911 (Jan)
Message-ID:  <50FAD145.10906@FreeBSD.org>
In-Reply-To: <alpine.BSF.2.00.1301181356140.29541@thunderhorn.york.ac.uk>
References:  <alpine.BSF.2.00.1301181356140.29541@thunderhorn.york.ac.uk>
on 18/01/2013 17:01 Gavin Atkinson said the following:
> Hi all,
>
> I have a machine on which ZFS appears to have locked up, and any processes
> that attempt to access the ZFS filesystem.  This machine is running
> 9-stable amd64 r244911 (though from cvs, not SVN), and therefore I believe
> has all of avg's ZFS deadlock patches.
>
> This machine has both UFS and ZFS filesystems.  All of the "system"
> filesystems are on UFS, and as a result the machine itself is responsive
> and I can investigate state using kgdb against the live kernel.  I've
> included all thread backtraces, a couple of other bits relating to held
> locks, and ps/sysctl output at
> http://people.freebsd.org/~gavin/tay-zfs-hang.txt
> http://people.freebsd.org/~gavin/tay-sysctl-a.txt
> http://people.freebsd.org/~gavin/tay-ps-auxwwwH.txt
>
> This machine was in use as a pkgng package builder, using poudriere.
> Poudriere makes heavy use of zfs filesystems within jails, including "zfs
> get", "zfs set", "zfs snapshot", "zfs rollback", "zfs diff" and other
> commands, although there do not appear to be any instances of the zfs
> process running currently.  At the time of the hang 16 parallel builds
> were in progress.
>
> The underlying disk subsystem is a single hardware RAID-10 on a twa
> controller, and the zpool is on a single partition of this device.  The
> RAID-10 itself is intact, the controller reports no errors.  There is no
> L2ARC or separate ZIL.  The UFS filesystems (which still seem to be fully
> functional) are on separate partitions on the same underlying device, so I
> do not believe the underlying storage is having issues.  I can "dd" from
> the underlying ZFS partition without problem.  Nothing has been logged to
> /var/log/messages.

Based on the above information, plus some additional debugging information
that Gavin has kindly provided to me, I've developed the following "theory"
to explain this deadlock.

I believe that there was very high disk load (overwhelmingly high load)
before the deadlock occurred.  Further, I think that there was a substantial
number of high-priority writes.  Under those conditions the number of
in-progress/pending zio-s was constantly at zfs_vdev_max_pending (10 by
default), while the number of queued zio-s was above a hundred:

vdev_queue = {
    vq_deadline_tree = {avl_root = 0xfffffe0338dbb248,
        avl_compar = 0xffffffff816855b0 <vdev_queue_deadline_compare>,
        avl_offset = 584, avl_numnodes = 116, avl_size = 896},
    vq_read_tree = {avl_root = 0xfffffe019d0b65b0,
        avl_compar = 0xffffffff81685600 <vdev_queue_offset_compare>,
        avl_offset = 560, avl_numnodes = 8, avl_size = 896},
    vq_write_tree = {avl_root = 0xfffffe03e3d19230,
        avl_compar = 0xffffffff81685600 <vdev_queue_offset_compare>,
        avl_offset = 560, avl_numnodes = 108, avl_size = 896},
    vq_pending_tree = {avl_root = 0xfffffe025e32c230,
        avl_compar = 0xffffffff81685600 <vdev_queue_offset_compare>,
        avl_offset = 560, avl_numnodes = 10, avl_size = 896},
    vq_lock = {lock_object = {lo_name = 0xffffffff8172afc9 "vq->vq_lock",
        lo_flags = 40960000, lo_data = 0, lo_witness = 0x0}, sx_lock = 1}},
vdev_cache = {
    vc_offset_tree = {avl_root = 0x0,
        avl_compar = 0xffffffff81681740 <vdev_cache_offset_compare>,
        avl_offset = 24, avl_numnodes = 0, avl_size = 88},
    vc_lastused_tree = {avl_root = 0x0,
        avl_compar = 0xffffffff81681760 <vdev_cache_lastused_compare>,
        avl_offset = 48, avl_numnodes = 0, avl_size = 88}
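As an aside, the piece of the vdev queue behaviour that matters for this
theory can be sketched like this (a deliberately simplified toy of my own,
not the actual vdev_queue.c code): once the pending tree is at
zfs_vdev_max_pending, newly queued zio-s can only accumulate, and only a
completion frees up a slot - which is exactly the state captured above
(10 in the pending tree, 108 queued writes).

#include <stdbool.h>
#include <stddef.h>

#define	TOY_MAX_PENDING	10	/* plays the role of zfs_vdev_max_pending */

struct toy_vdev_queue {
	size_t	queued;		/* zio-s waiting in the deadline/offset trees */
	size_t	pending;	/* zio-s already issued to the device (vq_pending_tree) */
};

/* Try to issue one queued zio; false means the cap (or an empty queue) stops it. */
static bool
toy_vdev_queue_issue(struct toy_vdev_queue *vq)
{
	if (vq->queued == 0 || vq->pending >= TOY_MAX_PENDING)
		return (false);	/* saturated: new zio-s just pile up in the trees */
	vq->queued--;
	vq->pending++;
	return (true);
}

/* Retiring a pending zio is the only thing that makes room for the next one. */
static void
toy_vdev_queue_done(struct toy_vdev_queue *vq)
{
	if (vq->pending > 0)
		vq->pending--;
	(void)toy_vdev_queue_issue(vq);
}

So if completions stop being retired, the whole queue freezes in exactly
this picture.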
Apparently processing of zio-s was lagging so far behind that some of the
zio-s being executed triggered the "late arrival" condition.

My incomplete understanding shows here - I am not sure what exactly triggers
the condition or what is so special about it - but from the following stack
traces we can see that all five of the zio_write_intr_high taskqueue threads
were executing dmu_sync_late_arrival_done():

    0 100432 kernel           zio_write_intr_h mi_switch+0x186 sleepq_wait+0x42 _sx_xlock_hard+0x426 _sx_xlock+0x51 txg_rele_to_sync+0x36 dmu_tx_commit+0xf1 dmu_sync_late_arrival_done+0x52 zio_done+0x353 zio_execute+0xc3 zio_done+0x3d0 zio_execute+0xc3 taskqueue_run_locked+0x74 taskqueue_thread_loop+0x46 fork_exit+0x11f fork_trampoline+0xe
    0 100433 kernel           zio_write_intr_h mi_switch+0x186 sleepq_wait+0x42 _sx_xlock_hard+0x426 _sx_xlock+0x51 txg_rele_to_sync+0x36 dmu_tx_commit+0xf1 dmu_sync_late_arrival_done+0x52 zio_done+0x353 zio_execute+0xc3 zio_done+0x3d0 zio_execute+0xc3 taskqueue_run_locked+0x74 taskqueue_thread_loop+0x46 fork_exit+0x11f fork_trampoline+0xe
    0 100434 kernel           zio_write_intr_h mi_switch+0x186 sleepq_wait+0x42 _sx_xlock_hard+0x426 _sx_xlock+0x51 txg_rele_to_sync+0x36 dmu_tx_commit+0xf1 dmu_sync_late_arrival_done+0x52 zio_done+0x353 zio_execute+0xc3 zio_done+0x3d0 zio_execute+0xc3 taskqueue_run_locked+0x74 taskqueue_thread_loop+0x46 fork_exit+0x11f fork_trampoline+0xe
    0 100435 kernel           zio_write_intr_h mi_switch+0x186 sleepq_wait+0x42 _sx_xlock_hard+0x426 _sx_xlock+0x51 txg_rele_to_sync+0x36 dmu_tx_commit+0xf1 dmu_sync_late_arrival_done+0x52 zio_done+0x353 zio_execute+0xc3 zio_done+0x3d0 zio_execute+0xc3 taskqueue_run_locked+0x74 taskqueue_thread_loop+0x46 fork_exit+0x11f fork_trampoline+0xe
    0 100436 kernel           zio_write_intr_h mi_switch+0x186 sleepq_wait+0x42 _sx_xlock_hard+0x426 _sx_xlock+0x51 txg_rele_to_sync+0x36 dmu_tx_commit+0xf1 dmu_sync_late_arrival_done+0x52 zio_done+0x353 zio_execute+0xc3 zio_done+0x3d0 zio_execute+0xc3 taskqueue_run_locked+0x74 taskqueue_thread_loop+0x46 fork_exit+0x11f fork_trampoline+0xe

In addition to the above, the taskqueue associated with these threads has
another 9 pending tasks.

As you can see, the "late arrival" code path involves txg_rele_to_sync(),
where an instance of tc_lock is acquired.
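To make that dependency concrete: the late-arrival done routine calls
dmu_tx_commit(), which calls txg_rele_to_sync(), and that has to take the
per-CPU tc_lock to drop the open-txg hold count.  Roughly like the following
toy of mine (a pthread mutex standing in for the kernel sx lock; names only
mirror the traces, the layout is made up), so a single long-lived holder of
tc_lock stalls every one of these taskqueue threads:

#include <pthread.h>

/* Per-CPU open-transaction bookkeeping; layout is mine, names mirror the traces. */
struct toy_tx_cpu {
	pthread_mutex_t	tc_lock;
	pthread_cond_t	tc_cv;
	int		tc_count;	/* transactions still open against this txg */
};

/*
 * Completion side: drop the hold taken when the tx was assigned.  If some
 * other thread is sitting on tc_lock, the whole zio_done() chain that
 * called us stays blocked right here.
 */
static void
toy_rele_to_sync(struct toy_tx_cpu *tc)
{
	pthread_mutex_lock(&tc->tc_lock);
	if (--tc->tc_count == 0)
		pthread_cond_broadcast(&tc->tc_cv);	/* let the quiesce/sync side proceed */
	pthread_mutex_unlock(&tc->tc_lock);
}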
Further, it looks like tc_lock instances are held by the following threads:

64998 101921 pkg              initial thread   mi_switch+0x186 sleepq_wait+0x42 _sx_xlock_hard+0x426 _sx_xlock+0x51 txg_delay+0x9d dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154 dmu_tx_assign+0x370 zfs_freebsd_create+0x310 VOP_CREATE_APV+0x31 vn_open_cred+0x4b7 kern_openat+0x20a amd64_syscall+0x540 Xfast_syscall+0xf7
66152 102491 pkg              initial thread   mi_switch+0x186 sleepq_wait+0x42 _sx_xlock_hard+0x426 _sx_xlock+0x51 txg_delay+0x9d dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154 dmu_tx_assign+0x370 zfs_freebsd_write+0x45b VOP_WRITE_APV+0xb2 vn_write+0x37e vn_io_fault+0x90 dofilewrite+0x85 kern_writev+0x6c sys_write+0x64 amd64_syscall+0x540 Xfast_syscall+0xf7
75803 101638 find             -                mi_switch+0x186 sleepq_wait+0x42 _sx_xlock_hard+0x426 _sx_xlock+0x51 txg_delay+0x9d dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154 dmu_tx_assign+0x370 zfs_inactive+0x1b7 zfs_freebsd_inactive+0x1a vinactive+0x86 vputx+0x2d8 sys_fchdir+0x356 amd64_syscall+0x540 Xfast_syscall+0xf7
75809 102932 find             -                mi_switch+0x186 sleepq_wait+0x42 _sx_xlock_hard+0x426 _sx_xlock+0x51 txg_delay+0x9d dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154 dmu_tx_assign+0x370 zfs_inactive+0x1b7 zfs_freebsd_inactive+0x1a vinactive+0x86 vputx+0x2d8 sys_fchdir+0x356 amd64_syscall+0x540 Xfast_syscall+0xf7
75813 101515 find             -                mi_switch+0x186 sleepq_wait+0x42 _sx_xlock_hard+0x426 _sx_xlock+0x51 txg_delay+0x9d dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154 dmu_tx_assign+0x370 zfs_inactive+0x1b7 zfs_freebsd_inactive+0x1a vinactive+0x86 vputx+0x2d8 sys_fchdir+0x356 amd64_syscall+0x540 Xfast_syscall+0xf7
77468 101412 update-mime-databas initial thread mi_switch+0x186 sleepq_wait+0x42 _sx_xlock_hard+0x426 _sx_xlock+0x51 txg_delay+0x9d dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154 dmu_tx_assign+0x370 zfs_freebsd_write+0x45b VOP_WRITE_APV+0xb2 vn_write+0x37e vn_io_fault+0x90 dofilewrite+0x85 kern_writev+0x6c sys_write+0x64 amd64_syscall+0x540 Xfast_syscall+0xf7

These threads call txg_delay(), also because of the high load.  In the code
we see that dmu_tx_assign() first grabs an instance of tc_lock and then
calls dsl_dir_tempreserve_space().  We also see that txg_delay() tries to
acquire tx_sync_lock, and that is where all these threads are blocked.

Then we see that txg_sync_thread holds tx_sync_lock, but it is in its turn
blocked waiting for its zio:

 1552 100544 zfskern          txg_thread_enter mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x112 zio_wait+0x61 dbuf_read+0x5e5 dmu_buf_hold+0xe0 zap_lockdir+0x58 zap_lookup_norm+0x45 zap_lookup+0x2e feature_get_refcount+0x4b spa_feature_is_active+0x52 dsl_scan_active+0x63 txg_sync_thread+0x20d fork_exit+0x11f fork_trampoline+0xe

So, a summary.  For some completed zio-s their zio_done routines are blocked
in dmu_sync_late_arrival_done -> txg_rele_to_sync -> tc_lock.  The tc_locks
are held by threads in dmu_tx_assign -> .. -> txg_delay, where txg_delay is
blocked on tx_sync_lock.  tx_sync_lock is held by txg_sync_thread, which
waits for its zio to be processed.  That zio is held on the queue and is not
getting processed because the vdev already has too many pending/in-progress
zio-s.

This theory looks plausible to me, but I'd like to hear what the experts
think.  An even more important question is how this situation can be
avoided.

-- 
Andriy Gapon
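P.S.  For anyone who wants to poke at the shape of this cycle outside the
kernel, below is a stand-alone toy of mine (plain pthread mutexes and a
condition variable standing in for the ZFS locks and the zio cv; it
collapses "the sync thread's zio is stuck behind the saturated vdev queue"
into "the completion that would wake it is itself stuck", which is the same
circular wait in miniature).  Started in the order the traces suggest, the
three threads block on each other and never exit.

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t tc_lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t tx_sync_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t io_lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  io_cv        = PTHREAD_COND_INITIALIZER;
static int io_done;

/* A writer in dmu_tx_assign(): holds tc_lock, then throttles in txg_delay(). */
static void *
writer(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&tc_lock);		/* txg_hold_open(): kept held */
	pthread_mutex_lock(&tx_sync_lock);	/* txg_delay(): blocks, sync thread owns it */
	pthread_mutex_unlock(&tx_sync_lock);
	pthread_mutex_unlock(&tc_lock);
	return (NULL);
}

/* A zio_write_intr_high thread: needs tc_lock to commit the late-arrival tx. */
static void *
completion(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&tc_lock);		/* txg_rele_to_sync(): blocks behind writer */
	pthread_mutex_lock(&io_lock);
	io_done = 1;				/* never reached: the sync thread starves */
	pthread_cond_signal(&io_cv);
	pthread_mutex_unlock(&io_lock);
	pthread_mutex_unlock(&tc_lock);
	return (NULL);
}

/* txg_sync_thread(): owns tx_sync_lock while waiting for its zio. */
static void *
sync_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&tx_sync_lock);
	pthread_mutex_lock(&io_lock);
	while (!io_done)			/* zio_wait(): sleeps on the zio's cv */
		pthread_cond_wait(&io_cv, &io_lock);
	pthread_mutex_unlock(&io_lock);
	pthread_mutex_unlock(&tx_sync_lock);
	return (NULL);
}

int
main(void)
{
	pthread_t s, w, c;

	/* The sleeps just force the interleaving seen in the traces. */
	pthread_create(&s, NULL, sync_thread, NULL);
	sleep(1);
	pthread_create(&w, NULL, writer, NULL);
	sleep(1);
	pthread_create(&c, NULL, completion, NULL);
	pthread_join(c, NULL);			/* never returns: three-way circular wait */
	return (0);
}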