Date: Mon, 29 Aug 2011 22:54:07 +0200
From: Martin Matuska <mm@FreeBSD.org>
To: Artem Belevich <art@freebsd.org>
Cc: freebsd-fs@freebsd.org, tech@hybrid-logic.co.uk, luke@hybrid-logic.co.uk
Subject: Re: ZFS hang in production on 8.2-RELEASE
Message-ID: <4E5BFC6F.5080507@FreeBSD.org>
In-Reply-To: <CAFqOu6gHvwxiOkFZ0Enh3VRHcs3aD=gH4u_6=XuhfYXg5NnkpQ@mail.gmail.com>
References: <1314646728.7898.44.camel@pow> <CAFqOu6gHvwxiOkFZ0Enh3VRHcs3aD=gH4u_6=XuhfYXg5NnkpQ@mail.gmail.com>

On 29. 8. 2011 21:55, Artem Belevich wrote:
> On Mon, Aug 29, 2011 at 12:38 PM, Luke Marsden
> <luke-lists@hybrid-logic.co.uk> wrote:
>> Hi all,
>>
>> I've just noticed a "partial" ZFS deadlock in production on 8.2-RELEASE.
>>
>> FreeBSD XXX 8.2-RELEASE FreeBSD 8.2-RELEASE #0 r219081M: Wed Mar 2
>> 08:29:52 CET 2011 root@www4:/usr/obj/usr/src/sys/GENERIC amd64
>>
>> There are 9 'zfs rename' processes and 1 'zfs umount -f' processes hung.
>> Here is the procstat for the 'zfs umount -f':
>>
>> 13451 104337 zfs - mi_switch+0x176
>> sleepq_wait+0x42 _sleep+0x317 zfsvfs_teardown+0x269 zfs_umount+0x1c4
>> dounmount+0x32a unmount+0x38b syscallenter+0x1e5 syscall+0x4b
>> Xfast_syscall+0xe2
>>
>> And the 'zfs rename's all look the same:
>>
>> 20361 101049 zfs - mi_switch+0x176
>> sleepq_wait+0x42 __lockmgr_args+0x743 vop_stdlock+0x39 VOP_LOCK1_APV
>> +0x46 _vn_lock+0x47 lookup+0x6e1 namei+0x53a kern_rmdirat+0xa4
>> syscallenter+0x1e5 syscall+0x4b Xfast_syscall+0xe2
>>
>> An 'ls' on a directory which contains most of the system's ZFS
>> mount-points (/hcfs) also hangs:
>>
>> 30073 101466 gnuls - mi_switch+0x176
>> sleepq_wait+0x42 __lockmgr_args+0x743 vop_stdlock+0x39 VOP_LOCK1_APV
>> +0x46 _vn_lock+0x47 zfs_root+0x85 lookup+0x9b8 namei+0x53a vn_open_cred
>> +0x3ac kern_openat+0x181 syscallenter+0x1e5 syscall+0x4b Xfast_syscall
>> +0xe2
>>
>> If I truss the 'ls' it hangs on the stat syscall:
>> stat("/hcfs",{ mode=drwxr-xr-x ,inode=3,size=2012,blksize=16384 }) = 0
>> (0x0)
>>
>> There is also a 'find -s / ! ( -fstype zfs ) -prune -or -path /tmp
>> -prune -or -path /usr/tmp -prune -or -path /var/tmp -prune -or
>> -path /var/db/portsnap -prune -or -print' running which is also hung:
>>
>> 2650 101674 find - mi_switch+0x176
>> sleepq_wait+0x42 __lockmgr_args+0x743 vop_stdlock+0x39 VOP_LOCK1_APV
>> +0x46 _vn_lock+0x47 zfs_root+0x85 lookup+0x9b8 namei+0x53a vn_open_cred
>> +0x3ac kern_openat+0x181 syscallenter+0x1e5 syscall+0x4b Xfast_syscall
>> +0xe2
>>
>> However I/O to the presently mounted filesystems continues to work (even
>> on parts of filesystems which are unlikely to be cached), and 'zfs list'
>> showing all the filesystems (3,500 filesystems with ~100 snapshots per
>> filesystem) also works.
>>
>> Any activity on the structure of the ZFS hierarchy *under the hcfs
>> filesystem* crashes, such as a 'zfs create hpool/hcfs/test':
>>
>> 70868 101874 zfs - mi_switch+0x176
>> sleepq_wait+0x42 __lockmgr_args+0x743 vop_stdlock+0x39 VOP_LOCK1_APV
>> +0x46 _vn_lock+0x47 lookup+0x6e1 namei+0x53a kern_mkdirat+0xce
>> syscallenter+0x1e5 syscall+0x4b Xfast_syscall+0xe2
>>
>> BUT "zfs create hpool/system/opt/hello" (a ZFS filesystem in the same
>> pool, but not rooted on hpool/hcfs) does not hang, and succeeds
>> normally.
>>
>> procstat -kk on the zfskern process gives:
>>
>> PID TID COMM TDNAME
>> KSTACK
>> 5 100045 zfskern arc_reclaim_thre mi_switch+0x176
>> sleepq_timedwait+0x42 _cv_timedwait+0x134 arc_reclaim_thread+0x2a9
>> fork_exit+0x118 fork_trampoline+0xe
>> 5 100046 zfskern l2arc_feed_threa mi_switch+0x176
>> sleepq_timedwait+0x42 _cv_timedwait+0x134 l2arc_feed_thread+0x1ce
>> fork_exit+0x118 fork_trampoline+0xe
>> 5 100098 zfskern txg_thread_enter mi_switch+0x176
>> sleepq_wait+0x42 _cv_wait+0x129 txg_thread_wait+0x79 txg_quiesce_thread
>> +0xb5 fork_exit+0x118 fork_trampoline+0xe
>> 5 100099 zfskern txg_thread_enter mi_switch+0x176
>> sleepq_timedwait+0x42 _cv_timedwait+0x134 txg_thread_wait+0x3c
>> txg_sync_thread+0x365 fork_exit+0x118 fork_trampoline+0xe
>>
>> Any ideas on what might be causing this?
> It sounds like the bug Martin Matuska has recently fixed in FreeBSD
> and reported upstream to Illumos:
> https://www.illumos.org/issues/1313
>
> The fix has been MFC'ed to 8-STABLE r224647 on Aug 4th.
>
> --Artem

No, I think this is more likely fixed by pjd's bugfix in r224791
(MFC'ed to stable/8 as r225100). The corresponding patch is:
http://people.freebsd.org/~pjd/patches/zfsdev_state_lock.patch

-- 
Martin Matuska
FreeBSD committer
http://blog.vx.sk
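
To see whether a given kernel predates the stable/8 merge named above (r225100), the revision embedded in the `uname -v` string can be compared against it. The following is only a rough sketch under stated assumptions: it assumes the kernel was built from stable/8 (a higher revision number on another branch, such as a release branch, says nothing about whether the fix is present), and that the version string carries an SVN revision in the `r219081M` form shown in the report. The function names `parse_svn_revision` and `predates_fix` are hypothetical helpers, not part of any FreeBSD tool.

```python
import re
from typing import Optional

# Revision of the stable/8 merge mentioned in the reply (r225100).
FIX_REVISION_STABLE_8 = 225100


def parse_svn_revision(uname_v: str) -> Optional[int]:
    """Extract the SVN revision (e.g. 219081 from 'r219081M:') or None.

    A trailing 'M' marks a locally modified tree and is ignored here.
    """
    match = re.search(r"\br(\d+)M?(?=[:\s]|$)", uname_v)
    return int(match.group(1)) if match else None


def predates_fix(uname_v: str,
                 fix_revision: int = FIX_REVISION_STABLE_8) -> Optional[bool]:
    """Return True if the parsed revision is lower than fix_revision.

    Returns None when no revision can be found. The comparison is only
    meaningful for kernels built from stable/8 (assumption, see above).
    """
    revision = parse_svn_revision(uname_v)
    if revision is None:
        return None
    return revision < fix_revision


if __name__ == "__main__":
    # Version string copied from the original report.
    reported = ("FreeBSD 8.2-RELEASE #0 r219081M: Wed Mar 2 08:29:52 CET 2011 "
                "root@www4:/usr/obj/usr/src/sys/GENERIC amd64")
    print(parse_svn_revision(reported), predates_fix(reported))
```

The reported kernel, at r219081, is older than both revisions named in this thread, so on its own this check cannot tell which of the two fixes applies; it only shows that neither is included.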