Date:      Tue, 15 Dec 2015 16:32:39 -0500 (EST)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Steven Hartland <killing@multiplay.co.uk>
Cc:        Bengt Ahlgren <bengta@sics.se>, freebsd-fs@freebsd.org
Subject:   Re: ZFS hang in zfs_freebsd_rename
Message-ID:  <865572400.133527790.1450215159693.JavaMail.zimbra@uoguelph.ca>
In-Reply-To: <56702A9F.90702@multiplay.co.uk>
References:  <uh7a8pbj2mo.fsf@P142s.sics.se> <567022FB.1010508@multiplay.co.uk> <uh7vb7zhihv.fsf@P142s.sics.se> <56702A9F.90702@multiplay.co.uk>

I'm not a ZFS guy, but I vaguely recall that renaming of snapshots
can (or at least could, I don't know if it has been fixed) cause
hung threads due to lock ordering issues.

So, if by any chance you are renaming snapshots, you might want to
avoid doing that.
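
To be clear, what I mean is an explicit "zfs rename" of a snapshot,
something like this (made-up dataset/snapshot names, just to show the
operation in question):

    zfs rename p2/somedataset@snap-old p2/somedataset@snap-new

Ordinary file/directory renames inside a dataset go through a different
code path, but if any snapshot renames were in flight at the time, that
would be the first thing I'd look at.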

rick

----- Original Message -----
> There have been quite a few reported issues with this; some at least have
> been fixed, but as with anything the only way to be sure is to test it.
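> 
> If you want a crude way to exercise the rename path on a scratch
> dataset before trusting it, something along these lines (made-up
> dataset name, adjust the mountpoint to match) run from a few shells in
> parallel is a rough sketch, not a definitive reproducer:
> 
>     zfs create p2/renametest
>     cd /p2/renametest && mkdir -p a b
>     while :; do touch a/f; mv a/f b/f; mv b/f a/f; done
> 
> Add a snapshot create/destroy loop alongside it if you also want to
> poke at the snapshot-related rename issues.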
> 
> On 15/12/2015 14:52, Bengt Ahlgren wrote:
> > Yes, that is on the todo list...
> >
> > So this is likely fixed then in 10.x?
> >
> > Bengt
> >
> > Steven Hartland <killing@multiplay.co.uk> writes:
> >
> >> Not a surprise in 9.x unfortunately, try upgrading to 10.x
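> >>
> >> With binary updates that's roughly (10.2 being the newest 10.x release
> >> as I write this; double-check the target version for yourself):
> >>
> >>     freebsd-update -r 10.2-RELEASE upgrade
> >>     freebsd-update install
> >>     shutdown -r now
> >>     freebsd-update install    # run again after the reboot for userland
> >>
> >> and hold off on "zpool upgrade" until you're sure you won't need to
> >> boot the old 9.3 kernel again.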
> >>
> >> On 15/12/2015 12:51, Bengt Ahlgren wrote:
> >>> We have a server running 9.3-REL which currently has two quite large zfs
> >>> pools:
> >>>
> >>> NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
> >>> p1    18.1T  10.7T  7.38T    59%  1.00x  ONLINE  -
> >>> p2    43.5T  29.1T  14.4T    66%  1.00x  ONLINE  -
> >>>
> >>> It has been running without any issues for some time.  Just now,
> >>> processes started getting stuck and became impossible to kill when
> >>> accessing a particular directory in the p2 pool.  That pool is a
> >>> 2x6 disk raidz2.
> >>>
> >>> One process is stuck in zfs_freebsd_rename, and other processes
> >>> accessing that particular directory also get stuck.  The system is now
> >>> almost completely idle.
> >>>
> >>> Output from kgdb on the running system for that first process:
> >>>
> >>> Thread 651 (Thread 102157):
> >>> #0  sched_switch (td=0xfffffe0b14059920, newtd=0xfffffe001633e920,
> >>> flags=<value optimized out>)
> >>>       at /usr/src/sys/kern/sched_ule.c:1904
> >>> #1  0xffffffff808f4604 in mi_switch (flags=260, newtd=0x0) at
> >>> /usr/src/sys/kern/kern_synch.c:485
> >>> #2 0xffffffff809308e2 in sleepq_wait (wchan=0xfffffe0135b60488,
> >>> pri=96) at /usr/src/sys/kern/subr_sleepqueue.c:618
> >>> #3  0xffffffff808cf922 in __lockmgr_args (lk=0xfffffe0135b60488,
> >>> flags=524544, ilk=0xfffffe0135b604b8,
> >>>       wmesg=<value optimized out>, pri=<value optimized out>, timo=<value
> >>>       optimized out>,
> >>>       file=0xffffffff80f0d782 "/usr/src/sys/kern/vfs_subr.c", line=2337)
> >>>       at /usr/src/sys/kern/kern_lock.c:221
> >>> #4  0xffffffff80977369 in vop_stdlock (ap=<value optimized out>) at
> >>> lockmgr.h:97
> >>> #5  0xffffffff80dd4a04 in VOP_LOCK1_APV (vop=0xffffffff813e8160,
> >>> a=0xffffffa07f935520) at vnode_if.c:2052
> >>> #6  0xffffffff80998c17 in _vn_lock (vp=0xfffffe0135b603f0, flags=524288,
> >>>       file=0xffffffff80f0d782 "/usr/src/sys/kern/vfs_subr.c", line=2337)
> >>>       at vnode_if.h:859
> >>> #7  0xffffffff8098b621 in vputx (vp=0xfffffe0135b603f0, func=1) at
> >>> /usr/src/sys/kern/vfs_subr.c:2337
> >>> #8  0xffffffff81ac7955 in zfs_rename_unlock (zlpp=0xffffffa07f9356b8)
> >>>       at
> >>>       /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:3609
> >>> #9  0xffffffff81ac8c72 in zfs_freebsd_rename (ap=<value optimized out>)
> >>>       at
> >>>       /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:4039
> >>> #10 0xffffffff80dd4f04 in VOP_RENAME_APV (vop=0xffffffff81b47d40,
> >>> a=0xffffffa07f9358e0) at vnode_if.c:1522
> >>> #11 0xffffffff80996bbd in kern_renameat (td=<value optimized out>,
> >>> oldfd=<value optimized out>,
> >>>       old=<value optimized out>, newfd=-100, new=0x1826a9af00 <Error
> >>>       reading address 0x1826a9af00: Bad address>,
> >>>       pathseg=<value optimized out>) at vnode_if.h:636
> >>> #12 0xffffffff80cd228a in amd64_syscall (td=0xfffffe0b14059920, traced=0)
> >>> at subr_syscall.c:135
> >>> #13 0xffffffff80cbc907 in Xfast_syscall () at
> >>> /usr/src/sys/amd64/amd64/exception.S:396
> >>> #14 0x0000000800cc1acc in ?? ()
> >>> Previous frame inner to this frame (corrupt stack?)
> >>>
> >>> Full procstat -kk -a and kgdb "thread apply all bt" can be found here:
> >>>
> >>> https://www.sics.se/~bengta/ZFS-hang/
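> >>>
> >>> (For reference, those were gathered on the live system with roughly
> >>> the following; the output file name is just an example and the kernel
> >>> path is the stock one:
> >>>
> >>>     procstat -kk -a > procstat-kk-a.txt
> >>>     kgdb /boot/kernel/kernel /dev/mem    # then: thread apply all bt
> >>>
> >>> in case anyone wants the same data from their own system.)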
> >>>
> >>> I don't know how to produce "alltrace in ddb" as the instructions in the
> >>> wiki say.  It runs the GENERIC kernel, so perhaps it isn't possible?
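> >>>
> >>> (My understanding from the wiki is that it would be roughly the
> >>> following, done on the console, but it only works if the kernel was
> >>> built with options KDB and DDB, which I suspect the stock 9.3 GENERIC
> >>> is not:
> >>>
> >>>     sysctl debug.kdb.enter=1    # drops to the db> prompt on the console
> >>>     db> alltrace
> >>>     db> continue
> >>>
> >>> hence the question.)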
> >>>
> >>> I checked "camcontrol tags" for all the disks in the pool - all have
> >>> zeroes for dev_active, devq_queued and held.
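> >>>
> >>> (That check was simply, for each disk in the pool -- the device name
> >>> here is just an example:
> >>>
> >>>     camcontrol tags da0 -v
> >>>
> >>> so nothing appears to be outstanding at the disk level.)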
> >>>
> >>> Is there anything else I can check while the machine is up?  I however
> >>> need to restart it pretty soon.
> >>>
> >>> Bengt
> 
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
> 


