Date:      Tue, 15 Dec 2015 23:06:13 +0100
From:      Bengt Ahlgren <bengta@sics.se>
To:        Rick Macklem <rmacklem@uoguelph.ca>
Cc:        Steven Hartland <killing@multiplay.co.uk>, freebsd-fs@freebsd.org
Subject:   Re: ZFS hang in zfs_freebsd_rename
Message-ID:  <uh7mvtbwene.fsf@P142s.sics.se>
In-Reply-To: <865572400.133527790.1450215159693.JavaMail.zimbra@uoguelph.ca>
	(Rick Macklem's message of "Tue, 15 Dec 2015 16:32:39 -0500 (EST)")
References:  <uh7a8pbj2mo.fsf@P142s.sics.se>
	<567022FB.1010508@multiplay.co.uk>
	<uh7vb7zhihv.fsf@P142s.sics.se>
	<56702A9F.90702@multiplay.co.uk>
	<865572400.133527790.1450215159693.JavaMail.zimbra@uoguelph.ca>
The pool has a few snapshots, but no renaming of them took place any
time recently.  This was the renaming of a regular file.

Bengt

Rick Macklem <rmacklem@uoguelph.ca> writes:

> I'm not a ZFS guy, but I vaguely recall that renaming of snapshots
> can (or at least could, I don't know if it has been fixed) cause
> hung threads due to lock ordering issues.
>
> So, if by any chance you are renaming snapshots, you might want to
> avoid doing that.
>
> rick
>
> ----- Original Message -----
>> There have been quite a few reported issues with this; some at least
>> have been fixed, but as with anything the only way to be sure is to
>> test it.
>>
>> On 15/12/2015 14:52, Bengt Ahlgren wrote:
>> > Yes, that is on the todo list...
>> >
>> > So this is likely fixed then in 10.x?
>> >
>> > Bengt
>> >
>> > Steven Hartland <killing@multiplay.co.uk> writes:
>> >
>> >> Not a surprise in 9.x unfortunately, try upgrading to 10.x
>> >>
>> >> On 15/12/2015 12:51, Bengt Ahlgren wrote:
>> >>> We have a server running 9.3-REL which currently has two quite
>> >>> large zfs pools:
>> >>>
>> >>> NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
>> >>> p1    18.1T  10.7T  7.38T    59%  1.00x  ONLINE  -
>> >>> p2    43.5T  29.1T  14.4T    66%  1.00x  ONLINE  -
>> >>>
>> >>> It has been running without any issues for some time now.  Just
>> >>> now, processes are getting stuck and impossible to kill when
>> >>> accessing a particular directory in the p2 pool.  That pool is a
>> >>> 2x6 disk raidz2.
>> >>>
>> >>> One process is stuck in zfs_freebsd_rename, and other processes
>> >>> accessing that particular directory also get stuck.  The system
>> >>> is now almost completely idle.
>> >>>
>> >>> Output from kgdb on the running system for that first process:
>> >>>
>> >>> Thread 651 (Thread 102157):
>> >>> #0  sched_switch (td=0xfffffe0b14059920, newtd=0xfffffe001633e920, flags=<value optimized out>)
>> >>>     at /usr/src/sys/kern/sched_ule.c:1904
>> >>> #1  0xffffffff808f4604 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:485
>> >>> #2  0xffffffff809308e2 in sleepq_wait (wchan=0xfffffe0135b60488, pri=96) at /usr/src/sys/kern/subr_sleepqueue.c:618
>> >>> #3  0xffffffff808cf922 in __lockmgr_args (lk=0xfffffe0135b60488, flags=524544, ilk=0xfffffe0135b604b8,
>> >>>     wmesg=<value optimized out>, pri=<value optimized out>, timo=<value optimized out>,
>> >>>     file=0xffffffff80f0d782 "/usr/src/sys/kern/vfs_subr.c", line=2337) at /usr/src/sys/kern/kern_lock.c:221
>> >>> #4  0xffffffff80977369 in vop_stdlock (ap=<value optimized out>) at lockmgr.h:97
>> >>> #5  0xffffffff80dd4a04 in VOP_LOCK1_APV (vop=0xffffffff813e8160, a=0xffffffa07f935520) at vnode_if.c:2052
>> >>> #6  0xffffffff80998c17 in _vn_lock (vp=0xfffffe0135b603f0, flags=524288,
>> >>>     file=0xffffffff80f0d782 "/usr/src/sys/kern/vfs_subr.c", line=2337) at vnode_if.h:859
>> >>> #7  0xffffffff8098b621 in vputx (vp=0xfffffe0135b603f0, func=1) at /usr/src/sys/kern/vfs_subr.c:2337
>> >>> #8  0xffffffff81ac7955 in zfs_rename_unlock (zlpp=0xffffffa07f9356b8)
>> >>>     at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:3609
>> >>> #9  0xffffffff81ac8c72 in zfs_freebsd_rename (ap=<value optimized out>)
>> >>>     at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:4039
>> >>> #10 0xffffffff80dd4f04 in VOP_RENAME_APV (vop=0xffffffff81b47d40, a=0xffffffa07f9358e0) at vnode_if.c:1522
>> >>> #11 0xffffffff80996bbd in kern_renameat (td=<value optimized out>, oldfd=<value optimized out>,
>> >>>     old=<value optimized out>, newfd=-100, new=0x1826a9af00 <Error reading address 0x1826a9af00: Bad address>,
>> >>>     pathseg=<value optimized out>) at vnode_if.h:636
>> >>> #12 0xffffffff80cd228a in amd64_syscall (td=0xfffffe0b14059920, traced=0) at subr_syscall.c:135
>> >>> #13 0xffffffff80cbc907 in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:396
>> >>> #14 0x0000000800cc1acc in ?? ()
>> >>> Previous frame inner to this frame (corrupt stack?)
>> >>>
>> >>> Full procstat -kk -a and kgdb "thread apply all bt" output can be
>> >>> found here:
>> >>>
>> >>> https://www.sics.se/~bengta/ZFS-hang/
>> >>>
>> >>> I don't know how to produce "alltrace in ddb" as the instructions
>> >>> in the wiki say.  It runs the GENERIC kernel, so perhaps it isn't
>> >>> possible?
>> >>>
>> >>> I checked "camcontrol tags" for all the disks in the pool - all
>> >>> have zeroes for dev_active, devq_queued and held.
>> >>>
>> >>> Is there anything else I can check while the machine is up?  I
>> >>> however need to restart it pretty soon.
>> >>>
>> >>> Bengt
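The stack above shows the rename path sleeping in __lockmgr_args on a
vnode lock while unwinding in zfs_rename_unlock, which is the typical
shape of a lock-ordering problem: two operations that need the same
pair of locks but take them in opposite order can each end up waiting
for the lock the other holds.  Below is a minimal user-space sketch of
that pattern and of the usual fix of always acquiring the pair in one
fixed order.  The mutexes, the two "rename" workers and the by-address
ordering are illustrative assumptions only, not the actual ZFS or VFS
locking code.

/*
 * Minimal user-space sketch (NOT FreeBSD kernel code) of an ABBA
 * lock-order problem and the fixed-order discipline that avoids it.
 * All names here are made up for illustration.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t dir_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t dir_b = PTHREAD_MUTEX_INITIALIZER;

/* Always lock the two "directory" locks in a stable (by-address) order. */
static void
lock_pair(pthread_mutex_t *x, pthread_mutex_t *y)
{
	if (x > y) {
		pthread_mutex_t *t = x; x = y; y = t;
	}
	pthread_mutex_lock(x);
	pthread_mutex_lock(y);
}

static void
unlock_pair(pthread_mutex_t *x, pthread_mutex_t *y)
{
	pthread_mutex_unlock(x);
	pthread_mutex_unlock(y);
}

/* "rename" from a to b -- naturally wants a first, then b. */
static void *
worker_ab(void *arg)
{
	(void)arg;
	lock_pair(&dir_a, &dir_b);
	puts("a->b rename done");
	unlock_pair(&dir_a, &dir_b);
	return (NULL);
}

/* "rename" from b to a -- naturally wants b first, then a. */
static void *
worker_ba(void *arg)
{
	(void)arg;
	lock_pair(&dir_b, &dir_a);
	puts("b->a rename done");
	unlock_pair(&dir_b, &dir_a);
	return (NULL);
}

int
main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, worker_ab, NULL);
	pthread_create(&t2, NULL, worker_ba, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return (0);
}

Built with something like "cc -o lockorder lockorder.c -lpthread", both
workers complete because lock_pair() imposes one global order.  If it
instead locked its arguments in the order given, the two workers could
block on each other indefinitely, which is the same general failure
mode as the lock-ordering hangs discussed in this thread.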