Date: Mon, 13 Oct 2014 01:54:05 +0100
From: "Steven Hartland" <killing@multiplay.co.uk>
To: "Mark Martinec" <Mark.Martinec+freebsd@ijs.si>, <freebsd-stable@freebsd.org>, <freebsd-fs@freebsd.org>
Subject: Re: zfs pool import hangs on [tx->tx_sync_done_cv]
Message-ID: <A5BA41116A7F4B23A9C9E469C4146B99@multiplay.co.uk>
References: <54372173.1010100@ijs.si> <644FA8299BF848E599B82D2C2C298EA7@multiplay.co.uk> <54372EBA.1000908@ijs.si> <DE7DD7A94E9B4F1FBB3AFF57EDB47C67@multiplay.co.uk> <543731F3.8090701@ijs.si> <543AE740.7000808@ijs.si>
I have reproduced the issue with your image. It seems the import is waiting for a transaction group to complete; it looks like it's waiting on the zio from dsl_pool_sync_mos, which for some reason never returns.

The pool does seem to be happy importing read-only with:

    zpool import -N -f -o readonly=on sys1boot

print dp->dp_tx
$3 = {
  tx_cpu = 0xfffffe0004a05000,
  tx_sync_lock = {
    lock_object = {
      lo_name = 0xffffffff815f2941 "tx->tx_sync_lock",
      lo_flags = 40960000,
      lo_data = 0,
      lo_witness = 0x0
    },
    sx_lock = 1
  },
  tx_open_txg = 11733519,
  tx_quiesced_txg = 0,
  tx_syncing_txg = 11733518,
  tx_synced_txg = 0,
  tx_open_time = 120562502889,
  tx_sync_txg_waiting = 11733518,
  tx_quiesce_txg_waiting = 11733519,
  tx_sync_more_cv = {
    cv_description = 0xffffffff815f2953 "tx->tx_sync_more_cv",
    cv_waiters = 0
  },
  tx_sync_done_cv = {
    cv_description = 0xffffffff815f2968 "tx->tx_sync_done_cv",
    cv_waiters = 1
  },
  tx_quiesce_more_cv = {
    cv_description = 0xffffffff815f297d "tx->tx_quiesce_more_cv",
    cv_waiters = 1
  },
  tx_quiesce_done_cv = {
    cv_description = 0xffffffff815f2995 "tx->tx_quiesce_done_cv",
    cv_waiters = 0
  },
  tx_timeout_cv = {
    cv_description = 0x0,
    cv_waiters = 0
  },
  tx_exit_cv = {
    cv_description = 0xffffffff815f29ad "tx->tx_exit_cv",
    cv_waiters = 0
  },
  tx_threads = 2 '\002',
  tx_exiting = 0 '\0',
  tx_sync_thread = 0xfffff80041505490,
  tx_quiesce_thread = 0xfffff80041509920,
  tx_commit_cb_taskq = 0x0
}
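The interesting bits there: tx_syncing_txg = 11733518 with tx_synced_txg = 0, and one waiter on tx_sync_done_cv, so the very first txg the import tries to sync never completes, which matches the zio_wait() in the sync thread below. If anyone wants to dig further, the next step in kgdb would be along these lines (a sketch only - zio shows as <value optimized out> in the zio_wait frame, so it may have to be recovered from the dsl_pool_sync frame instead, assuming its local is visible and still named zio there):

    (kgdb) thread 512        # the stuck txg_sync_thread (tid 100338, stacks below)
    (kgdb) frame 6           # dsl_pool_sync(), where the root zio was created
    (kgdb) print *zio        # io_stage, io_error, io_children, ...
    (kgdb) print zio->io_vd  # which vdev, if any, it is waiting on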
Relevant threads:

#0  sched_switch (td=0xfffff800446bc000, newtd=<value optimized out>, flags=<value optimized out>)
    at /usr/src/sys/kern/sched_ule.c:1945
#1  0xffffffff806110d9 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:494
#2  0xffffffff8064ddc2 in sleepq_switch (wchan=<value optimized out>, pri=<value optimized out>)
    at /usr/src/sys/kern/subr_sleepqueue.c:538
#3  0xffffffff8064dc23 in sleepq_wait (wchan=0xfffff80021f4b220, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:617
#4  0xffffffff805b7c1a in _cv_wait (cvp=0xfffff80021f4b220, lock=0xfffff80021f4b1b8) at /usr/src/sys/kern/kern_condvar.c:139
#5  0xffffffff8151ec4b in txg_wait_synced (dp=0xfffff80021f4b000, txg=<value optimized out>)
    at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c:631
#6  0xffffffff81510937 in spa_load (spa=0xfffffe0003fe1000, state=SPA_LOAD_IMPORT, type=<value optimized out>, mosconfig=<value optimized out>)
    at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:2777
#7  0xffffffff8150e5df in spa_load_best (spa=0xfffffe0003fe1000, state=SPA_LOAD_IMPORT, mosconfig=1, max_request=<value optimized out>, rewind_flags=1)
    at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:2873
#8  0xffffffff8150e023 in spa_import (pool=0xfffffe0002e2a000 "sys1boot", config=0xfffff80044f24ba0, props=0x0, flags=<value optimized out>)
    at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:4207
#9  0xffffffff81566d78 in zfs_ioc_pool_import (zc=0xfffffe0002e2a000)
    at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c:1594
#10 0xffffffff81563c4f in zfsdev_ioctl (dev=<value optimized out>, zcmd=<value optimized out>, arg=<value optimized out>, flag=<value optimized out>, td=<value optimized out>)
    at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c:6158
#11 0xffffffff805299eb in devfs_ioctl_f (fp=0xfffff8004f020aa0, com=3222821378, data=0xfffff80038123e60, cred=<value optimized out>, td=0xfffff800446bc000)
    at /usr/src/sys/fs/devfs/devfs_vnops.c:757
#12 0xffffffff8066201b in kern_ioctl (td=<value optimized out>, fd=<value optimized out>, com=<value optimized out>) at file.h:320
#13 0xffffffff80661d9c in sys_ioctl (td=0xfffff800446bc000, uap=0xfffffe011e49fa40) at /usr/src/sys/kern/sys_generic.c:702
#14 0xffffffff8088defa in amd64_syscall (td=0xfffff800446bc000, traced=0) at subr_syscall.c:134
#15 0xffffffff80870ecb in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:391

Thread 509 (Thread 100533):
#0  sched_switch (td=0xfffff800446bf000, newtd=<value optimized out>, flags=<value optimized out>)
    at /usr/src/sys/kern/sched_ule.c:1945
#1  0xffffffff806110d9 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:494
#2  0xffffffff8064ddc2 in sleepq_switch (wchan=<value optimized out>, pri=<value optimized out>)
    at /usr/src/sys/kern/subr_sleepqueue.c:538
#3  0xffffffff8064e323 in sleepq_timedwait (wchan=0xfffff80044e47210, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:652
#4  0xffffffff805b84f0 in _cv_timedwait_sbt (cvp=0xfffff80044e47210, lock=0xfffff80044e471b8, sbt=<value optimized out>, pr=<value optimized out>, flags=<value optimized out>)
    at /usr/src/sys/kern/kern_condvar.c:325
#5  0xffffffff8151f72e in txg_thread_wait (tx=<value optimized out>, cpr=0xfffffe011dc76a00, cv=0xfffff80044e47210, time=<value optimized out>)
    at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c:249
#6  0xffffffff8151e930 in txg_sync_thread (arg=0xfffff80044e47000)
    at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c:483
#7  0xffffffff805d5bd4 in fork_exit (callout=0xffffffff8151e480 <txg_sync_thread>, arg=0xfffff80044e47000, frame=0xfffffe011dc76ac0)
    at /usr/src/sys/kern/kern_fork.c:996
#8  0xffffffff8087111e in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:606
#9  0x0000000000000000 in ?? ()

Thread 508 (Thread 100532):
#0  sched_switch (td=0xfffff80048204920, newtd=<value optimized out>, flags=<value optimized out>)
    at /usr/src/sys/kern/sched_ule.c:1945
#1  0xffffffff806110d9 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:494
#2  0xffffffff8064ddc2 in sleepq_switch (wchan=<value optimized out>, pri=<value optimized out>)
    at /usr/src/sys/kern/subr_sleepqueue.c:538
#3  0xffffffff8064dc23 in sleepq_wait (wchan=0xfffff80044e47230, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:617
#4  0xffffffff805b7c1a in _cv_wait (cvp=0xfffff80044e47230, lock=0xfffff80044e471b8) at /usr/src/sys/kern/kern_condvar.c:139
#5  0xffffffff8151f73b in txg_thread_wait (tx=<value optimized out>, cpr=0xfffffe011dc6da00, cv=<value optimized out>, time=<value optimized out>)
    at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c:251
#6  0xffffffff8151e450 in txg_quiesce_thread (arg=0xfffff80044e47000)
    at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c:556
#7  0xffffffff805d5bd4 in fork_exit (callout=0xffffffff8151e030 <txg_quiesce_thread>, arg=0xfffff80044e47000, frame=0xfffffe011dc6dac0)
    at /usr/src/sys/kern/kern_fork.c:996
#8  0xffffffff8087111e in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:606
#9  0x0000000000000000 in ?? ()

Thread 512 (Thread 100338):
#0  sched_switch (td=0xfffff80041505490, newtd=<value optimized out>, flags=<value optimized out>)
    at /usr/src/sys/kern/sched_ule.c:1945
#1  0xffffffff806110d9 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:494
#2  0xffffffff8064ddc2 in sleepq_switch (wchan=<value optimized out>, pri=<value optimized out>)
    at /usr/src/sys/kern/subr_sleepqueue.c:538
#3  0xffffffff8064dc23 in sleepq_wait (wchan=0xfffff80044acf320, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:617
#4  0xffffffff805b7c1a in _cv_wait (cvp=0xfffff80044acf320, lock=0xfffff80044acf300) at /usr/src/sys/kern/kern_condvar.c:139
#5  0xffffffff81547c9b in zio_wait (zio=<value optimized out>)
    at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1442
#6  0xffffffff814f056c in dsl_pool_sync (dp=0xfffff80021f4b000, txg=11733518)
    at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c:531
#7  0xffffffff81514030 in spa_sync (spa=0xfffffe0003fe1000, txg=11733518)
    at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:6604
#8  0xffffffff8151e6cd in txg_sync_thread (arg=0xfffff80021f4b000)
    at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c:518
#9  0xffffffff805d5bd4 in fork_exit (callout=0xffffffff8151e480 <txg_sync_thread>, arg=0xfffff80021f4b000, frame=0xfffffe011e6fdac0)
    at /usr/src/sys/kern/kern_fork.c:996
#10 0xffffffff8087111e in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:606
#11 0x0000000000000000 in ?? ()
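In the meantime, given that the read-only import works, it should be possible to get data off the affected pools while this is chased down. A rough sketch only (not tested against your image; the altroot, snapshot and destination names are placeholders):

    # zpool import -N -f -o readonly=on -R /mnt sys1boot
    # zfs mount sys1boot              (repeat for any child datasets you need)
    ... copy files off with cp/rsync, or, if a snapshot already exists:
    # zfs send -R sys1boot@snap | zfs receive -d otherpool/sys1boot-copy

Note that new snapshots cannot be created on a read-only pool, so zfs send is only an option for snapshots that were taken before the problem appeared.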
----- Original Message -----
From: "Mark Martinec" <Mark.Martinec+freebsd@ijs.si>
To: <freebsd-stable@freebsd.org>; <freebsd-fs@freebsd.org>
Sent: Sunday, October 12, 2014 9:40 PM
Subject: Re: zfs pool import hangs on [tx->tx_sync_done_cv]


> I made available an image copy (dd) of my 4 GiB partition
> (compressed down to a 384 MiB file), holding one of my
> partitions (a small bsd root) that can no longer be imported
> into a 10.1-RC1 or 10.1-BETA3, as described in my first
> posting (below):
>
>   http://www.ijs.si/usr/mark/bsd/
>
> I would appreciate it if someone could confirm that such a pool
> (one of several I have with this symptom) causes
> 'zpool import' to hang on 10.1 or 10-STABLE:
>
>   - download, xz -d sys1boot.img.xz
>   # mdconfig -f sys1boot.img
>   # zpool import sys1boot
>
> ... and advise on a solution.
>
> Considering that 'zdb -e -cc' is happy and there was no
> other prior trouble with these pools, except for an upgrade
> to 10.1-BETA3/-RC1 from 10-STABLE as of ca. late September,
> it is my belief that these pools are still healthy, just
> non-importable. I'm reluctant to upgrade any other system
> from 10.0 to 10.1 without finding out what went wrong here.
>
>   Mark
>
>
> On 10/10/2014 03:02, Steven Hartland wrote:
>> Sorry to be a pain but could you attach the full output or link it
>> somewhere as mail has messed up the formatting :(
>
> Now at
>   http://www.ijs.si/usr/mark/bsd/procstat-kka-2074.txt
>
> On 2014-10-10 Mark Martinec wrote:
>> In short, after upgrading to 10.1-BETA3 or -RC1 I ended up with several
>> zfs pools that can no longer be imported.
>> The zpool import command
>> (with no arguments) does show all available pools, but trying to
>> import one just hangs and the command cannot be aborted, although
>> the rest of the system is still alive and fine:
>>
>>   # zpool import -f <mypool>
>>
>> Typing ^T just shows an idle process, waiting on [tx->tx_sync_done_cv]:
>>
>>   load: 0.20  cmd: zpool 939 [tx->tx_sync_done_cv] 5723.65r 0.01u 0.02s 0% 8220k
>>   load: 0.16  cmd: zpool 939 [tx->tx_sync_done_cv] 5735.73r 0.01u 0.02s 0% 8220k
>>   load: 0.14  cmd: zpool 939 [tx->tx_sync_done_cv] 5741.83r 0.01u 0.02s 0% 8220k
>>   load: 0.13  cmd: zpool 939 [tx->tx_sync_done_cv] 5749.16r 0.01u 0.02s 0% 8220k
>>
>> ps shows (on a system re-booted to a LiveCD running FreeBSD-10.1-RC1):
>>
>>   PID    TID COMM     TDNAME   CPU  PRI STATE  WCHAN
>>   939 100632 zpool    -          5  122 sleep  tx->tx_s
>>
>>   UID PID PPID CPU PRI NI    VSZ  RSS MWCHAN   STAT TT     TIME COMMAND
>>     0 939  801   0  22  0 107732 8236 tx->tx_s D+   v0  0:00.04
>>       zpool import -f -o cachefile=/tmp/zpool.cache -R /tmp/sys0boot sys0boot
>>
>>   NWCHAN
>>   fffff8007b0f2a20
>>
>>   # procstat -kk 939
>>
>>   PID    TID COMM     TDNAME   KSTACK
>>   939 100632 zpool    -        mi_switch+0xe1 sleepq_wait+0x3a _cv_wait+0x16d txg_wait_synced+0x85
>>     spa_load+0x1cd1 spa_load_best+0x6f spa_import+0x1ff zfs_ioc_pool_import+0x137 zfsdev_ioctl+0x6f0 devfs_ioctl_f+0x114
>>     kern_ioctl+0x255 sys_ioctl+0x13c amd64_syscall+0x351 Xfast_syscall+0xfb
>>
>> Background story: the system where this happened was being kept
>> at a fairly recent 10-STABLE. The last upgrade was very close to
>> the BETA3 release. There are a couple of zfs pools there, one on a
>> mirrored pair of SSDs mostly holding the OS, one on a mirrored
>> pair of large spindles, and three more small ones (4 GiB each),
>> mostly for boot redundancy or testing - these small ones are on
>> old smallish disks. These disks are different, and attached to
>> different SATA controllers (LSI and onboard Intel). The pools were
>> mostly kept up to date with the most recent zpool feature set
>> throughout their lifetime (some starting their life with 9.0, some
>> with 10.0).
>>
>> About two weeks ago, after a reboot to a 10-STABLE of the day,
>> the small pools became unavailable, but the regular two large
>> pools were still normal. At first I wasn't giving much attention
>> to that, as these pools were on oldish disks and nonessential
>> for normal operation, blaming potentially crappy hw.
>>
>> Today I needed to reboot (for an unrelated reason), and the
>> machine was no longer able to mount the boot pool.
>> The first instinct was - disks are malfunctioning - but ...
>>
>> Booting it to a FreeBSD-10.1-RC1 LiveCD was successful.
>> A smartmon disk test shows no problems. dd is able to read whole
>> partitions of each problematic pool. And most importantly,
>> running 'zdb -e -cc' on each (non-imported) pool was churning
>> normally and steadily, producing a stats report at the end,
>> and reported no errors.
>>
>> As a final proof that the disks are fine, I sacrificed one of the broken
>> 4 GiB GPT partitions with one of the problematic pools, and
>> did a fresh 10.1-RC1 install on it from a distribution ISO DVD.
>> The installation went fine and the system does boot and run
>> fine from the newly installed OS. Trying to import one of the
>> remaining old pools hangs the import command as before.
>>
>> As further proof, I copied (with dd) one of the broken 4 GiB
>> partitions to a file on another system (running 10.1-BETA3,
>> which did not suffer from this problem), made a memory disk
>> out of this file, then ran zfs import on this pool - and it hangs
>> there too! So hardware was not the problem - either these partitions
>> are truly broken (even though zdb -cc says they are fine),
>> or the new OS is somehow no longer able to import them.
>>
>> Please advise.
>>
>> I have a copy of the 4 GiB partition in a 400 MB compressed
>> file available, if somebody would be willing to play with it.
>>
>> I also have a ktrace of the 'zpool import' command. Its last
>> actions before it hangs are:
>>
>>   939 zpool RET   madvise 0
>>   939 zpool CALL  madvise(0x80604e000,0x1000,MADV_FREE)
>>   939 zpool RET   madvise 0
>>   939 zpool CALL  close(0x6)
>>   939 zpool RET   close 0
>>   939 zpool CALL  ioctl(0x3,0xc0185a05,0x7fffffffbf00)
>>   939 zpool RET   ioctl -1 errno 2 No such file or directory
>>   939 zpool CALL  madvise(0x802c71000,0x10000,MADV_FREE)
>>   939 zpool RET   madvise 0
>>   939 zpool CALL  madvise(0x802ca5000,0x1000,MADV_FREE)
>>   939 zpool RET   madvise 0
>>   939 zpool CALL  ioctl(0x3,0xc0185a06,0x7fffffffbf60)
>>   939 zpool RET   ioctl 0
>>   939 zpool CALL  ioctl(0x3,0xc0185a06,0x7fffffffbf60)
>>   939 zpool RET   ioctl 0
>>   939 zpool CALL  stat(0x802c380e0,0x7fffffffbc58)
>>   939 zpool NAMI  "/tmp"
>>   939 zpool STRU  struct stat {dev=273, ino=2, mode=041777, nlink=8, uid=0, gid=0, rdev=96, atime=1412866648,
>>     stime=1412871393, ctime=1412871393, birthtime=1412866648, size=512, blksize=32768, blocks=8, flags=0x0 }
>>   939 zpool RET   stat 0
>>   939 zpool CALL  ioctl(0x3,0xc0185a02,0x7fffffffbc60)
>>
>>
>>   Mark
>
> _______________________________________________
> freebsd-stable@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"
>
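P.S. The stacks earlier in this mail are kgdb output. If you want to pull similar information from the other affected pools on a live system, something along these lines should work (paths are from memory and may differ on your install, so treat this as a sketch):

    (in one terminal, let it hang)
    # zpool import sys0boot

    (in another)
    # procstat -kk `pgrep zpool`
    # kgdb /boot/kernel/kernel /dev/mem
    (kgdb) info threads
    (kgdb) thread apply all bt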