Date:      Fri, 19 Oct 2018 22:22:32 +0000
From:      bugzilla-noreply@freebsd.org
To:        fs@FreeBSD.org
Subject:   [Bug 227784] zfs: Fatal trap 9: general protection fault while in kernel mode on shutdown
Message-ID:  <bug-227784-3630-pgX8govX8V@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-227784-3630@https.bugs.freebsd.org/bugzilla/>
References:  <bug-227784-3630@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=227784

Mark Johnston <markj@FreeBSD.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |allanjude@FreeBSD.org,
                   |                            |mav@FreeBSD.org

--- Comment #15 from Mark Johnston <markj@FreeBSD.org> ---
I took a look at a vmcore provided by wulf@.  At the time of the panic, the
kernel was waiting for MOS dnode dbuf evictions to finish:

(kgdb) bt
#0  sched_switch (td=0xfffff800035d3000, newtd=0xfffff800035d2580,
    flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2112
#1  0xffffffff806a759f in mi_switch (flags=260, newtd=0x0)
    at /usr/src/sys/kern/kern_synch.c:439
#2  0xffffffff806f0d8d in sleepq_switch (wchan=0xfffffe008dffe390, pri=0)
    at /usr/src/sys/kern/subr_sleepqueue.c:613
#3  0xffffffff806f0c33 in sleepq_wait (wchan=0xfffffe008dffe390, pri=0)
    at /usr/src/sys/kern/subr_sleepqueue.c:692
#4  0xffffffff806381f3 in _cv_wait (cvp=0xfffffe008dffe390, lock=<optimized out>)
    at /usr/src/sys/kern/kern_condvar.c:146
#5  0xffffffff8039d5db in spa_evicting_os_wait (spa=<optimized out>)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c:1959
#6  0xffffffff8038ad9b in spa_deactivate (spa=0xfffffe008dffe000)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:1272
#7  0xffffffff80393b88 in spa_evict_all ()
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:8350
#8  0xffffffff8039dade in spa_fini ()
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c:2141
#9  0xffffffff803e6bdc in zfs__fini ()
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c:7109
#10 0xffffffff8069bf86 in kern_reboot (howto=16392)
    at /usr/src/sys/kern/kern_shutdown.c:443
#11 0xffffffff8069bb4a in sys_reboot (td=<optimized out>, uap=0xfffff800035d33c0)
    at /usr/src/sys/kern/kern_shutdown.c:280
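
For reference, spa_evicting_os_wait() in spa_misc.c is essentially a cv_wait()
loop on spa_evicting_os_cv until spa_evicting_os_list empties, followed by a
call to dmu_buf_user_evict_wait().  A paraphrased sketch (from memory, not the
verbatim source):

void
spa_evicting_os_wait(spa_t *spa)
{
        /* Block until every objset queued for async eviction is gone. */
        mutex_enter(&spa->spa_evicting_os_lock);
        while (!list_is_empty(&spa->spa_evicting_os_list))
                cv_wait(&spa->spa_evicting_os_cv, &spa->spa_evicting_os_lock);
        mutex_exit(&spa->spa_evicting_os_lock);

        /* Then wait for in-flight dbuf user evictions to finish. */
        dmu_buf_user_evict_wait();
}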

At this point, the spa_unload() call preceding the spa_deactivate() call had
already freed the pool.  However, dsl_pool_close() calls
dmu_buf_user_evict_wait() after kicking off evictions of top-level directories:

 452         /*
 453          * Drop our references from dsl_pool_open().
 454          *
 455          * Since we held the origin_snap from "syncing" context (which
 456          * includes pool-opening context), it actually only got a "ref"
 457          * and not a hold, so just drop that here.
 458          */
 459         if (dp->dp_origin_snap != NULL)
 460                 dsl_dataset_rele(dp->dp_origin_snap, dp);
 461         if (dp->dp_mos_dir != NULL)
 462                 dsl_dir_rele(dp->dp_mos_dir, dp);
 463         if (dp->dp_free_dir != NULL)
 464                 dsl_dir_rele(dp->dp_free_dir, dp);
 465         if (dp->dp_leak_dir != NULL)
 466                 dsl_dir_rele(dp->dp_leak_dir, dp);
 467         if (dp->dp_root_dir != NULL)
 468                 dsl_dir_rele(dp->dp_root_dir, dp);
...
 496         dmu_buf_user_evict_wait();

Looking a bit at the dbuf:

(kgdb) frame 12
#12 0xffffffff8036221c in dsl_dir_evict_async (dbu=0xfffff800053da400)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c:158
158             spa_async_close(dd->dd_pool->dp_spa, dd);
(kgdb) p dd->dd_myname
$42 = "$ORIGIN", '\000' <repeats 248 times>
(kgdb) p dd->dd_parent->dd_myname
$43 = "u01", '\000' <repeats 252 times>

I'm not sure what $ORIGIN is; I guess it's some ZFS metadata.

I looked at taskq_wait() in FreeBSD vs. illumos.  On FreeBSD it will only
wait for currently queued tasks to finish; anything enqueued after the drain
starts may not be finished by the time we return.  On illumos it looks like
taskq_wait() will wait until the queue is completely empty.  So, if the async
evictions queue some additional evictions, on FreeBSD we won't recursively
wait, and the taskq_wait() will return early.  I can't tell if ZFS is making
this assumption though.
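
To illustrate the two drain semantics (a standalone userspace sketch, not the
kernel code; the workq structure and counters are made up for illustration):

#include <pthread.h>

struct workq {
        pthread_mutex_t lock;
        pthread_cond_t  cv;           /* signalled whenever a task completes */
        unsigned long   enqueued;     /* total tasks ever enqueued */
        unsigned long   completed;    /* total tasks completed */
};

/* illumos-style taskq_wait(): return only once the queue is completely idle. */
static void
drain_until_empty(struct workq *wq)
{
        pthread_mutex_lock(&wq->lock);
        while (wq->completed != wq->enqueued)
                pthread_cond_wait(&wq->cv, &wq->lock);
        pthread_mutex_unlock(&wq->lock);
}

/* FreeBSD-style drain: wait only for tasks queued before the drain started. */
static void
drain_snapshot(struct workq *wq)
{
        unsigned long target;

        pthread_mutex_lock(&wq->lock);
        target = wq->enqueued;        /* snapshot taken when the drain starts */
        while (wq->completed < target)
                pthread_cond_wait(&wq->cv, &wq->lock);
        pthread_mutex_unlock(&wq->lock);
        /* Tasks enqueued after the snapshot may still be pending here. */
}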

-- 
You are receiving this mail because:
You are the assignee for the bug.


