Date: Fri, 20 Dec 2013 14:44:05 +0100 From: Pawel Jakub Dawidek <pjd@FreeBSD.org> To: krichy@tvnetwork.hu Cc: freebsd-fs@freebsd.org Subject: Re: kern/184677 / ZFS snapshot handling deadlocks Message-ID: <20131220134405.GE1658@garage.freebsd.pl> In-Reply-To: <alpine.DEB.2.10.1312191629370.4344@krichy.tvnetwork.hu> References: <alpine.DEB.2.10.1312161647410.11599@krichy.tvnetwork.hu> <alpine.DEB.2.10.1312171326520.7714@krichy.tvnetwork.hu> <alpine.DEB.2.10.1312191629370.4344@krichy.tvnetwork.hu>
next in thread | previous in thread | raw e-mail | index | archive | help
--LKTjZJSUETSlgu2t Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Dec 19, 2013 at 04:46:11PM +0100, krichy@tvnetwork.hu wrote: > Dear devs, >=20 > I am attaching a more clear patch/fix for my snapshot handling issues=20 > (0002), but I would be happy if some ZFS expert would comment it. I am=20 > trying to solve it at least for two weeks now, and an ACK or a NACK would= =20 > be nice from someone. Also a commit is reverted since it also caused=20 > deadlocks. I've read its comments, which also eliminates deadlocks, but I= =20 > did not find any reference how to produce that deadlock. In my view=20 > reverting that makes my issues disappear, but I dont know what new cases= =20 > will it rise. Richard, I won't be able to analyse it myself anytime soon, because of other obligations, but I forwarded you e-mail to the zfs-devel@ mailing list (it is closed, but gathers FreeBSD ZFS devs). Hopefully someone from there will be able to help you. > I've rewritten traverse() to make more like upstream, added two extra=20 > VN_HOLD()s to snapdir_lookup() when traverse returned the same vnode what= =20 > was passed to it (I dont know even that upon creating a snapshot vnode wh= y=20 > is that extra two holds needed for the vnode.) And unfortunately, due to= =20 > FreeBSD calls vop_inactive callbacks with vnodes locked, that could also= =20 > cause deadlocks, so zfsctl_snapshot_inactive() and=20 > zfsctl_snapshot_vptocnp() has been rewritten to work that around. >=20 > After this, one may also get a deadlock, when a simple access would call= =20 > into zfsctl_snapshot_lookup(). The documentation says, that those vnodes= =20 > should always be covered, but after some stress test, sometimes we hit=20 > that call, and that can cause again deadlocks. In our environment I've=20 > just uncommented that callback, which returns ENODIR on some calls, but a= t=20 > least not a deadlock. >=20 > The attached script can be used to reproduce my cases (would one confirm= =20 > that?), and after the patches applied, they disappear (confirm?). >=20 > Thanks, >=20 >=20 > Kojedzinszky Richard > Euronet Magyarorszag Informatikai Zrt. >=20 > On Tue, 17 Dec 2013, krichy@tvnetwork.hu wrote: >=20 > > Date: Tue, 17 Dec 2013 14:50:16 +0100 (CET) > > From: krichy@tvnetwork.hu > > To: pjd@freebsd.org > > Cc: freebsd-fs@freebsd.org > > Subject: Re: kern/184677 (fwd) > >=20 > > Dear devs, > > > > I will sum up my experience regarding the issue: > > > > The sympton is that a concurrent 'zfs send -R' and some activity on the= =20 > > snapshot dir (or in the snapshot) may cause a deadlock. > > > > After investigating the problem, I found that zfs send umounts the snap= shots,=20 > > and that causes the deadlock, so later I tested only with concurrent um= ount=20 > > and the "activity". More later I found that listing the snapshots in=20 > > .zfs/snapshot/ and unounting them can cause the found deadlock, so I us= ed=20 > > them for the tests. But for my surprise, instead of a deadlock, a recur= sive=20 > > lock panic has arised. > > > > The vnode for the ".zfs/snapshot/" directory contains zfs's zfsctl_snap= dir_t=20 > > structure (sdp). This contains a tree of mounted snapshots, and each en= try=20 > > (sep) contains the vnode of entry on which the snapshot is mounted on t= op=20 > > (se_root). The strange is that the se_root member does not hold a refer= ence=20 > > for the vnode, just a simple pointer to it. > > > > Upon entry lookup (zfsctl_snapdir_lookup()) the "snapshot" vnode is loc= ked,=20 > > the zfsctl_snapdir_t's tree is locked, and searched for the mount if it= =20 > > exists already. If it founds no entry, does the mount. In the case of a= n=20 > > entry was found, the se_root member contains the vnode which the snapsh= ot is=20 > > mounted on. Thus, a reference is taken for it, and the traverse() call = will=20 > > resolve to the real root vnode of the mounted snapshot, returning it as= =20 > > locked. (Examining the traverse() code I've found that it did not follo= w=20 > > FreeBSD's lock order recommendation described in sys/kern/vfs_subr.c.) > > > > On the other way, when an umount is issued, the se_root vnode looses it= s last=20 > > reference (as only the mountpoint holds one for it), it goes through th= e=20 > > vinactive() path, to zfsctl_snapshot_inactive(). In FreeBSD this is cal= led=20 > > with a locked vnode, so this is a deadlock race condition. While=20 > > zfsctl_snapdir_lookup() holds the mutex for the sdp tree, and traverse(= )=20 > > tries to acquire the se_root, zfsctl_snapshot_inactive() holds the lock= on=20 > > se_root while tries to access the sdp lock. > > > > The zfsctl_snapshot_inactive() has an if statement checking the v_useco= unt,=20 > > which is incremented in zfsctl_snapdir_lookup(), but in that context it= is=20 > > not covered by VI_LOCK. And it seems to me that FreeBSD's vinactive() p= ath=20 > > assumes that the vnode remains inactive (as opposed to illumos, at leas= t how=20 > > i read the code). So zfsctl_snapshot_inactive() must free resources whi= le in=20 > > a locked state. I was a bit confused, and probably that is why the prev= iously=20 > > posted patch is as is. > > > > Maybe if I had some clues on the directions of this problem, I could ha= ve=20 > > worked more for a nicer, shorter solution. > > > > Please someone comment on my post. > > > > Regards, > > > > > > > > Kojedzinszky Richard > > Euronet Magyarorszag Informatikai Zrt. > > > > On Mon, 16 Dec 2013, krichy@tvnetwork.hu wrote: > > > >> Date: Mon, 16 Dec 2013 16:52:16 +0100 (CET) > >> From: krichy@tvnetwork.hu > >> To: pjd@freebsd.org > >> Cc: freebsd-fs@freebsd.org > >> Subject: Re: kern/184677 (fwd) > >>=20 > >> Dear PJD, > >>=20 > >> I am a happy FreeBSD user, I am sure you've read my previous posts=20 > >> regarding some issues in ZFS. Please give some advice for me, where to= look=20 > >> for solutions, or how could I help to resolve those issues. > >>=20 > >> Regards, > >> Kojedzinszky Richard > >> Euronet Magyarorszag Informatikai Zrt. > >>=20 > >> ---------- Forwarded message ---------- > >> Date: Mon, 16 Dec 2013 15:23:06 +0100 (CET) > >> From: krichy@tvnetwork.hu > >> To: freebsd-fs@freebsd.org > >> Subject: Re: kern/184677 > >>=20 > >>=20 > >> Seems that pjd did a change which eliminated the zfsdev_state_lock on = Fri=20 > >> Aug 12 07:04:16 2011 +0000, which might introduced a new deadlock=20 > >> situation. Any comments on this? > >>=20 > >>=20 > >> Kojedzinszky Richard > >> Euronet Magyarorszag Informatikai Zrt. > >>=20 > >> On Mon, 16 Dec 2013, krichy@tvnetwork.hu wrote: > >>=20 > >>> Date: Mon, 16 Dec 2013 11:08:11 +0100 (CET) > >>> From: krichy@tvnetwork.hu > >>> To: freebsd-fs@freebsd.org > >>> Subject: kern/184677 > >>>=20 > >>> Dear devs, > >>>=20 > >>> I've attached a patch, which makes the recursive lockmgr disappear, a= nd=20 > >>> makes the reported bug to disappear. I dont know if I followed any=20 > >>> guidelines well, or not, but at least it works for me. Please some=20 > >>> ZFS/FreeBSD fs expert review it, and fix it where it needed. > >>>=20 > >>> But unfortunately, my original problem is still not solved, maybe the= same=20 > >>> as Ryan's:=20 > >>> http://lists.freebsd.org/pipermail/freebsd-fs/2013-December/018707.ht= ml > >>>=20 > >>> Tracing the problem down is that zfsctl_snapdir_lookup() tries to acq= uire=20 > >>> spa_namespace_lock while when finishing a zfs send -R does a=20 > >>> zfsdev_close(), and that also holds the same mutex. And this causes a= =20 > >>> deadlock scenario. I looked at illumos's code, and for some reason th= ey=20 > >>> use another mutex on zfsdev_close(), which therefore may not deadlock= with=20 > >>> zfsctl_snapdir_lookup(). But I am still investigating the problem. > >>>=20 > >>> I would like to help making ZFS more stable on freebsd also with its = whole=20 > >>> functionality. I would be very thankful if some expert would give som= e=20 > >>> advice, how to solve these bugs. PJD, Steven, Xin? > >>>=20 > >>> Thanks in advance, > >>>=20 > >>>=20 > >>> Kojedzinszky Richard > >>> Euronet Magyarorszag Informatikai Zrt. > >> _______________________________________________ > >> freebsd-fs@freebsd.org mailing list > >> http://lists.freebsd.org/mailman/listinfo/freebsd-fs > >> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" > >>=20 > > > From 39298da838d006ad225e41529d7b7f240fccfe73 Mon Sep 17 00:00:00 2001 > From: Richard Kojedzinszky <krichy@cflinux.hu> > Date: Mon, 16 Dec 2013 15:39:11 +0100 > Subject: [PATCH 1/2] Revert "Eliminate the zfsdev_state_lock entirely and > replace it with the" >=20 > This reverts commit 1d8972b3f353f986eb5b85bc108b1c0d946d3218. >=20 > Conflicts: > sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c > --- > .../opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h | 1 + > .../opensolaris/uts/common/fs/zfs/vdev_geom.c | 14 ++- > .../opensolaris/uts/common/fs/zfs/zfs_ioctl.c | 16 +-- > .../contrib/opensolaris/uts/common/fs/zfs/zvol.c | 119 +++++++++------= ------ > 4 files changed, 70 insertions(+), 80 deletions(-) >=20 > diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ioctl= =2Eh b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h > index af2def2..8272c4d 100644 > --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h > +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h > @@ -383,6 +383,7 @@ extern void *zfsdev_get_soft_state(minor_t minor, > extern minor_t zfsdev_minor_alloc(void); > =20 > extern void *zfsdev_state; > +extern kmutex_t zfsdev_state_lock; > =20 > #endif /* _KERNEL */ > =20 > diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c b= /sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c > index 15685a5..5c3e9f3 100644 > --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c > +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c > @@ -581,7 +581,7 @@ vdev_geom_open(vdev_t *vd, uint64_t *psize, uint64_t = *max_psize, > struct g_provider *pp; > struct g_consumer *cp; > size_t bufsize; > - int error; > + int error, lock; > =20 > /* > * We must have a pathname, and it must be absolute. > @@ -593,6 +593,12 @@ vdev_geom_open(vdev_t *vd, uint64_t *psize, uint64_t= *max_psize, > =20 > vd->vdev_tsd =3D NULL; > =20 > + if (mutex_owned(&spa_namespace_lock)) { > + mutex_exit(&spa_namespace_lock); > + lock =3D 1; > + } else { > + lock =3D 0; > + } > DROP_GIANT(); > g_topology_lock(); > error =3D 0; > @@ -624,7 +630,11 @@ vdev_geom_open(vdev_t *vd, uint64_t *psize, uint64_t= *max_psize, > !ISP2(cp->provider->sectorsize)) { > ZFS_LOG(1, "Provider %s has unsupported sectorsize.", > vd->vdev_path); > + > + g_topology_lock(); > vdev_geom_detach(cp, 0); > + g_topology_unlock(); > + > error =3D EINVAL; > cp =3D NULL; > } else if (cp->acw =3D=3D 0 && (spa_mode(vd->vdev_spa) & FWRITE) !=3D 0= ) { > @@ -647,6 +657,8 @@ vdev_geom_open(vdev_t *vd, uint64_t *psize, uint64_t = *max_psize, > } > g_topology_unlock(); > PICKUP_GIANT(); > + if (lock) > + mutex_enter(&spa_namespace_lock); > if (cp =3D=3D NULL) { > vd->vdev_stat.vs_aux =3D VDEV_AUX_OPEN_FAILED; > return (error); > diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c b= /sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c > index e9fba26..91becde 100644 > --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c > +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c > @@ -5635,7 +5635,7 @@ zfsdev_minor_alloc(void) > static minor_t last_minor; > minor_t m; > =20 > - ASSERT(MUTEX_HELD(&spa_namespace_lock)); > + ASSERT(MUTEX_HELD(&zfsdev_state_lock)); > =20 > for (m =3D last_minor + 1; m !=3D last_minor; m++) { > if (m > ZFSDEV_MAX_MINOR) > @@ -5655,7 +5655,7 @@ zfs_ctldev_init(struct cdev *devp) > minor_t minor; > zfs_soft_state_t *zs; > =20 > - ASSERT(MUTEX_HELD(&spa_namespace_lock)); > + ASSERT(MUTEX_HELD(&zfsdev_state_lock)); > =20 > minor =3D zfsdev_minor_alloc(); > if (minor =3D=3D 0) > @@ -5676,7 +5676,7 @@ zfs_ctldev_init(struct cdev *devp) > static void > zfs_ctldev_destroy(zfs_onexit_t *zo, minor_t minor) > { > - ASSERT(MUTEX_HELD(&spa_namespace_lock)); > + ASSERT(MUTEX_HELD(&zfsdev_state_lock)); > =20 > zfs_onexit_destroy(zo); > ddi_soft_state_free(zfsdev_state, minor); > @@ -5706,9 +5706,9 @@ zfsdev_open(struct cdev *devp, int flag, int mode, = struct thread *td) > =20 > /* This is the control device. Allocate a new minor if requested. */ > if (flag & FEXCL) { > - mutex_enter(&spa_namespace_lock); > + mutex_enter(&zfsdev_state_lock); > error =3D zfs_ctldev_init(devp); > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > } > =20 > return (error); > @@ -5723,14 +5723,14 @@ zfsdev_close(void *data) > if (minor =3D=3D 0) > return; > =20 > - mutex_enter(&spa_namespace_lock); > + mutex_enter(&zfsdev_state_lock); > zo =3D zfsdev_get_soft_state(minor, ZSST_CTLDEV); > if (zo =3D=3D NULL) { > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > return; > } > zfs_ctldev_destroy(zo, minor); > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > } > =20 > static int > diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c b/sys/= cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c > index 72d4502..aec5219 100644 > --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c > +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c > @@ -104,11 +104,12 @@ static char *zvol_tag =3D "zvol_tag"; > #define ZVOL_DUMPSIZE "dumpsize" > =20 > /* > - * The spa_namespace_lock protects the zfsdev_state structure from being > - * modified while it's being used, e.g. an open that comes in before a > - * create finishes. It also protects temporary opens of the dataset so = that, > + * This lock protects the zfsdev_state structure from being modified > + * while it's being used, e.g. an open that comes in before a create > + * finishes. It also protects temporary opens of the dataset so that, > * e.g., an open doesn't get a spurious EBUSY. > */ > +kmutex_t zfsdev_state_lock; > static uint32_t zvol_minors; > =20 > typedef struct zvol_extent { > @@ -249,7 +250,7 @@ zvol_minor_lookup(const char *name) > struct g_geom *gp; > zvol_state_t *zv =3D NULL; > =20 > - ASSERT(MUTEX_HELD(&spa_namespace_lock)); > + ASSERT(MUTEX_HELD(&zfsdev_state_lock)); > =20 > g_topology_lock(); > LIST_FOREACH(gp, &zfs_zvol_class.geom, geom) { > @@ -465,11 +466,11 @@ zvol_name2minor(const char *name, minor_t *minor) > { > zvol_state_t *zv; > =20 > - mutex_enter(&spa_namespace_lock); > + mutex_enter(&zfsdev_state_lock); > zv =3D zvol_minor_lookup(name); > if (minor && zv) > *minor =3D zv->zv_minor; > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > return (zv ? 0 : -1); > } > #endif /* sun */ > @@ -489,10 +490,10 @@ zvol_create_minor(const char *name) > =20 > ZFS_LOG(1, "Creating ZVOL %s...", name); > =20 > - mutex_enter(&spa_namespace_lock); > + mutex_enter(&zfsdev_state_lock); > =20 > if (zvol_minor_lookup(name) !=3D NULL) { > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > return (SET_ERROR(EEXIST)); > } > =20 > @@ -500,20 +501,20 @@ zvol_create_minor(const char *name) > error =3D dmu_objset_own(name, DMU_OST_ZVOL, B_TRUE, FTAG, &os); > =20 > if (error) { > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > return (error); > } > =20 > #ifdef sun > if ((minor =3D zfsdev_minor_alloc()) =3D=3D 0) { > dmu_objset_disown(os, FTAG); > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > return (SET_ERROR(ENXIO)); > } > =20 > if (ddi_soft_state_zalloc(zfsdev_state, minor) !=3D DDI_SUCCESS) { > dmu_objset_disown(os, FTAG); > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > return (SET_ERROR(EAGAIN)); > } > (void) ddi_prop_update_string(minor, zfs_dip, ZVOL_PROP_NAME, > @@ -525,7 +526,7 @@ zvol_create_minor(const char *name) > minor, DDI_PSEUDO, 0) =3D=3D DDI_FAILURE) { > ddi_soft_state_free(zfsdev_state, minor); > dmu_objset_disown(os, FTAG); > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > return (SET_ERROR(EAGAIN)); > } > =20 > @@ -536,7 +537,7 @@ zvol_create_minor(const char *name) > ddi_remove_minor_node(zfs_dip, chrbuf); > ddi_soft_state_free(zfsdev_state, minor); > dmu_objset_disown(os, FTAG); > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > return (SET_ERROR(EAGAIN)); > } > =20 > @@ -587,7 +588,7 @@ zvol_create_minor(const char *name) > =20 > zvol_minors++; > =20 > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > =20 > zvol_geom_run(zv); > =20 > @@ -609,7 +610,7 @@ zvol_remove_zv(zvol_state_t *zv) > minor_t minor =3D zv->zv_minor; > #endif > =20 > - ASSERT(MUTEX_HELD(&spa_namespace_lock)); > + ASSERT(MUTEX_HELD(&zfsdev_state_lock)); > if (zv->zv_total_opens !=3D 0) > return (SET_ERROR(EBUSY)); > =20 > @@ -635,15 +636,15 @@ zvol_remove_minor(const char *name) > zvol_state_t *zv; > int rc; > =20 > - mutex_enter(&spa_namespace_lock); > + mutex_enter(&zfsdev_state_lock); > if ((zv =3D zvol_minor_lookup(name)) =3D=3D NULL) { > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > return (SET_ERROR(ENXIO)); > } > g_topology_lock(); > rc =3D zvol_remove_zv(zv); > g_topology_unlock(); > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > return (rc); > } > =20 > @@ -755,7 +756,7 @@ zvol_update_volsize(objset_t *os, uint64_t volsize) > dmu_tx_t *tx; > int error; > =20 > - ASSERT(MUTEX_HELD(&spa_namespace_lock)); > + ASSERT(MUTEX_HELD(&zfsdev_state_lock)); > =20 > tx =3D dmu_tx_create(os); > dmu_tx_hold_zap(tx, ZVOL_ZAP_OBJ, TRUE, NULL); > @@ -786,7 +787,7 @@ zvol_remove_minors(const char *name) > namelen =3D strlen(name); > =20 > DROP_GIANT(); > - mutex_enter(&spa_namespace_lock); > + mutex_enter(&zfsdev_state_lock); > g_topology_lock(); > =20 > LIST_FOREACH_SAFE(gp, &zfs_zvol_class.geom, geom, gptmp) { > @@ -804,7 +805,7 @@ zvol_remove_minors(const char *name) > } > =20 > g_topology_unlock(); > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > PICKUP_GIANT(); > } > =20 > @@ -818,10 +819,10 @@ zvol_set_volsize(const char *name, major_t maj, uin= t64_t volsize) > uint64_t old_volsize =3D 0ULL; > uint64_t readonly; > =20 > - mutex_enter(&spa_namespace_lock); > + mutex_enter(&zfsdev_state_lock); > zv =3D zvol_minor_lookup(name); > if ((error =3D dmu_objset_hold(name, FTAG, &os)) !=3D 0) { > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > return (error); > } > =20 > @@ -888,7 +889,7 @@ zvol_set_volsize(const char *name, major_t maj, uint6= 4_t volsize) > out: > dmu_objset_rele(os, FTAG); > =20 > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > =20 > return (error); > } > @@ -899,36 +900,19 @@ zvol_open(struct g_provider *pp, int flag, int coun= t) > { > zvol_state_t *zv; > int err =3D 0; > - boolean_t locked =3D B_FALSE; > =20 > - /* > - * Protect against recursively entering spa_namespace_lock > - * when spa_open() is used for a pool on a (local) ZVOL(s). > - * This is needed since we replaced upstream zfsdev_state_lock > - * with spa_namespace_lock in the ZVOL code. > - * We are using the same trick as spa_open(). > - * Note that calls in zvol_first_open which need to resolve > - * pool name to a spa object will enter spa_open() > - * recursively, but that function already has all the > - * necessary protection. > - */ > - if (!MUTEX_HELD(&spa_namespace_lock)) { > - mutex_enter(&spa_namespace_lock); > - locked =3D B_TRUE; > - } > + mutex_enter(&zfsdev_state_lock); > =20 > zv =3D pp->private; > if (zv =3D=3D NULL) { > - if (locked) > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > return (SET_ERROR(ENXIO)); > } > =20 > if (zv->zv_total_opens =3D=3D 0) > err =3D zvol_first_open(zv); > if (err) { > - if (locked) > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > return (err); > } > if ((flag & FWRITE) && (zv->zv_flags & ZVOL_RDONLY)) { > @@ -950,15 +934,13 @@ zvol_open(struct g_provider *pp, int flag, int coun= t) > #endif > =20 > zv->zv_total_opens +=3D count; > - if (locked) > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > =20 > return (err); > out: > if (zv->zv_total_opens =3D=3D 0) > zvol_last_close(zv); > - if (locked) > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > return (err); > } > =20 > @@ -968,18 +950,12 @@ zvol_close(struct g_provider *pp, int flag, int cou= nt) > { > zvol_state_t *zv; > int error =3D 0; > - boolean_t locked =3D B_FALSE; > =20 > - /* See comment in zvol_open(). */ > - if (!MUTEX_HELD(&spa_namespace_lock)) { > - mutex_enter(&spa_namespace_lock); > - locked =3D B_TRUE; > - } > + mutex_enter(&zfsdev_state_lock); > =20 > zv =3D pp->private; > if (zv =3D=3D NULL) { > - if (locked) > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > return (SET_ERROR(ENXIO)); > } > =20 > @@ -1002,8 +978,7 @@ zvol_close(struct g_provider *pp, int flag, int coun= t) > if (zv->zv_total_opens =3D=3D 0) > zvol_last_close(zv); > =20 > - if (locked) > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > return (error); > } > =20 > @@ -1658,12 +1633,12 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int = flag, cred_t *cr, int *rvalp) > int error =3D 0; > rl_t *rl; > =20 > - mutex_enter(&spa_namespace_lock); > + mutex_enter(&zfsdev_state_lock); > =20 > zv =3D zfsdev_get_soft_state(getminor(dev), ZSST_ZVOL); > =20 > if (zv =3D=3D NULL) { > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > return (SET_ERROR(ENXIO)); > } > ASSERT(zv->zv_total_opens > 0); > @@ -1677,7 +1652,7 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int fl= ag, cred_t *cr, int *rvalp) > dki.dki_ctype =3D DKC_UNKNOWN; > dki.dki_unit =3D getminor(dev); > dki.dki_maxtransfer =3D 1 << (SPA_MAXBLOCKSHIFT - zv->zv_min_bs); > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > if (ddi_copyout(&dki, (void *)arg, sizeof (dki), flag)) > error =3D SET_ERROR(EFAULT); > return (error); > @@ -1687,7 +1662,7 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int fl= ag, cred_t *cr, int *rvalp) > dkm.dki_lbsize =3D 1U << zv->zv_min_bs; > dkm.dki_capacity =3D zv->zv_volsize >> zv->zv_min_bs; > dkm.dki_media_type =3D DK_UNKNOWN; > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > if (ddi_copyout(&dkm, (void *)arg, sizeof (dkm), flag)) > error =3D SET_ERROR(EFAULT); > return (error); > @@ -1697,14 +1672,14 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int = flag, cred_t *cr, int *rvalp) > uint64_t vs =3D zv->zv_volsize; > uint8_t bs =3D zv->zv_min_bs; > =20 > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > error =3D zvol_getefi((void *)arg, flag, vs, bs); > return (error); > } > =20 > case DKIOCFLUSHWRITECACHE: > dkc =3D (struct dk_callback *)arg; > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > zil_commit(zv->zv_zilog, ZVOL_OBJ); > if ((flag & FKIOCTL) && dkc !=3D NULL && dkc->dkc_callback) { > (*dkc->dkc_callback)(dkc->dkc_cookie, error); > @@ -1730,10 +1705,10 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int = flag, cred_t *cr, int *rvalp) > } > if (wce) { > zv->zv_flags |=3D ZVOL_WCE; > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > } else { > zv->zv_flags &=3D ~ZVOL_WCE; > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > zil_commit(zv->zv_zilog, ZVOL_OBJ); > } > return (0); > @@ -1828,7 +1803,7 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int fl= ag, cred_t *cr, int *rvalp) > break; > =20 > } > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > return (error); > } > #endif /* sun */ > @@ -1844,12 +1819,14 @@ zvol_init(void) > { > VERIFY(ddi_soft_state_init(&zfsdev_state, sizeof (zfs_soft_state_t), > 1) =3D=3D 0); > + mutex_init(&zfsdev_state_lock, NULL, MUTEX_DEFAULT, NULL); > ZFS_LOG(1, "ZVOL Initialized."); > } > =20 > void > zvol_fini(void) > { > + mutex_destroy(&zfsdev_state_lock); > ddi_soft_state_fini(&zfsdev_state); > ZFS_LOG(1, "ZVOL Deinitialized."); > } > @@ -1889,7 +1866,7 @@ zvol_dump_init(zvol_state_t *zv, boolean_t resize) > uint64_t version =3D spa_version(spa); > enum zio_checksum checksum; > =20 > - ASSERT(MUTEX_HELD(&spa_namespace_lock)); > + ASSERT(MUTEX_HELD(&zfsdev_state_lock)); > ASSERT(vd->vdev_ops =3D=3D &vdev_root_ops); > =20 > error =3D dmu_free_long_range(zv->zv_objset, ZVOL_OBJ, 0, > @@ -2437,7 +2414,7 @@ zvol_rename_minor(struct g_geom *gp, const char *ne= wname) > struct g_provider *pp; > zvol_state_t *zv; > =20 > - ASSERT(MUTEX_HELD(&spa_namespace_lock)); > + ASSERT(MUTEX_HELD(&zfsdev_state_lock)); > g_topology_assert(); > =20 > pp =3D LIST_FIRST(&gp->provider); > @@ -2471,7 +2448,7 @@ zvol_rename_minors(const char *oldname, const char = *newname) > newnamelen =3D strlen(newname); > =20 > DROP_GIANT(); > - mutex_enter(&spa_namespace_lock); > + mutex_enter(&zfsdev_state_lock); > g_topology_lock(); > =20 > LIST_FOREACH(gp, &zfs_zvol_class.geom, geom) { > @@ -2494,6 +2471,6 @@ zvol_rename_minors(const char *oldname, const char = *newname) > } > =20 > g_topology_unlock(); > - mutex_exit(&spa_namespace_lock); > + mutex_exit(&zfsdev_state_lock); > PICKUP_GIANT(); > } > --=20 > 1.8.4.2 >=20 > From 57d5a383b585c32c77af54e8cdacaddf8ce7584f Mon Sep 17 00:00:00 2001 > From: Richard Kojedzinszky <krichy@cflinux.hu> > Date: Wed, 18 Dec 2013 22:11:21 +0100 > Subject: [PATCH 2/2] ZFS snapshot handling fix >=20 > --- > .../compat/opensolaris/kern/opensolaris_lookup.c | 13 +++--- > .../opensolaris/uts/common/fs/zfs/zfs_ctldir.c | 53 +++++++++++++++-= ------ > 2 files changed, 42 insertions(+), 24 deletions(-) >=20 > diff --git a/sys/cddl/compat/opensolaris/kern/opensolaris_lookup.c b/sys/= cddl/compat/opensolaris/kern/opensolaris_lookup.c > index 94383d6..4cac053 100644 > --- a/sys/cddl/compat/opensolaris/kern/opensolaris_lookup.c > +++ b/sys/cddl/compat/opensolaris/kern/opensolaris_lookup.c > @@ -81,6 +81,8 @@ traverse(vnode_t **cvpp, int lktype) > * progress on this vnode. > */ > =20 > + vn_lock(cvp, lktype); > + > for (;;) { > /* > * Reached the end of the mount chain? > @@ -89,13 +91,7 @@ traverse(vnode_t **cvpp, int lktype) > if (vfsp =3D=3D NULL) > break; > error =3D vfs_busy(vfsp, 0); > - /* > - * tvp is NULL for *cvpp vnode, which we can't unlock. > - */ > - if (tvp !=3D NULL) > - vput(cvp); > - else > - vrele(cvp); > + VOP_UNLOCK(cvp, 0); > if (error) > return (error); > =20 > @@ -107,6 +103,9 @@ traverse(vnode_t **cvpp, int lktype) > vfs_unbusy(vfsp); > if (error !=3D 0) > return (error); > + > + VN_RELE(cvp); > + > cvp =3D tvp; > } > =20 > diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c = b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c > index 28ab1fa..d3464b4 100644 > --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c > +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c > @@ -112,6 +112,25 @@ snapentry_compare(const void *a, const void *b) > return (0); > } > =20 > +/* Return the zfsctl_snapdir_t object from current vnode, following > + * the lock orders in zfsctl_snapdir_lookup() to avoid deadlocks. > + * On return the passed in vp is unlocked */ > +static zfsctl_snapdir_t * > +zfsctl_snapshot_get_snapdir(vnode_t *vp, vnode_t **dvpp) > +{ > + gfs_dir_t *dp =3D vp->v_data; > + *dvpp =3D dp->gfsd_file.gfs_parent; > + zfsctl_snapdir_t *sdp; > + > + VN_HOLD(*dvpp); > + VOP_UNLOCK(vp, 0); > + vn_lock(*dvpp, LK_SHARED | LK_RETRY | LK_CANRECURSE); > + sdp =3D (*dvpp)->v_data; > + VOP_UNLOCK(*dvpp, 0); > + > + return (sdp); > +} > + > #ifdef sun > vnodeops_t *zfsctl_ops_root; > vnodeops_t *zfsctl_ops_snapdir; > @@ -1013,6 +1032,8 @@ zfsctl_snapdir_lookup(ap) > * The snapshot was unmounted behind our backs, > * try to remount it. > */ > + VOP_UNLOCK(*vpp, 0); > + VN_HOLD(*vpp); > VERIFY(zfsctl_snapshot_zname(dvp, nm, MAXNAMELEN, snapname) =3D=3D 0); > goto domount; > } else { > @@ -1064,7 +1085,6 @@ zfsctl_snapdir_lookup(ap) > sep->se_name =3D kmem_alloc(strlen(nm) + 1, KM_SLEEP); > (void) strcpy(sep->se_name, nm); > *vpp =3D sep->se_root =3D zfsctl_snapshot_mknode(dvp, dmu_objset_id(sna= p)); > - VN_HOLD(*vpp); > avl_insert(&sdp->sd_snaps, sep, where); > =20 > dmu_objset_rele(snap, FTAG); > @@ -1075,6 +1095,7 @@ domount: > (void) snprintf(mountpoint, mountpoint_len, > "%s/" ZFS_CTLDIR_NAME "/snapshot/%s", > dvp->v_vfsp->mnt_stat.f_mntonname, nm); > + VN_HOLD(*vpp); > err =3D mount_snapshot(curthread, vpp, "zfs", mountpoint, snapname, 0); > kmem_free(mountpoint, mountpoint_len); > if (err =3D=3D 0) { > @@ -1464,16 +1485,18 @@ zfsctl_snapshot_inactive(ap) > int locked; > vnode_t *dvp; > =20 > - if (vp->v_count > 0) > - goto end; > - > - VERIFY(gfs_dir_lookup(vp, "..", &dvp, cr, 0, NULL, NULL) =3D=3D 0); > - sdp =3D dvp->v_data; > - VOP_UNLOCK(dvp, 0); > + sdp =3D zfsctl_snapshot_get_snapdir(vp, &dvp); > =20 > if (!(locked =3D MUTEX_HELD(&sdp->sd_lock))) > mutex_enter(&sdp->sd_lock); > =20 > + vn_lock(vp, LK_EXCLUSIVE | LK_RETRY); > + > + if (vp->v_count > 0) { > + mutex_exit(&sdp->sd_lock); > + return (0); > + } > + > ASSERT(!vn_ismntpt(vp)); > =20 > sep =3D avl_first(&sdp->sd_snaps); > @@ -1494,7 +1517,6 @@ zfsctl_snapshot_inactive(ap) > mutex_exit(&sdp->sd_lock); > VN_RELE(dvp); > =20 > -end: > /* > * Dispose of the vnode for the snapshot mount point. > * This is safe to do because once this entry has been removed > @@ -1595,20 +1617,17 @@ zfsctl_snapshot_lookup(ap) > static int > zfsctl_snapshot_vptocnp(struct vop_vptocnp_args *ap) > { > - zfsvfs_t *zfsvfs =3D ap->a_vp->v_vfsp->vfs_data; > - vnode_t *dvp, *vp; > + vnode_t *dvp, *vp =3D ap->a_vp; > zfsctl_snapdir_t *sdp; > zfs_snapentry_t *sep; > - int error; > + int error =3D 0; > =20 > - ASSERT(zfsvfs->z_ctldir !=3D NULL); > - error =3D zfsctl_root_lookup(zfsvfs->z_ctldir, "snapshot", &dvp, > - NULL, 0, NULL, kcred, NULL, NULL, NULL); > - if (error !=3D 0) > - return (error); > - sdp =3D dvp->v_data; > + sdp =3D zfsctl_snapshot_get_snapdir(vp, &dvp); > =20 > mutex_enter(&sdp->sd_lock); > + > + vn_lock(vp, LK_SHARED | LK_RETRY); > + > sep =3D avl_first(&sdp->sd_snaps); > while (sep !=3D NULL) { > vp =3D sep->se_root; > --=20 > 1.8.4.2 >=20 --=20 Pawel Jakub Dawidek http://www.wheelsystems.com FreeBSD committer http://www.FreeBSD.org Am I Evil? Yes, I Am! http://mobter.com --LKTjZJSUETSlgu2t Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (FreeBSD) iEYEARECAAYFAlK0SaUACgkQForvXbEpPzSzLACfddI+1gydBrna/vXLdDwR4+DW M2EAnROvev3FqMsIlPHznalQ1EyeeVXw =v+Ou -----END PGP SIGNATURE----- --LKTjZJSUETSlgu2t--
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20131220134405.GE1658>