Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 12 Dec 2006 15:52:51 +0200
From:      Kostik Belousov <kostikbel@gmail.com>
To:        Bruce Evans <bde@zeta.org.au>
Cc:        freebsd-stable@freebsd.org, freebsd-current@freebsd.org, Suleiman Souhlal <ssouhlal@freebsd.org>, V??clav Haisman <V.Haisman@sh.cvut.cz>, bde@freebsd.org, tegge@freebsd.org
Subject:   Re: kqueue LOR
Message-ID:  <20061212135251.GJ311@deviant.kiev.zoral.com.ua>
In-Reply-To: <20061212220705.F57430@delplex.bde.org>
References:  <456950AF.3090308@sh.cvut.cz> <20061127092146.GA69556@deviant.kiev.zoral.com.ua> <457E6C06.1020405@FreeBSD.org> <20061212101903.GF311@deviant.kiev.zoral.com.ua> <20061212220705.F57430@delplex.bde.org>

next in thread | previous in thread | raw e-mail | index | archive | help

--v2Uk6McLiE8OV1El
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Tue, Dec 12, 2006 at 11:49:42PM +1100, Bruce Evans wrote:
> On Tue, 12 Dec 2006, Kostik Belousov wrote:
>=20
> >On Tue, Dec 12, 2006 at 12:44:54AM -0800, Suleiman Souhlal wrote:
>=20
> >>Is the mount lock really required, if all we're doing is a single read =
of=20
> >>a
> >>single word (mnt_kern_flags) (v_mount should be read-only for the whole
> >>lifetime of the vnode, I believe)? After all, reads of a single word are
> >>atomic on all our supported architectures.
> >>The only situation I see where there MIGHT be problems are forced=20
> >>unmounts,
> >>but I think there are bigger issues with those.
> >>Sorry for noticing this email only now.
> >
> >The problem is real with snapshotting. Ignoring
> >MNTK_SUSPEND/MNTK_SUSPENDED flags (in particular, reading stale value of
> >mnt_kern_flag) while setting IN_MODIFIED caused deadlock at ufs vnode
> >inactivation time. This was the big trouble with nfsd and snapshots. As
> >such, I think that precise value of mmnt_kern_flag is critical there,
> >and mount interlock is needed.
>=20
> Locking for just read is almost always bogus, but here (as in most
> cases) there is also a write based on the contents of the flag, and
> the lock is held across the write.
>=20
> >Practically speaking, I agree with claim that reading of m_k_f is
> >surrounded by enough locked operations that would make sure that
> >the read value is not stale. But there is no such guarantee on
> >future/non-i386 arches, isn't it ?
>=20
> I think not-very-staleness is implied by acquire/release semantics
> which are part of the API for most atomic operations.  This behaviour
> doesn't seem to be documented for mutexes, but I don't see how mutexes
> could work without it (they have to synchronize all memory accesses,
> not just the memory accessed by the lock).
>=20
> >As a side note, mount interlock scope could be reduced there.
> >
> >Index: ufs/ufs/ufs_vnops.c
> >=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> >RCS file: /usr/local/arch/ncvs/src/sys/ufs/ufs/ufs_vnops.c,v
> >retrieving revision 1.283
> >diff -u -r1.283 ufs_vnops.c
> >--- ufs/ufs/ufs_vnops.c	6 Nov 2006 13:42:09 -0000	1.283
> >+++ ufs/ufs/ufs_vnops.c	12 Dec 2006 10:18:04 -0000
> >@@ -133,19 +134,19 @@
> >{
> >	struct inode *ip;
> >	struct timespec ts;
> >-	int mnt_locked;
> >
> >	ip =3D VTOI(vp);
> >-	mnt_locked =3D 0;
> >	if ((vp->v_mount->mnt_flag & MNT_RDONLY) !=3D 0) {
> >		VI_LOCK(vp);
> >		goto out;
> >	}
> >	MNT_ILOCK(vp->v_mount);		/* For reading of mnt_kern_flags. */
> >-	mnt_locked =3D 1;
> >	VI_LOCK(vp);
> >-	if ((ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_UPDATE)) =3D=3D 0)
> >-		goto out_unl;
> >+	if ((ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_UPDATE)) =3D=3D 0) {
> >+		MNT_IUNLOCK(vp->v_mount);
> >+		VI_UNLOCK(vp);
> >+		return;
> >+	}
>=20
> The version that depends on not-very-staleness would test the flags
> without acquiring the lock(s) and return immediately in the usual case
> where none of the flags are set.  It would have to acquire the locks
> and repeat the test to make changes (and the test is already repeated
> one flag at a time).  I think this would be correct enough, but still
> inefficient and/or even messier.  The current organization is usually:
>=20
>     acquire vnode interlock in caller
>     release vnode interlock in caller to avoid messes here (inefficient)
>     call
>         acquire mount interlock
>         acquire vnode interlock
>         test the flags; goto cleanup code if none set (usual case)
>         do the work
>         release vnode interlock
>         release mount interlock
>         return
>     acquire vnode interlock (if needed)
>     release vnode interlock (if needed)
>=20
> and it might become:
>=20
>     acquire vnode interlock in caller
>     call
>         test the flags; return if none set (usual case)
>         release vnode interlock	// check that callers are aware of=20
>         this
>         acquire mount interlock
>         acquire vnode interlock
>         do the work
> 	// Assume no LOR problem for release, as below.
> 	// Otherwise need another relese+acquire of vnode interlock.
>         release mount interlock
>         return
>     release vnode interlock
>=20
> >
> >	if ((vp->v_type =3D=3D VBLK || vp->v_type =3D=3D VCHR) && !DOINGSOFTDEP=
(vp))
> >		ip->i_flag |=3D IN_LAZYMOD;
> >@@ -155,6 +156,7 @@
> >		ip->i_flag |=3D IN_MODIFIED;
> >	else if (ip->i_flag & IN_ACCESS)
> >		ip->i_flag |=3D IN_LAZYACCESS;
> >+	MNT_IUNLOCK(vp->v_mount);
> >	vfs_timestamp(&ts);
> >	if (ip->i_flag & IN_ACCESS) {
> >		DIP_SET(ip, i_atime, ts.tv_sec);
>=20
> Is there no LOR problem for release?
AFAIK, no. Release is guaranteed to success without blocking.

>=20
> As I understand it, MNT_ILOCK() is only protecting IN_ACCESS being
> converted to IN_MODIFED, so after this conversion is done the lock
> is not needed.  Is this correct?
The code shall set IN_MODIFIED flag only if MNTK_SUSPEND and
MNTK_SUSPENDED are clean.

The vfs_write_suspend() that set MNTK_SUSPEND, does VFS_SYNC() after setting
the flag. Since VFS_SYNC() tries to interlock the vnode (to test the flags),
and then locks the vnode for ffs_syncvnode() call, the loop in ffs_sync()
cannot finish before the ufs_itimes released vnode interlock. The
MNTK_SUSPENDED is set after the loop.

Hmm, may be, since vnode must be interlocked by ffs_sync() after
MNTK_SUSPENDED set, and before MNTK_SUSPEND, mount interlock is not
needed in ufs_itimes.

Tor ?

>=20
> >@@ -172,10 +174,7 @@
> >
> > out:
> >	ip->i_flag &=3D ~(IN_ACCESS | IN_CHANGE | IN_UPDATE);
> >- out_unl:
> >	VI_UNLOCK(vp);
> >-	if (mnt_locked)
> >-		MNT_IUNLOCK(vp->v_mount);
> >}
> >
> >/*
> >
>=20
> BTW, vfs.lookup_shared defaults to 0 and decides shared access for all
> operations including read, so I wonder if there are [m]any bugs
> preventing shared accesses being the default for the most important
> cases where they can be (lookup and reads).  (As you know, not acquiring
> the vnode interlock in old versions of the above was one, and since
> vfs.lookup_shared is global and not all file systems have the fix, it
> still is one.)

Kris Kennaway said that lookup_shared caused deadlocks ?

--v2Uk6McLiE8OV1El
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (FreeBSD)

iD8DBQFFfrQyC3+MBN1Mb4gRAlPCAJsHQNE3GIdPbTSmxyO75sjugSRqRQCeLkPr
TwT9hV93Txg6r3IQD4gMlb8=
=NzDM
-----END PGP SIGNATURE-----

--v2Uk6McLiE8OV1El--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20061212135251.GJ311>