From owner-freebsd-current@FreeBSD.ORG  Tue Dec 12 12:49:47 2006
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
X-Original-To: freebsd-current@FreeBSD.org
Delivered-To: freebsd-current@FreeBSD.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 2126016A709;
	Tue, 12 Dec 2006 12:49:47 +0000 (UTC) (envelope-from bde@zeta.org.au)
Received: from mailout1.pacific.net.au (mailout1-3.pacific.net.au [61.8.2.210])
	by mx1.FreeBSD.org (Postfix) with ESMTP id E06FD43CC5;
	Tue, 12 Dec 2006 12:48:23 +0000 (GMT) (envelope-from bde@zeta.org.au)
Received: from mailproxy2.pacific.net.au (mailproxy2.pacific.net.au
	[61.8.2.163])
	by mailout1.pacific.net.au (Postfix) with ESMTP id 1373B5A3EFF;
	Tue, 12 Dec 2006 23:49:45 +1100 (EST)
Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246])
	by mailproxy2.pacific.net.au (Postfix) with ESMTP id 06F1727408;
	Tue, 12 Dec 2006 23:49:42 +1100 (EST)
Date: Tue, 12 Dec 2006 23:49:42 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender: bde@delplex.bde.org
To: Kostik Belousov <kostikbel@gmail.com>
In-Reply-To: <20061212101903.GF311@deviant.kiev.zoral.com.ua>
Message-ID: <20061212220705.F57430@delplex.bde.org>
References: <456950AF.3090308@sh.cvut.cz>
	<20061127092146.GA69556@deviant.kiev.zoral.com.ua>
	<457E6C06.1020405@FreeBSD.org>
	<20061212101903.GF311@deviant.kiev.zoral.com.ua>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Mailman-Approved-At: Tue, 12 Dec 2006 13:11:19 +0000
Cc: V??clav Haisman <V.Haisman@sh.cvut.cz>, freebsd-stable@FreeBSD.org,
	Suleiman Souhlal <ssouhlal@FreeBSD.org>,
	freebsd-current@FreeBSD.org, bde@FreeBSD.org, tegge@FreeBSD.org
Subject: Re: kqueue LOR
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 12 Dec 2006 12:49:47 -0000

On Tue, 12 Dec 2006, Kostik Belousov wrote:

> On Tue, Dec 12, 2006 at 12:44:54AM -0800, Suleiman Souhlal wrote:

>> Is the mount lock really required, if all we're doing is a single read of a
>> single word (mnt_kern_flags) (v_mount should be read-only for the whole
>> lifetime of the vnode, I believe)? After all, reads of a single word are
>> atomic on all our supported architectures.
>> The only situation I see where there MIGHT be problems are forced unmounts,
>> but I think there are bigger issues with those.
>> Sorry for noticing this email only now.
>
> The problem is real with snapshotting. Ignoring
> MNTK_SUSPEND/MNTK_SUSPENDED flags (in particular, reading stale value of
> mnt_kern_flag) while setting IN_MODIFIED caused deadlock at ufs vnode
> inactivation time. This was the big trouble with nfsd and snapshots. As
> such, I think that precise value of mmnt_kern_flag is critical there,
> and mount interlock is needed.

Locking for just read is almost always bogus, but here (as in most
cases) there is also a write based on the contents of the flag, and
the lock is held across the write.

> Practically speaking, I agree with claim that reading of m_k_f is
> surrounded by enough locked operations that would make sure that
> the read value is not stale. But there is no such guarantee on
> future/non-i386 arches, isn't it ?

I think not-very-staleness is implied by acquire/release semantics
which are part of the API for most atomic operations.  This behaviour
doesn't seem to be documented for mutexes, but I don't see how mutexes
could work without it (they have to synchronize all memory accesses,
not just the memory accessed by the lock).

> As a side note, mount interlock scope could be reduced there.
>
> Index: ufs/ufs/ufs_vnops.c
> ===================================================================
> RCS file: /usr/local/arch/ncvs/src/sys/ufs/ufs/ufs_vnops.c,v
> retrieving revision 1.283
> diff -u -r1.283 ufs_vnops.c
> --- ufs/ufs/ufs_vnops.c	6 Nov 2006 13:42:09 -0000	1.283
> +++ ufs/ufs/ufs_vnops.c	12 Dec 2006 10:18:04 -0000
> @@ -133,19 +134,19 @@
> {
> 	struct inode *ip;
> 	struct timespec ts;
> -	int mnt_locked;
>
> 	ip = VTOI(vp);
> -	mnt_locked = 0;
> 	if ((vp->v_mount->mnt_flag & MNT_RDONLY) != 0) {
> 		VI_LOCK(vp);
> 		goto out;
> 	}
> 	MNT_ILOCK(vp->v_mount);		/* For reading of mnt_kern_flags. */
> -	mnt_locked = 1;
> 	VI_LOCK(vp);
> -	if ((ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_UPDATE)) == 0)
> -		goto out_unl;
> +	if ((ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_UPDATE)) == 0) {
> +		MNT_IUNLOCK(vp->v_mount);
> +		VI_UNLOCK(vp);
> +		return;
> +	}

The version that depends on not-very-staleness would test the flags
without acquiring the lock(s) and return immediately in the usual case
where none of the flags are set.  It would have to acquire the locks
and repeat the test to make changes (and the test is already repeated
one flag at a time).  I think this would be correct enough, but still
inefficient and/or even messier.  The current organization is usually:

     acquire vnode interlock in caller
     release vnode interlock in caller to avoid messes here (inefficient)
     call
         acquire mount interlock
         acquire vnode interlock
         test the flags; goto cleanup code if none set (usual case)
         do the work
         release vnode interlock
         release mount interlock
         return
     acquire vnode interlock (if needed)
     release vnode interlock (if needed)

and it might become:

     acquire vnode interlock in caller
     call
         test the flags; return if none set (usual case)
         release vnode interlock	// check that callers are aware of this
         acquire mount interlock
         acquire vnode interlock
         do the work
 	// Assume no LOR problem for release, as below.
 	// Otherwise need another relese+acquire of vnode interlock.
         release mount interlock
         return
     release vnode interlock

>
> 	if ((vp->v_type == VBLK || vp->v_type == VCHR) && !DOINGSOFTDEP(vp))
> 		ip->i_flag |= IN_LAZYMOD;
> @@ -155,6 +156,7 @@
> 		ip->i_flag |= IN_MODIFIED;
> 	else if (ip->i_flag & IN_ACCESS)
> 		ip->i_flag |= IN_LAZYACCESS;
> +	MNT_IUNLOCK(vp->v_mount);
> 	vfs_timestamp(&ts);
> 	if (ip->i_flag & IN_ACCESS) {
> 		DIP_SET(ip, i_atime, ts.tv_sec);

Is there no LOR problem for release?

As I understand it, MNT_ILOCK() is only protecting IN_ACCESS being
converted to IN_MODIFED, so after this conversion is done the lock
is not needed.  Is this correct?

> @@ -172,10 +174,7 @@
>
>  out:
> 	ip->i_flag &= ~(IN_ACCESS | IN_CHANGE | IN_UPDATE);
> - out_unl:
> 	VI_UNLOCK(vp);
> -	if (mnt_locked)
> -		MNT_IUNLOCK(vp->v_mount);
> }
>
> /*
>

BTW, vfs.lookup_shared defaults to 0 and decides shared access for all
operations including read, so I wonder if there are [m]any bugs
preventing shared accesses being the default for the most important
cases where they can be (lookup and reads).  (As you know, not acquiring
the vnode interlock in old versions of the above was one, and since
vfs.lookup_shared is global and not all file systems have the fix, it
still is one.)

Bruce