From owner-freebsd-stable@FreeBSD.ORG  Sat Dec 22 23:20:37 2007
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 6AE3016A417;
	Sat, 22 Dec 2007 23:20:37 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail03.syd.optusnet.com.au (mail03.syd.optusnet.com.au
	[211.29.132.184])
	by mx1.freebsd.org (Postfix) with ESMTP id 143A113C44B;
	Sat, 22 Dec 2007 23:20:36 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from c211-30-219-213.carlnfd3.nsw.optusnet.com.au
	(c211-30-219-213.carlnfd3.nsw.optusnet.com.au [211.30.219.213])
	by mail03.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	lBMNKVRO030159
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sun, 23 Dec 2007 10:20:33 +1100
Date: Sun, 23 Dec 2007 10:20:31 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@delplex.bde.org
To: Kostik Belousov <kostikbel@gmail.com>
In-Reply-To: <20071222201613.GX57756@deviant.kiev.zoral.com.ua>
Message-ID: <20071223095314.G1323@delplex.bde.org>
References: <20071221234347.GS25053@tnn.dglawrence.com>
	<MDEHLPKNGKAHNMBLJOLKMEKLJAAC.davids@webmaster.com>
	<20071222050743.GP57756@deviant.kiev.zoral.com.ua>
	<20071223032944.G48303@delplex.bde.org>
	<20071222201613.GX57756@deviant.kiev.zoral.com.ua>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: "Freebsd-Net@Freebsd. Org" <freebsd-net@freebsd.org>,
	freebsd-stable@freebsd.org, Bruce Evans <brde@optusnet.com.au>
Subject: Re: Packet loss every 30.999 seconds
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 22 Dec 2007 23:20:37 -0000

On Sat, 22 Dec 2007, Kostik Belousov wrote:

> On Sun, Dec 23, 2007 at 04:08:09AM +1100, Bruce Evans wrote:
>> On Sat, 22 Dec 2007, Kostik Belousov wrote:
>>> Yes, rewriting the syncer is the right solution. It probably cannot be done
>>> quickly enough. If the yield workaround provide mitigation for now, it
>>> shall go in.
>>
>> I don't think rewriting the syncer just for this is the right solution.
>> Rewriting the syncer so that it schedules actual i/o more efficiently
>> might involve a solution.  Better scheduling would probably take more
>> CPU and increase the problem.
> I think that we can easily predict what vnode(s) become dirty at the
> places where we do vn_start_write().

This works for writes to regular files at most.  There are also reads
(for ffs, these set IN_ATIME unless the file system is mounted with
noatime) and directory operations.  By grepping for IN_CHANGE, I get
78 places in ffs alone where dirtying of the inode occurs or is scheduled
to occur (ffs = /sys/ufs).  The efficiency of "marking" timestamps,
especially for atimes, depends on just setting a flag in normal operation
and picking up coalesced settings of the flag later, often at sync time
by scanning all vnodes.

>> Note that MNT_VNODE_FOREACH() is used 17 times, so the yielding fix is
>> needed in 17 places if it isn't done internally in MNT_VNODE_FOREACH().
>> There are 4 places in vfs and 13 places in 6 file systems:
>> ...
>>
>> Only file systems that support writing need it (for VOP_SYNC() and for
>> MNT_RELOAD), else there would be many more places.  There would also
>> be more places if MNT_RELOAD support were not missing for some file
>> systems.
>
> Ok, since you talked about this first :). I already made the following
> patch, but did not published it since I still did not inspected all
> callers of MNT_VNODE_FOREACH() for safety of dropping mount interlock.
> It shall be safe, but better to check. Also, I postponed the check
> until it was reported that yielding does solve the original problem.

Good.  I'd still like to unobfuscate the function call.

> diff --git a/sys/kern/vfs_mount.c b/sys/kern/vfs_mount.c
> index 14acc5b..046af82 100644
> --- a/sys/kern/vfs_mount.c
> +++ b/sys/kern/vfs_mount.c
> @@ -1994,6 +1994,12 @@ __mnt_vnode_next(struct vnode **mvp, struct mount *mp)
> 	mtx_assert(MNT_MTX(mp), MA_OWNED);
>
> 	KASSERT((*mvp)->v_mount == mp, ("marker vnode mount list mismatch"));
> +	if ((*mvp)->v_yield++ == 500) {
> +		MNT_IUNLOCK(mp);
> +		(*mvp)->v_yield = 0;
> +		uio_yield();

Another unobfuscation is to not name this uio_yield().

> +		MNT_ILOCK(mp);
> +	}
> 	vp = TAILQ_NEXT(*mvp, v_nmntvnodes);
> 	while (vp != NULL && vp->v_type == VMARKER)
> 		vp = TAILQ_NEXT(vp, v_nmntvnodes);
> diff --git a/sys/sys/vnode.h b/sys/sys/vnode.h
> index dc70417..6e3119b 100644
> --- a/sys/sys/vnode.h
> +++ b/sys/sys/vnode.h
> @@ -131,6 +131,7 @@ struct vnode {
> 		struct socket	*vu_socket;	/* v unix domain net (VSOCK) */
> 		struct cdev	*vu_cdev; 	/* v device (VCHR, VBLK) */
> 		struct fifoinfo	*vu_fifoinfo;	/* v fifo (VFIFO) */
> +		int		vu_yield;	/*   yield count (VMARKER) */
> 	} v_un;
>
> 	/*

Putting the count in the union seems fragile at best.  Even if nothing
can access the marker vnode, you need to context-switch its old contents
while using it for the count, in case its old contents is used.  Vnode-
printing routines might still be confused.

Bruce