From owner-freebsd-stable@FreeBSD.ORG Sat Dec 22 23:20:37 2007 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6AE3016A417; Sat, 22 Dec 2007 23:20:37 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail03.syd.optusnet.com.au (mail03.syd.optusnet.com.au [211.29.132.184]) by mx1.freebsd.org (Postfix) with ESMTP id 143A113C44B; Sat, 22 Dec 2007 23:20:36 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c211-30-219-213.carlnfd3.nsw.optusnet.com.au (c211-30-219-213.carlnfd3.nsw.optusnet.com.au [211.30.219.213]) by mail03.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id lBMNKVRO030159 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sun, 23 Dec 2007 10:20:33 +1100 Date: Sun, 23 Dec 2007 10:20:31 +1100 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Kostik Belousov In-Reply-To: <20071222201613.GX57756@deviant.kiev.zoral.com.ua> Message-ID: <20071223095314.G1323@delplex.bde.org> References: <20071221234347.GS25053@tnn.dglawrence.com> <20071222050743.GP57756@deviant.kiev.zoral.com.ua> <20071223032944.G48303@delplex.bde.org> <20071222201613.GX57756@deviant.kiev.zoral.com.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: "Freebsd-Net@Freebsd. Org" , freebsd-stable@freebsd.org, Bruce Evans Subject: Re: Packet loss every 30.999 seconds X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 22 Dec 2007 23:20:37 -0000 On Sat, 22 Dec 2007, Kostik Belousov wrote: > On Sun, Dec 23, 2007 at 04:08:09AM +1100, Bruce Evans wrote: >> On Sat, 22 Dec 2007, Kostik Belousov wrote: >>> Yes, rewriting the syncer is the right solution. It probably cannot be done >>> quickly enough. If the yield workaround provide mitigation for now, it >>> shall go in. >> >> I don't think rewriting the syncer just for this is the right solution. >> Rewriting the syncer so that it schedules actual i/o more efficiently >> might involve a solution. Better scheduling would probably take more >> CPU and increase the problem. > I think that we can easily predict what vnode(s) become dirty at the > places where we do vn_start_write(). This works for writes to regular files at most. There are also reads (for ffs, these set IN_ATIME unless the file system is mounted with noatime) and directory operations. By grepping for IN_CHANGE, I get 78 places in ffs alone where dirtying of the inode occurs or is scheduled to occur (ffs = /sys/ufs). The efficiency of "marking" timestamps, especially for atimes, depends on just setting a flag in normal operation and picking up coalesced settings of the flag later, often at sync time by scanning all vnodes. >> Note that MNT_VNODE_FOREACH() is used 17 times, so the yielding fix is >> needed in 17 places if it isn't done internally in MNT_VNODE_FOREACH(). >> There are 4 places in vfs and 13 places in 6 file systems: >> ... >> >> Only file systems that support writing need it (for VOP_SYNC() and for >> MNT_RELOAD), else there would be many more places. There would also >> be more places if MNT_RELOAD support were not missing for some file >> systems. > > Ok, since you talked about this first :). I already made the following > patch, but did not published it since I still did not inspected all > callers of MNT_VNODE_FOREACH() for safety of dropping mount interlock. > It shall be safe, but better to check. Also, I postponed the check > until it was reported that yielding does solve the original problem. Good. I'd still like to unobfuscate the function call. > diff --git a/sys/kern/vfs_mount.c b/sys/kern/vfs_mount.c > index 14acc5b..046af82 100644 > --- a/sys/kern/vfs_mount.c > +++ b/sys/kern/vfs_mount.c > @@ -1994,6 +1994,12 @@ __mnt_vnode_next(struct vnode **mvp, struct mount *mp) > mtx_assert(MNT_MTX(mp), MA_OWNED); > > KASSERT((*mvp)->v_mount == mp, ("marker vnode mount list mismatch")); > + if ((*mvp)->v_yield++ == 500) { > + MNT_IUNLOCK(mp); > + (*mvp)->v_yield = 0; > + uio_yield(); Another unobfuscation is to not name this uio_yield(). > + MNT_ILOCK(mp); > + } > vp = TAILQ_NEXT(*mvp, v_nmntvnodes); > while (vp != NULL && vp->v_type == VMARKER) > vp = TAILQ_NEXT(vp, v_nmntvnodes); > diff --git a/sys/sys/vnode.h b/sys/sys/vnode.h > index dc70417..6e3119b 100644 > --- a/sys/sys/vnode.h > +++ b/sys/sys/vnode.h > @@ -131,6 +131,7 @@ struct vnode { > struct socket *vu_socket; /* v unix domain net (VSOCK) */ > struct cdev *vu_cdev; /* v device (VCHR, VBLK) */ > struct fifoinfo *vu_fifoinfo; /* v fifo (VFIFO) */ > + int vu_yield; /* yield count (VMARKER) */ > } v_un; > > /* Putting the count in the union seems fragile at best. Even if nothing can access the marker vnode, you need to context-switch its old contents while using it for the count, in case its old contents is used. Vnode- printing routines might still be confused. Bruce