From owner-freebsd-stable@FreeBSD.ORG Sat Dec 22 17:08:15 2007
Date: Sun, 23 Dec 2007 04:08:09 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Kostik Belousov
Cc: freebsd-net@FreeBSD.org, freebsd-stable@FreeBSD.org
In-Reply-To: <20071222050743.GP57756@deviant.kiev.zoral.com.ua>
Message-ID: <20071223032944.G48303@delplex.bde.org>
References: <20071221234347.GS25053@tnn.dglawrence.com>
    <20071222050743.GP57756@deviant.kiev.zoral.com.ua>
Subject: Re: Packet loss every 30.999 seconds

On Sat, 22 Dec 2007, Kostik Belousov wrote:

> On Fri, Dec 21, 2007 at 05:43:09PM -0800, David Schwartz wrote:
>>
>> I'm just an observer, and I may be confused, but it seems to me that
>> this is motion in the wrong direction (at least, it's not going to
>> fix the actual problem).
>> As I understand the problem, once you reach a certain point, the
>> system slows down *every* 30.999 seconds. Now, it's possible for the
>> code to cause one slowdown as it cleans up, but why does it need to
>> clean up so much 31 seconds later?

It is just searching for things to clean up, and doing this pessimally
due to unnecessary cache misses and (more recently) the introduction of
overheads for handling the case where the mount point is locked into
the fast path where the mount point is not locked.  The search every 30
seconds or so is probably more efficient, and is certainly simpler,
than managing the list on every change to every vnode for every file
system.  However, it gives a high latency in non-preemptible kernels.

>> Why not find/fix the actual bug? Then work on getting the yield
>> right if it turns out there's an actual problem for it to fix.

Yielding is probably the correct fix for non-preemptible kernels.  Some
operations just take a long time, but are low priority, so they can be
preempted.  This operation is partly under user control, since any user
can call sync(2) and thus generate the latency every 31 seconds.  But
this is no worse than a user generating even larger blocks of latency
by reading huge amounts from /dev/zero.  My old latency workaround for
the latter (and other huge i/o's) is still sort of necessary, though it
now works bogusly (hogticks doesn't work since it is reset on context
switches to interrupt handlers; however, any context switch mostly
fixes the problem).  My old latency workaround only reduces the latency
to a multiple of 1/HZ, with a default of 200 ms, so it is still
supposed to allow latencies much larger than the ones that cause
problems here, but its bogus current operation tends to give latencies
of more like 1/HZ, which is short enough when HZ has its default
misconfiguration of 1000.
I still don't understand the original problem, that the kernel is not
even preemptible enough for network interrupts to work (except in 5.2
where Giant breaks things).  Perhaps I misread the problem, and it is
actually that networking works but userland is unable to run in time to
avoid packet loss.

>> If the problem is that too much work is being done at a stretch and
>> it turns out this is because work is being done erroneously or
>> needlessly, fixing that should solve the whole problem. Doing the
>> work that doesn't need to be done more slowly is at best an ugly
>> workaround.

Lots of necessary work is being done.

> Yes, rewriting the syncer is the right solution. It probably cannot
> be done quickly enough. If the yield workaround provide mitigation
> for now, it shall go in.

I don't think rewriting the syncer just for this is the right solution.
Rewriting the syncer so that it schedules actual i/o more efficiently
might involve a solution.  Better scheduling would probably take more
CPU and increase the problem.  Note that MNT_VNODE_FOREACH() is used 17
times, so the yielding fix is needed in 17 places if it isn't done
internally in MNT_VNODE_FOREACH().
There are 4 places in vfs and 13 places in 6 file systems:

% ./ufs/ffs/ffs_snapshot.c:	MNT_VNODE_FOREACH(xvp, mp, mvp) {
% ./ufs/ffs/ffs_snapshot.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ffs/ffs_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ffs/ffs_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ufs/ufs_quota.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ufs/ufs_quota.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ufs/ufs_quota.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./fs/msdosfs/msdosfs_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, nvp) {
% ./fs/coda/coda_subr.c:	MNT_VNODE_FOREACH(vp, mp, nvp) {
% ./gnu/fs/ext2fs/ext2_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./gnu/fs/ext2fs/ext2_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./kern/vfs_default.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./kern/vfs_subr.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./kern/vfs_subr.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./nfs4client/nfs4_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./nfsclient/nfs_subs.c:	MNT_VNODE_FOREACH(vp, mp, nvp) {
% ./nfsclient/nfs_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {

Only file systems that support writing need it (for VOP_SYNC() and for
MNT_RELOAD), else there would be many more places.  There would also be
more places if MNT_RELOAD support were not missing for some file
systems.

Bruce