From: Mark Fullmer (maf@eng.oar.net)
Date: Tue, 18 Dec 2007 16:49:14 -0500
To: David G Lawrence
Cc: freebsd-net@freebsd.org, freebsd-stable@freebsd.org
Subject: Re: Packet loss every 30.999 seconds
In-Reply-To: <20071217102433.GQ25053@tnn.dglawrence.com>
List-Id: Networking and TCP/IP with FreeBSD

A little progress.

I have a machine with a KTR-enabled kernel running.

Another machine is running David's ffs_vfsops.c patch.

I left two other machines (GENERIC kernels) running the packet loss
test overnight. At ~32480 seconds of uptime the problem starts. This
is really close to a 16-bit overflow... See
http://www.eng.oar.net/~maf/bsd6/p1.png and
http://www.eng.oar.net/~maf/bsd6/p2.png. The missing impulses at 31
second marks are the intervals between test runs.
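
To put a number on "really close": a minimal sketch of the arithmetic, assuming the quantity that wraps is a signed 16-bit count of seconds. That is only a guess at this point, and the function name is made up for illustration.

```c
#include <stdint.h>

/*
 * Seconds left before a signed 16-bit seconds counter would wrap,
 * i.e. the distance from uptime_s to 2^15 = 32768.
 * ASSUMPTION: the suspected counter holds whole seconds in an int16_t.
 */
int seconds_to_int16_wrap(int32_t uptime_s)
{
    return ((int32_t)INT16_MAX + 1) - uptime_s;
}
```

For the observed onset, seconds_to_int16_wrap(32480) gives 288, so the problem starts a little under five minutes before such a counter would overflow.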

The window of missing packets (timestamps between two packets where a
sequence number is missing) is usually less than 4us, although I'm not
sure gettimeofday() can be trusted for measuring this. See
https://www.eng.oar.net/~maf/bsd6/p3.png

Things I'll try tonight:

 o check on the patched kernel

 o Try KTR debugging enabled before and after an expected high
   latency period.

 o Dump all files to /dev/null to trigger the behavior. I would
   expect the vnode problem to look a little different on the packet
   loss graphs over time. If this leads anywhere I'll add a counter
   before the msleep() and see how often it's getting there.

On Dec 17, 2007, at 5:24 AM, David G Lawrence wrote:

>    I noticed this as well some time ago. The problem has to do with the
> processing (syncing) of vnodes. When the total number of allocated vnodes
> in the system grows to tens of thousands, the ~31 second periodic sync
> process takes a long time to run. Try this patch and let people know if
> it helps your problem. It will periodically wait for one tick (1ms) every
> 500 vnodes of processing, which will allow other things to run.
>
> Index: ufs/ffs/ffs_vfsops.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/ufs/ffs/ffs_vfsops.c,v
> retrieving revision 1.290.2.16
> diff -c -r1.290.2.16 ffs_vfsops.c
> *** ufs/ffs/ffs_vfsops.c   9 Oct 2006 19:47:17 -0000   1.290.2.16
> --- ufs/ffs/ffs_vfsops.c   25 Apr 2007 01:58:15 -0000
> ***************
> *** 1109,1114 ****
> --- 1109,1115 ----
>           int softdep_deps;
>           int softdep_accdeps;
>           struct bufobj *bo;
> +         int flushed_count = 0;
>
>           fs = ump->um_fs;
>           if (fs->fs_fmod != 0 && fs->fs_ronly != 0) {    /* XXX */
> ***************
> *** 1174,1179 ****
> --- 1175,1184 ----
>                           allerror = error;
>                   vput(vp);
>                   MNT_ILOCK(mp);
> +                 if (flushed_count++ > 500) {
> +                         flushed_count = 0;
> +                         msleep(&flushed_count, MNT_MTX(mp), PZERO, "syncw", 1);
> +                 }
>           }
>           MNT_IUNLOCK(mp);
>           /*
>
> -DG
>
> David G. Lawrence
> President
> Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
> The FreeBSD Project - http://www.freebsd.org
> Pave the road of life with opportunities.
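
For reference, the "window of missing packets" measurement mentioned above can be sketched roughly as follows. This is not the actual test program: the record layout and all names are hypothetical, and it assumes packets arrive in order with strictly increasing sequence numbers that do not wrap.

```c
#include <stddef.h>
#include <stdint.h>

/* One received test packet (hypothetical record layout). */
struct pkt_sample {
    uint32_t seq;      /* sender's sequence number */
    int64_t  rx_usec;  /* receive time in microseconds, e.g. from gettimeofday() */
};

/* One run of missing sequence numbers and the time it spans. */
struct loss_window {
    uint32_t first_missing;  /* first sequence number not seen */
    uint32_t missing_count;  /* how many consecutive numbers are missing */
    int64_t  width_usec;     /* time between the packets on either side */
};

/*
 * Walk the samples in arrival order and record every place where the
 * sequence number jumps by more than one. Returns the number of gaps
 * found; at most max_out of them are stored in out.
 * ASSUMPTION: no reordering or duplicates, otherwise the unsigned
 * subtraction would report a bogus gap.
 */
size_t find_loss_windows(const struct pkt_sample *s, size_t n,
                         struct loss_window *out, size_t max_out)
{
    size_t found = 0;

    for (size_t i = 1; i < n; i++) {
        uint32_t step = s[i].seq - s[i - 1].seq;

        if (step > 1) {
            if (found < max_out) {
                out[found].first_missing = s[i - 1].seq + 1;
                out[found].missing_count = step - 1;
                out[found].width_usec = s[i].rx_usec - s[i - 1].rx_usec;
            }
            found++;
        }
    }
    return found;
}
```

The width is only as good as the receive timestamps, so the caveat about trusting gettimeofday() at microsecond resolution applies to width_usec as well.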