Date: Tue, 18 Dec 2007 16:43:40 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: David G Lawrence <dg@dglawrence.com>
Cc: freebsd-net@freebsd.org, freebsd-stable@freebsd.org
Subject: Re: Packet loss every 30.999 seconds
Message-ID: <20071218155642.D32807@delplex.bde.org>
In-Reply-To: <20071217102433.GQ25053@tnn.dglawrence.com>
References: <D50B5BA8-5A80-4370-8F20-6B3A531C2E9B@eng.oar.net> <20071217102433.GQ25053@tnn.dglawrence.com>
On Mon, 17 Dec 2007, David G Lawrence wrote:

>> While trying to diagnose a packet loss problem in a RELENG_6 snapshot
>> dated November 8, 2007, it looks like I've stumbled across a broken
>> driver or kernel routine which stops interrupt processing long enough
>> to severely degrade network performance every 30.99 seconds.

I see the same behaviour under a heavily modified version of FreeBSD-5.2
(except that the period was 2 ms longer and the latency was 7 ms instead
of 11 ms when numvnodes was at a certain value).  Now with numvnodes =
17500, the latency is 3 ms.

> I noticed this as well some time ago.  The problem has to do with the
> processing (syncing) of vnodes.  When the total number of allocated
> vnodes in the system grows to tens of thousands, the ~31 second periodic
> sync process takes a long time to run.  Try this patch and let people
> know if it helps your problem.  It will periodically wait for one tick
> (1ms) every 500 vnodes of processing, which will allow other things to
> run.

However, the syncer should be running at a relatively low priority and
should not cause packet loss.  I don't see any packet loss even in ~5.2,
where the network stack (but not the drivers) is still Giant-locked.

Other too-high latencies showed up:

- syscons LED setting and vt switching give a latency of 5.5 msec
  because syscons still uses busy-waiting for setting LEDs :-(.  Oops, I
  do see packet loss -- this causes it under ~5.2 but not under -current.
  For the bge and/or em drivers, the packet loss shows up in netstat
  output as a few hundred errors for every LED setting on the receiving
  machine, while receiving tiny packets at the maximum possible rate of
  640 kpps.  sysctl is completely Giant-locked, and so are the upper
  layers of the network stack.  The bge hardware rx ring size is 256 in
  -current and 512 in ~5.2.  At 640 kpps, 512 packets take 800 us, so
  bge wants to call the upper layers with a latency of far below 800 us.
  I don't know exactly where the upper layers block on Giant.

- a user CPU hog process gives a latency of over 200 ms every half a
  second or so when the hog starts up, and 300-400 ms after the hog has
  been running for some time.  Two user CPU hog processes double the
  latency.  Reducing kern.sched.quantum from 100 ms to 10 ms and/or
  renicing the hogs doesn't seem to affect this.  Running the hogs at
  idle priority fixes this.  This won't affect packet loss, but it might
  affect user network processes -- they might need to run at real-time
  priority to get low enough latency.  They might need to do this
  anyway -- a scheduling quantum of 100 ms should give a latency of
  100 ms per CPU hog quite often, though not usually, since the hogs
  should never be preferred to a higher-priority process.

Previously I've used a less specialized clock-watching program to
determine the syscall latency.  It showed similar problems for CPU hogs.
I just remembered that I found the fix for these under ~5.2 -- remove a
local hack that sacrifices latency for reduced context switches between
user threads.  -current with SCHED_4BSD does this non-hackishly, but
seems to have a bug somewhere that gives a latency that is large enough
to be noticeable in interactive programs.

Bruce
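
The yield-every-N-vnodes idea in the quoted patch can be sketched roughly
as follows.  This is only an illustration, not the actual patch: it
assumes a simple walk of a mount point's vnode list, sync_one_vnode() is
a hypothetical helper, and the locking a real RELENG_6 syncer needs is
omitted.

/*
 * Illustrative only: sleep for one tick every 500 vnodes while syncing,
 * so that interrupt threads and other kernel work are not held off for
 * the whole pass over the vnode list.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/queue.h>
#include <sys/kernel.h>
#include <sys/mount.h>
#include <sys/vnode.h>

#define	SYNC_YIELD_EVERY	500	/* vnodes processed between yields */

static void
sync_vnode_list(struct mount *mp)
{
	struct vnode *vp;
	int count = 0;

	TAILQ_FOREACH(vp, &mp->mnt_nvnodelist, v_nmntvnodes) {
		sync_one_vnode(vp);		/* hypothetical helper */
		if (++count % SYNC_YIELD_EVERY == 0) {
			/* Sleep for one tick so other threads can run. */
			tsleep(&count, PPAUSE, "syncyld", 1);
		}
	}
}

Whether one tick per 500 vnodes is the right granularity depends on HZ
and on how expensive each vnode is to examine; the figure here is simply
the one given in the patch description quoted above.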
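
On the point about running CPU hogs at idle priority (or latency-
sensitive network processes at real-time priority): FreeBSD exposes this
through rtprio(2).  A minimal sketch of putting the current process at
idle priority follows; the choice of RTP_PRIO_MAX is arbitrary within
the 0..RTP_PRIO_MAX range.

/*
 * Switch the current process to idle priority via rtprio(2), as one
 * might do for a CPU hog so that it cannot add to the scheduling
 * latency of normal-priority processes.  A latency-sensitive process
 * would use RTP_PRIO_REALTIME instead.
 */
#include <sys/types.h>
#include <sys/rtprio.h>

#include <err.h>

int
main(void)
{
	struct rtprio rtp;

	rtp.type = RTP_PRIO_IDLE;	/* or RTP_PRIO_REALTIME */
	rtp.prio = RTP_PRIO_MAX;	/* lowest idle priority */
	if (rtprio(RTP_SET, 0, &rtp) != 0)	/* pid 0 == this process */
		err(1, "rtprio");

	/* ... run the CPU-intensive work here ... */
	return (0);
}

The same effect is available from the shell with the idprio(1) and
rtprio(1) utilities.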
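
A "clock-watching" latency measurement of the kind mentioned above can
be approximated by a userland loop like the one below.  This is a rough
sketch, not the program actually used; the 1 ms reporting threshold is
an arbitrary choice.

/*
 * Spin reading the clock and report any gap between successive readings
 * that exceeds a threshold.  A long gap means something held this
 * process (or the whole system) off the CPU for that long.
 */
#include <stdio.h>
#include <time.h>

#define	THRESH_US	1000		/* report gaps longer than 1 ms */

static long long
ts_to_us(const struct timespec *ts)
{
	return ((long long)ts->tv_sec * 1000000 + ts->tv_nsec / 1000);
}

int
main(void)
{
	struct timespec prev, now;
	long long gap;

	clock_gettime(CLOCK_MONOTONIC, &prev);
	for (;;) {
		clock_gettime(CLOCK_MONOTONIC, &now);
		gap = ts_to_us(&now) - ts_to_us(&prev);
		if (gap > THRESH_US)
			printf("%lld us gap\n", gap);
		prev = now;
	}
	/* NOTREACHED */
}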