From owner-freebsd-performance@FreeBSD.ORG Fri Dec 15 14:18:04 2006
Date: Sat, 16 Dec 2006 01:17:56 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Alan Amesbury
In-Reply-To: <4581D185.7020702@umn.edu>
Message-ID: <20061215232203.C3994@besplex.bde.org>
References: <4581D185.7020702@umn.edu>
Cc: freebsd-performance@freebsd.org
Subject: Re: Polling tuning and performance

On Thu, 14 Dec 2006, Alan Amesbury wrote:

> ...
> What I'm aiming for, of course, is zero packet loss.  Realizing that's
> probably impossible for this system given its load, I'm trying to do
> what I can to minimize loss.
> ...
> * PREEMPTION disabled - /sys/conf/NOTES says this helps with
>   interactivity.  I don't care about interactive performance
>   on this host.

It's needed to prevent packet loss without polling.  It probably makes
little difference with polling (if the machine is mostly handling
network traffic, and that only by polling).

> * Most importantly, HZ=1000, and DEVICE_POLLING and
>   AUTO_EOI_1 are included.  (AUTO_EOI_1 was added because
>   /sys/amd64/conf/NOTES says this can save a few microseconds
>   on some interrupts.  I'm not worried about suspend/resume, but
>   definitely want speed, so it got added.

I don't believe in POLLING or HZ=1000, but recently tested them with
bge.  I am unhappy to report that my fine-tuned interrupt handling
still loses to polling by a few percent for efficiency.  I am happy to
report that polling loses to interrupt handling by a lot for
correctness -- polling gives packet loss.  Polling also loses big for
latency, except with idle_poll and the system actually idle, when it
wins a little.

AUTO_EOI_1 has little effect unless the system gets lots of interrupts,
so with most interrupts avoided by using polling it has little effect.

> As mentioned above, this host is running FreeBSD/amd64, so there's no
> need to remove support for I586_CPU, et al; that stuff was never there
> in the first place.

AUTO_EOI_1 is also only used in non-apic mode, but non-apic mode is
very unusual for amd64, so AUTO_EOI_1 probably has no effect for you.
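(For reference, a sketch of how the options under discussion would look
in a kernel config file -- not the poster's actual config, which was
attached to the original mail:)

%%%
options 	HZ=1000 	# clock and polling frequency
options 	DEVICE_POLLING	# polling(4)
options 	AUTO_EOI_1	# cheaper EOI handling (non-apic mode only)
# PREEMPTION deliberately left out by the poster, as described above
%%%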
> As mentioned above, I've got HZ set to 1000.  Per /sys/amd64/conf/NOTES,
> I'd considered setting it to 2000, but have discovered previously that
> FreeBSD's RFC1323 support breaks.  I documented this on -hackers last year:
>
> http://lists.freebsd.org/pipermail/freebsd-hackers/2005-December/014829.html

I think there are old PRs about this.  Even 1000 is too large (?).

> Since I've not seen word on a correction for this being added to
> FreeBSD, I've limited HZ to 1000.

HZ = 100 gives interesting behaviour.  Of course, it doesn't work,
since polling depends on polling often enough.  Any particular value of
HZ can only give polling often enough for a very limited range of
systems.  1000 is apparently good for 100Mbps and not too bad for
1Gbps, provided the hardware has enough buffering, but with enough
buffering polling is not really needed.

> After reading polling(4) a couple times, I set kern.polling.burst_max to
> 1000.  The manpage says that "each interface can receive at most (HZ *
> burst_max) packets per second", and the default setting is 150, which is
> described as "adequate for 100Mbit network and HZ=1000."  I figured,
> "Hey, gigabit, how about ten times the default?" but that's prevented by
> "#define MAX_POLL_BURST_MAX 1000" in /sys/kern/kern_poll.c.

I can (easily) generate only 250 kpps on input and had to increase
kern.polling.burst_max to > 250 to avoid huge packet lossage at this
rate.  It doesn't seem to work right for output, since I can (easily)
generate 340 kpps output and got that with a burst max of only 150,
when it should have got only 150 kpps.  Output is faster at the lowest
level (but slower at higher levels), so doing larger bursts of output
might be intentional.  However, output at 340 kpps gives a system load
of 100% on the test machine (which is not very fast or SMP), no matter
how it is done (polling just makes it go 2% faster), so polling is not
doing its main job very well.  Polling's main job is to prevent network
activity from using 100% CPU.  Large values of kern.polling.burst_max
are fundamentally incompatible with polling doing this.  On my test
system, a burst max of 1000 combined with HZ = 1000 would just ask the
driver alone to use 100% of the CPU doing 1000 kpps through a single
device.  "Fortunately", the device can't go that fast, so plenty of CPU
is left.
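(To spell out the arithmetic, here is a throwaway illustration -- not
kernel code -- of the per-interface ceiling of HZ * burst_max packets
per second for the values discussed above:)

%%%
#include <stdio.h>

int
main(void)
{
	int hz = 1000;				/* kern.clockrate hz */
	int burst_max[] = { 150, 250, 1000 };	/* kern.polling.burst_max */

	/* The polling cap per interface is HZ * burst_max packets/s. */
	for (int i = 0; i < 3; i++)
		printf("HZ=%d, burst_max=%d -> at most %d kpps\n",
		    hz, burst_max[i], hz * burst_max[i] / 1000);
	return (0);
}
%%%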
> In theory that might've been good enough, but polling(4) says that
> kern.polling.burst is "[the] [m]aximum number of packets grabbed from
> each network interface in each timer tick.  This number is dynamically
> adjusted by the kernel, according to the programmed user_frac,
> burst_max, CPU speed, and system load."  I keep seeing
> kern.polling.burst hit a thousand, which leads me to believe that
> kern.polling.burst_max needs to be higher.
>
> For example:
>
> secs since
> epoch      kern.polling.burst
> ---------- ------------------
> 1166133997               1000
> ...

Is it really dynamic?  I see 1000's too, but for sending at only 340
kpps.  Almost all bursts should have size 340.  With a max of 150,
burst is 150 too, but 340 kpps are still sent.

> Unfortunately, that appears to be only possible through a) patching
> /sys/kern/kern_poll.c to allow larger values; or b) setting HZ to 2000,
> as indicated in one of the NOTES, which will effectively hose certain
> TCP connectivity because of the RFC1323 breakage.  Looked at another
> way, both essentially require changes to source code, the former being
> fairly obvious, and the latter requiring fixes to the RFC1323 support.
> Either way, I think that's a bit beyond my abilities; I have NO
> illusions about my kernel h4cking sk1llz.

There may be a fix in an old PR.

> Other possibly relevant data points:
>
> * System load hovers right around 1.

Polling in idle eats all the CPU.  Polling in idle is very wasteful
(mainly of power) unless the system can rarely be idle anyway, but then
polling in idle doesn't help much.

> * The system has almost zero disk activity.
>
> * With polling off:
>
>   - 'vmstat 5' consistently shows about 13K context switches
>     and ~6800 interrupts
>   - 'vmstat -i' shows 2K interrupts per CPU, consistently 6286
>     for bge1, and near zero for everything else
>   - CPU load drops to 0.4-0.8, but CPU idle time sits around 80%

These are only small interrupt loads.  bge always generates about 6667
interrupts per second (under all loads except none or tiny) because it
is programmed to use interrupt moderation with a timeout of 150 us and
some finer details.  This gives behaviour very similar to polling at a
frequency of 6667 Hz.  The main differences between this and polling at
1000 Hz are:

- 6667 Hz works better for correctness (lower latency, fewer dropped
  packets for missed polls)
- 6667 Hz has higher overheads (only a few percent)
- interrupts have lower overheads if nothing is happening, so you don't
  actually get them at 6667 Hz
- the polling given by interrupt moderation is dumb.  It doesn't have
  any of the burst max controls, etc. (but could easily).  It doesn't
  interact with other devices (but could uneasily).

bge can easily be reprogrammed to use interrupt moderation with a
timeout of 1000 us, so interrupt mode works more like polling at
1000 Hz.  This immediately gives the main disadvantage of polling
(latency of 1000 us unless polling in idle and the system is actually
idle at least once every 1000 us).  bge has internal (buffering) limits
which have similar effects to the burst limit.  The advantages of
polling are not easily gained in this way (especially for rx).
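(Again just an illustration, not bge driver code: the effective
"polling" rate given by interrupt moderation is simply the reciprocal
of the coalescing timeout:)

%%%
#include <stdio.h>

int
main(void)
{
	int timeout_us[] = { 150, 1000 };	/* coalescing timeouts */

	/*
	 * 150 us gives about 6667 interrupts/s; 1000 us gives 1000/s,
	 * i.e. roughly the same event rate as polling with HZ = 1000.
	 */
	for (int i = 0; i < 2; i++)
		printf("timeout %4d us -> about %.0f interrupts/s\n",
		    timeout_us[i], 1e6 / timeout_us[i]);
	return (0);
}
%%%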
> * With polling on, kern.polling.burst_max=150:
>
>   - kern.polling.burst holds at 150
>   - 'vmstat 5' shows context switches hold around 2600, with
>     interrupts holding around 30K

I think you mean `systat -vmstat 5'.  The interrupt count here is
bogus.  It is mostly for software interrupts that mostly don't do much
because they coalesce with old ones.  Only ones that cause context
switches are relevant, and there is no counter for those.  Most of the
context switches are to the poll routine (1000 there and 1000 back).

>   - 'vmstat -i' shows bge1 interrupt rate of 6286 (but total
>     doesn't increase!), other rates stay the same (looks like
>     possible display bugs in 'vmstat -i' here!)

Probably just averaging.

>   - CPU load holds at 1, but CPU idle time usually stays >95%

I saw heavy polling reduce the idle time significantly here.  I think
the CPU idle time can be very biased here under light loads.  The times
shown by top(1) are unbiased.

> * With polling on, kern.polling.burst_max=1000:
>
>   - kern.polling.burst is frequently 1000 and almost always >850
>   - 'vmstat 5' shows context switches unchanged, but interrupts
>     are 150K-190K
>   - 'vmstat -i' unchanged from burst_max=150
>   - CPU load and CPU idle time very similar to burst_max=150
>
> So, with all that in mind..... Any ideas for improvement?  Apologies in
> advance for missing the obvious.  'dmesg' and kernel config are attached.

Sorry, no ideas about tuning polling parameters (I don't know them well
since I don't believe in polling :-).  You apparently have everything
tuned almost as well as possible, and the only possibilities for future
improvements are avoiding the 5% (?) extra overhead for !polling and
the packet loss for polling.

I see the following packet loss for polling with HZ=1000,
burst_max=300, idle_poll=1:

%%%
            input         (bge0)           output
   packets  errs      bytes    packets  errs      bytes colls
    242999     1   14579940          0     0          0     0
    235496     0   14129760          0     0          0     0
    236930  3261   14215800          0     0          0     0
    237816  3400   14268960          0     0          0     0
    240418  3211   14425080          0     0          0     0
%%%

The packet losses of 3+K always occur when I hit Caps Lock.  This also
happens without polling unless PREEMPTION is configured.  It is caused
by low-quality code for setting the LED for Caps Lock, combined with
thread priorities and/or their scheduling not working right.  In the
interrupt-driven case, the thread priorities are correct (bgeintr >
syscons) and configuring PREEMPTION fixes the scheduling.  In the
polling case, the thread priorities are apparently incorrect.  Polling
probably needs to have its own thread running at the same priority as
bgeintr (> syscons), but I think it mainly uses the network SWI thread
(< syscons).  With idle_poll=1, it also uses its idlepoll thread, but
that has very low priority so it cannot help in cases like this.

The code for setting LEDs busy-waits for several ms, which is several
polling periods.  It must be about 13 ms to lose 3200 packets when
packets are arriving at 240 kpps.  With a network server you won't be
hitting Caps Lock a lot, but you have to worry about other low-quality
interrupt handlers busy-waiting for several ms.

The loss of a single packet in the above happens more often than I can
explain:
- with polling, it happens a lot
- without polling but with PREEMPTION, it happens a lot when I press
  Caps Lock but not otherwise.

The problem might not be packet loss.  bge has separate statistics for
packet loss, but the net layer counts all input errors together.

Bruce