From: Bruce Evans <brde@optusnet.com.au>
Date: Thu, 3 Jul 2008 17:07:23 +1000 (EST)
To: Paul
Cc: FreeBSD Net <freebsd-net@FreeBSD.org>, Ingo Flaschberger
Subject: Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]

On Wed, 2 Jul 2008, Paul wrote:

> ...
> -----------Reboot with 4096/4096........(my guess is that it will be a lot
> worse, more errors..)
> ........
> Without polling, 4096 is horrible, about 200kpps less ... :/
> Turning on polling..
> polling on, 4096 is bad,
>             input          (em0)           output
>    packets  errs      bytes    packets  errs      bytes colls
>     622379 307753   38587506          1     0        178     0
>     635689 277303   39412718          1     0        178     0
> ...
> ------Rebooting with 256/256 descriptors..........
> ..........
> No polling:
>     843762  25337   52313248          1     0        178     0
>     763555      0   47340414          1     0        178     0
>     830189      0   51471722          1     0        178     0
>     838724      0   52000892          1     0        178     0
>     813594    939   50442832          1     0        178     0
>     807303    763   50052790          1     0        178     0
>     791024      0   49043492          1     0        178     0
>     768316   1106   47635596          1     0        178     0
> Machine is maxed and is unresponsive..

That's the most interesting one.  Even 1% packet loss would probably
destroy performance, so the benchmarks that give 10-50% packet loss are
uninteresting.  All indications are that you are running out of CPU and
memory (DMA and/or cache fills) throughput.  The above apparently hits
both limits at the same time, while with more descriptors memory
throughput runs out first.  1 CPU is apparently barely enough for 800 kpps
(is this all with UP now?), and I think more CPUs could only be slower, as
you saw with SMP, especially using multiple em taskqs, since memory
traffic would be higher.  I wouldn't expect this to be fixed soon (except
by throwing better/different hardware at it).
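The cache side of this is easy to put rough numbers on.  Below is a minimal
userland C sketch (illustrative only), assuming the 16-byte legacy em
descriptor format and 2 kB (MCLBYTES) mbuf clusters backing each receive
slot; both sizes are assumptions here, not taken from the driver source:

/*
 * Back-of-the-envelope working-set sizes for an em(4) receive ring, to go
 * with the 256-vs-4096 descriptor comparison above.  Assumes 16-byte
 * legacy descriptors and 2 kB mbuf clusters; adjust if your driver uses
 * different sizes.
 */
#include <stdio.h>

#define DESC_SIZE	16	/* bytes per legacy em descriptor (assumed) */
#define CLUSTER_SIZE	2048	/* bytes per mbuf cluster per rx slot (assumed) */

static void
ring_footprint(int ndesc)
{
	long desc_bytes = (long)ndesc * DESC_SIZE;
	long cluster_bytes = (long)ndesc * CLUSTER_SIZE;

	printf("%5d descriptors: %5ld kB of descriptors, %6ld kB of clusters\n",
	    ndesc, desc_bytes / 1024, cluster_bytes / 1024);
}

int
main(void)
{
	ring_footprint(256);	/* prints:  4 kB + 512 kB  */
	ring_footprint(4096);	/* prints: 64 kB + 8192 kB */
	return (0);
}

At 256 descriptors the ring plus its clusters is about half a megabyte and
can plausibly stay resident in a multi-megabyte L2 cache; at 4096 it is
over 8 MB, so every packet tends to be a fresh set of cache misses on top
of the DMA traffic.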
The CPU/DMA balance can probably be investigated by slowing down the
CPU/memory system.  You may remember my previous mail about getting higher
pps on bge.  Again, all indications are that I'm running out of CPU,
memory and bus throughput too, since the bus is only PCI 33MHz.  These
interact in a complicated way which I haven't been able to untangle.
-current is fairly consistently slower than my ~5.2 by about 10%,
apparently due to code bloat (extra CPU and related extra cache misses).
OTOH, like you I've seen huge variations for changes that should be null
(e.g., disturbing the alignment of the text section without changing
anything else).  My ~5.2 is very consistent since I rarely change it,
while -current changes a lot and shows more variation, but with no sign of
getting near the ~5.2 plateau or even its old peaks.

> Polling ON:
>             input          (em0)           output
>    packets  errs      bytes    packets  errs      bytes colls
>     784138 179079   48616564          1     0        226     0
>     788815 129608   48906530          2     0        356     0
>     755555 142997   46844426          2     0        468     0
>     803670 144459   49827544          1     0        178     0
>     777649 147120   48214242          1     0        178     0
>     779539 146820   48331422          1     0        178     0
>     786201 148215   48744478          2     0        356     0
>     776013 101660   48112810          1     0        178     0
>     774239 145041   48002834          2     0        356     0
>     771774 102969   47850004          1     0        178     0
>
> Machine is responsive and has 40% idle cpu.. Why ALWAYS 40%?  I'm really
> mystified by this..

Is this with hz=2000 and 256/256 and no polling in idle?  40% is easy to
explain (perhaps incorrectly).  Polling can then read at most 256
descriptors every 1/2000 second, giving a max throughput of 512 kpps.
Packets < descriptors in general, but they might be equal here (for small
packets).  You seem to actually get 784 kpps, which is too high even in
descriptors, but it matches the limit fairly well if the errors are
counted twice (packets minus twice the errors averages out near 512 over
your samples).  CPU is getting short too, but 40% still happens to be left
over after giving up at 512 kpps.  Most of the errors are probably handled
by the hardware at low cost in CPU, by dropping packets.  There are other
types of errors, but none except dropped packets is likely.

> Every time it maxes out and gets errors, top reports:
> CPU: 0.0% user, 0.0% nice, 10.1% system, 45.3% interrupt, 44.6% idle
> pretty much the same line every time
>
> 256/256 blows away 4096, probably fits the descriptors into the cache
> lines on the cpu, and 4096 has too many cache misses and causes worse
> performance.

Quite likely.  Maybe your systems have memory systems that are weak
relative to other resources, so that they hit this limit sooner than
expected.  I should look at my "fixes" for bge, one that changes rxd from
256 to 512, and one that increases the ifq tx length from txd = 512 to
about 20000.  Both of these might thrash caches.  The former makes little
difference except for polling at < 4000 Hz, but I don't believe in or use
polling.  The latter works around select() for write descriptors not
working on sockets, so that high frequency polling from userland is not
needed to determine a good time to retry after ENOBUFS errors.  This is
probably only important in pps benchmarks.  txd = 512 gives good
efficiency in my version of bge, but might be too high for good throughput
and is mostly wasted in distribution versions of FreeBSD.

Bruce
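A quick numeric check of the 512 kpps polling ceiling and the
double-counting guess above; the samples are the quoted "Polling ON"
netstat figures, and reading the error count as also being folded into the
packets column is an assumption that follows the reasoning above, not a
statement about how the driver actually counts:

/*
 * Sanity-check the polling arithmetic: with hz = 2000 and 256 rx
 * descriptors, each poll can drain at most 256 packets, so the ceiling is
 * 256 * 2000 = 512 kpps.  If the errors are counted twice (folded into
 * the packets counter as well), the observed rates should sit near that
 * ceiling once 2 * errs is subtracted.
 */
#include <stdio.h>

int
main(void)
{
	/* packets and errs per second from the quoted "Polling ON" samples */
	static const long packets[] = { 784138, 788815, 755555, 803670,
	    777649, 779539, 786201, 776013, 774239, 771774 };
	static const long errs[] = { 179079, 129608, 142997, 144459,
	    147120, 146820, 148215, 101660, 145041, 102969 };
	const int n = sizeof(packets) / sizeof(packets[0]);
	const long hz = 2000, rxd = 256;
	long sum = 0;
	int i;

	printf("polling ceiling: %ld descriptors * %ld Hz = %ld pps\n",
	    rxd, hz, rxd * hz);
	for (i = 0; i < n; i++)
		sum += packets[i] - 2 * errs[i];
	printf("mean of (packets - 2 * errs) over %d samples: %ld pps\n",
	    n, sum / n);
	return (0);
}

The mean comes out around 502 kpps, within a couple of percent of the
256 * 2000 = 512 kpps ceiling.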