From: Paul
Date: Thu, 03 Jul 2008 08:50:49 -0400
To: Bruce Evans
Cc: FreeBSD Net, Ingo Flaschberger
Subject: Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Message-ID: <486CCB29.3080308@gtcomm.net>
In-Reply-To: <20080703195521.O6973@delplex.bde.org>

Bruce Evans wrote:
> On Thu, 3 Jul 2008, Paul wrote:
>
>> Bruce Evans wrote:
>>>> No polling:
>>>>     843762  25337  52313248          1     0       178     0
>>>>     763555      0  47340414          1     0       178     0
>>>>     830189      0  51471722          1     0       178     0
>>>>     838724      0  52000892          1     0       178     0
>>>>     813594    939  50442832          1     0       178     0
>>>>     807303    763  50052790          1     0       178     0
>>>>     791024      0  49043492          1     0       178     0
>>>>     768316   1106  47635596          1     0       178     0
>>>> Machine is maxed and is unresponsive..
>>>
>>> That's the most interesting one. Even 1% packet loss would probably
>>> destroy performance, so the benchmarks that give 10-50% packet loss
>>> are uninteresting.
>>>
>> But you realize that it's outputting all of these packets on em3 and
>> I'm watching them coming out
>> and they are consistent with the packets received on em0 that netstat
>> shows are 'good' packets.
>
> Well, output is easier. I don't remember seeing the load on a taskq for
> em3. If there is a memory bottleneck, it might or might not be more
> related to running only 1 taskq per interrupt, depending on how
> independent the memory system is for different CPUs. I think Opterons
> have more independence here than most x86's.
>
Opterons have an on-CPU memory controller. That should help a little. :P
But I must be getting more than 1 packet per descriptor, because I can do
HZ=100 and still get it without polling.
Idle polling helps in all cases of polling that I have tested it with,
seemingly more so on 32-bit.
>> I'm using a server opteron which supposedly has the best memory
>> performance out of any CPU right now.
>> Plus opterons have the biggest l1 cache, but small l2 cache. Do you
>> think larger l2 cache on the Xeon (6mb for 2 core) would be better?
>> I have a 2222 opteron coming which is 1ghz faster so we will see what
>> happens
>
> I suspect lower latency memory would help more. Big memory systems
> have inherently higher latency. My little old A64 workstation and
> laptop have main memory latencies 3 times smaller than freebsd.org's
> new Core2 servers according to lmbench2 (42 nsec for the overclocked
> DDR PC3200 one and 55 for the DDR2 PC5400 (?) one, vs 145-155 nsec).
> If there are a lot of cache misses, then the extra 100 nsec can be
> important. Profiling of sendto() using hwpmc or perfmon shows a
> significant number of cache misses per packet (2 or 10?).
>
The Opterons are 667 MHz DDR2 [registered]. I have a Xeon that is DDR3, but
I think the latency is higher than DDR2. I'll look up those programs you
mentioned and see if I can run some tests.

>>>> Polling ON:
>>>>             input          (em0)           output
>>>>    packets   errs     bytes    packets  errs      bytes colls
>>>>     784138 179079  48616564          1     0        226     0
>>>>     788815 129608  48906530          2     0        356     0
>>>> Machine is responsive and has 40% idle cpu.. Why ALWAYS 40% ? I'm
>>>> really mystified by this..
>>>
>>> Is this with hz=2000 and 256/256 and no polling in idle? 40% is easy
>>> to explain (perhaps incorrectly). Polling can then read at most 256
>>> descriptors every 1/2000 second, giving a max throughput of 512 kpps.
>>> Packets < descriptors in general but might be equal here (for small
>>> packets). You seem to actually get 784 kpps, which is too high even
>>> in descriptors unless, but matches exactly if the errors are counted
>>> twice (784 - 179 - 505 ~= 512).
>>> CPU is getting short too, but 40%
>>> still happens to be left over after giving up at 512 kpps. Most of
>>> the errors are probably handled by the hardware at low cost in CPU by
>>> dropping packets. There are other types of errors but none except
>>> dropped packets is likely.
>>>
>> Read above, it's actually transmitting 770kpps out of em3 so it can't
>> just be 512kpps.
>
> Transmitting is easier, but with polling it's even harder to send
> faster than hz * queue_length than to receive. This is without
> polling in idle.
>
What I'm saying, though, is that it's not giving up at 512 kpps: 784 kpps
is coming in on em0 and going out on em3, so obviously it's reading more
than 256 packets every 1/2000th of a second.
What would be the best (theoretical) settings for 1 Mpps processing?
I actually don't have a problem receiving more than 800 kpps, with much
lower CPU usage, if the traffic is going to a blackhole. So obviously it
can receive a lot more, maybe even line-rate pps, but I can't generate
that much.

>> I was thinking of trying 4 or 5.. but how would that work with this
>> new hardware?
>
> Poorly, except possibly with polling in FreeBSD-4. FreeBSD-4 generally
> has lower overheads and latency, but is missing important improvements
> (mainly tcp optimizations in upper layers, better DMA and/or mbuf
> handling, and support for newer NICs). FreeBSD-5 is also missing the
> overhead+latency advantage.
>
> Here are some benchmarks. (ttcp mainly tests sendto(). 4.10 em needed a
> 2-line change to support a not-so-new PCI em NIC.) Summary:
> - my bge NIC can handle about 600 kpps on my faster machine, but only
>   achieves 300 in 4.10 unpatched.
> - my em NIC can handle about 400 kpps on my slower machine, except in
>   later versions it can receive at about 600 kpps.
> - only 6.x and later can achieve near wire throughput for 1500-MTU
>   packets (81 kpps vs 76 kpps). This depends on better DMA or mbuf
>   handling...
>   I now remember the details -- it is mainly better mbuf
>   handling: old versions split the 1500-MTU packets into 2 mbufs and
>   this causes 2 descriptors per packet, which causes extra software
>   overheads and even larger overheads for the hardware.
>
> %%%
> Results of benchmarks run on 23 Feb 2007:
>
> my~5.2 bge --> ~4.10 em
>                             tx                   rx
>                       kpps  load%    ips   kpps  load%    ips
> ttcp -l5 -u -t         639     98   1660   398*     77     8k
> ttcp -l5 -t            6.0    100   3960    6.0      6   5900
> ttcp -l1472 -u -t       76     27    395     76     40     8k
> ttcp -l1472 -t          51     40    11k     51     26     8k
>
> (*) Same as sender according to netstat -I, but systat -ip shows that
>     almost half aren't delivered to upper layers.
>
> my~5.2 bge --> 4.11 em
>                             tx                   rx
>                       kpps  load%    ips   kpps  load%    ips
> ttcp -l5 -u -t         635     98   1650   399*     74     8k
> ttcp -l5 -t            5.8    100   3900    5.8      6   5800
> ttcp -l1472 -u -t       76     27    395     76     32     8k
> ttcp -l1472 -t          51     40    11k     51     25     8k
>
> (*) Same as sender according to netstat -I, but systat -ip shows that
>     almost half aren't delivered to upper layers.
>
> my~5.2 bge --> my~5.2 em
>                             tx                   rx
>                       kpps  load%    ips   kpps  load%    ips
> ttcp -l5 -u -t         638     98   1660   394*   100-     8k
> ttcp -l5 -t            5.8    100   3900    5.8      9   6000
> ttcp -l1472 -u -t       76     27    395     76     46     8k
> ttcp -l1472 -t          51     40    11k     51     35     8k
>
> (*) Same as sender according to netstat -I, but systat -ip shows that
>     almost half aren't delivered to upper layers. With the em rate
>     limit on ips changed from 8k to 80k, about 95% are delivered up.
>
> my~5.2 bge --> 6.2 em
>                             tx                   rx
>                       kpps  load%    ips   kpps  load%    ips
> ttcp -l5 -u -t         637     98   1660    637   100-    15k
> ttcp -l5 -t            5.8    100   3900    5.8      8    12k
> ttcp -l1472 -u -t       76     27    395     76     36    16k
> ttcp -l1472 -t          51     40    11k     51     37    16k
>
> my~5.2 bge --> ~current em-fastintr
>                             tx                   rx
>                       kpps  load%    ips   kpps  load%    ips
> ttcp -l5 -u -t         641     98   1670    641     99     8k
> ttcp -l5 -t            5.9    100   2670    5.9      7     6k
> ttcp -l1472 -u -t       76     27    395     76     35     8k
> ttcp -l1472 -t          52     43    11k     52     30     8k
>
> ~6.2 bge --> ~current em-fastintr
>                             tx                   rx
>                       kpps  load%    ips   kpps  load%    ips
> ttcp -l5 -u -t         309     62   1600    309     64     8k
> ttcp -l5 -t            4.9    100   3000    4.9      6     7k
> ttcp -l1472 -u -t       76     27    395     76     34     8k
> ttcp -l1472 -t          54     28   6800     54     30     8k
>
> ~current bge --> ~current em-fastintr
>                             tx                   rx
>                       kpps  load%    ips   kpps  load%    ips
> ttcp -l5 -u -t         602    100   1570    602     99     8k
> ttcp -l5 -t            5.3    100   2660    5.3      5   5300
> ttcp -l1472 -u -t      81#     19    212    81#     38     8k
> ttcp -l1472 -t          53     34    11k     53     30     8k
>
> (#) Wire speed to within 0.5%. This is the only kpps in this set of
>     benchmarks that is close to wire speed. Older kernels apparently
>     lose relative to -current because mbufs for mtu-sized packets are
>     not contiguous in older kernels.
>
> Old results:
>
> ~4.10 bge --> my~5.2 em
>                             tx                   rx
>                       kpps  load%    ips   kpps  load%    ips
> ttcp -l5 -u -t         n/a    n/a    n/a    346     79     8k
> ttcp -l5 -t            n/a    n/a    n/a    5.4     10   6800
> ttcp -l1472 -u -t      n/a    n/a    n/a     67     40     8k
> ttcp -l1472 -t         n/a    n/a    n/a     51     36     8k
>
> ~4.10 kernel, =4 bge --> ~current em
>                             tx                   rx
>                       kpps  load%    ips   kpps  load%    ips
> ttcp -l5 -u -t         n/a    n/a    n/a    347     96    14k
> ttcp -l5 -t            n/a    n/a    n/a    5.8     10    14k
> ttcp -l1472 -u -t      n/a    n/a    n/a     67     62    14K
> ttcp -l1472 -t         n/a    n/a    n/a     52     40    16k
>
> ~4.10 kernel, =4+ bge --> ~current em
>                             tx                   rx
>                       kpps  load%    ips   kpps  load%    ips
> ttcp -l5 -u -t         n/a    n/a    n/a    627    100     9k
> ttcp -l5 -t            n/a    n/a    n/a    5.6      9    13k
> ttcp -l1472 -u -t      n/a    n/a    n/a     68     63    14k
> ttcp -l1472 -t         n/a    n/a    n/a     54     44    16k
> %%%
>
> %%%
> Results of benchmarks run on 28 Dec 2007:
>
> ~5.2 epsplex (em) ttcp:
>                         Csw   Trp   Sys   Int   Sof   Sys  Intr  User  Idle
> local no sink:         825k     3  206k   229  412k  52.1  45.1   2.8
> local with sink:       659k     3  263k   231  131k  66.5  27.3   6.2
> tx remote no sink:      35k     3  273k  8237  266k  42.0  52.1   2.3   3.6
> tx remote with sink:    26k     3  394k  8224   100  60.0  5.41   3.4  11.2
> rx remote no sink:      25k     4    26  8237  373k  20.6  79.4   0.0   0.0
> rx remote with sink:    30k     3  203k  8237  398k  36.5  60.7   2.8   0.0
>
> 6.3-PR besplex (em) ttcp:
>                         Csw   Trp   Sys   Int   Sof   Sys  Intr  User  Idle
> local no sink:         417k     1  208k  418k     2  49.5  48.5   2.0
> local with sink:       420k     1  276k  145k     2  70.0  23.6   6.4
> tx remote no sink:      19k     2  250k  8144     2  58.5  38.7   2.8   0.0
> tx remote with sink:    16k     2  361k  8336     2  72.9  24.0   3.1   4.4
> rx remote no sink:      429     3    49   888     2   0.3 99.33   0.0   0.4
> tx remote with sink:    13k     2  316k  5385     2  31.7  63.8   3.6   0.8
>
> 8.0-C epsplex (em-fast) ttcp:
>                         Csw   Trp   Sys   Int   Sof   Sys  Intr  User  Idle
> local no sink:         442k     3  221k   230  442k  47.2  49.6   2.7
> local with sink:       394k     3  262k   228  131k  72.1  22.6   5.3
> tx remote no sink:      17k     3  226k  7832   100  94.1   0.2   3.0   0.0
> tx remote with sink:    17k     3  360k  7962   100  91.7   0.2   3.7   4.4
> rx remote no sink:     saturated -- cannot update systat display
> rx remote with sink:    15k     6  358k  8224   100  97.0   0.0   2.5   0.5
>
> ~4.10 besplex (bge) ttcp:
>                         Csw   Trp   Sys   Int   Sof   Sys  Intr  User  Idle
> local no sink:           15     0  425k   228    11  96.3   0.0   3.7
> local with sink:         **     0  622k   229    **  94.7   0.3   5.0
> tx remote no sink:       29     1  490k  7024    11  47.9  29.8   4.4  17.9
> tx remote with sink:     26     1  635k  1883    11  65.7  11.4   5.6  17.3
> rx remote no sink:        5     1    68  7025     1   0.0  47.3   0.0  52.7
> rx remote with sink:   6679     2  365k  6899    12  19.7  29.2   2.5  48.7
>
> ~5.2-C besplex (bge) ttcp:
>                         Csw   Trp   Sys   Int   Sof   Sys  Intr  User  Idle
> local no sink:           1M     3  271k   229  543k  50.7  46.8   2.5
> local with sink:         1M     3  406k   229  203k  67.4  28.2   4.4
> tx remote no sink:      49k     3  474k   11k  167k  52.3  42.7   5.0   0.0
> tx remote with sink:   6371     3  641k  1900   100  76.0  16.8   6.2   0.9
> rx remote no sink:      34k     3    25   11k  270k   0.8  65.4   0.0  33.8
> rx remote with sink:    41k     3  365k   10k  370k  31.5  47.1   2.3  19.0
>
> 6.3-PR besplex (bge) ttcp (hz = 1000 else stathz broken):
>                         Csw   Trp   Sys   Int   Sof   Sys  Intr  User  Idle
> local no sink:         540k     0  270k  540k     0  50.5  46.0   3.5
> local with sink:       628k     0  417k  210k     0  68.8  27.9   3.3
> tx remote no sink:      15k     1  222k  7190     1  28.4  29.3   1.7  40.6
> tx remote with sink:   5947     1  315k  2825     1  39.9  14.7   2.6  42.8
> rx remote no sink:      13k     1    23  6943     0   0.3  49.5   0.2  50.0
> rx remote with sink:    20k     1  371k  6819     0  29.5  30.1   3.9  36.5
>
> 8.0-C besplex (bge) ttcp:
>                         Csw   Trp   Sys   Int   Sof   Sys  Intr  User  Idle
> local no sink:         649k     3  324k   100  649k  53.9  42.9   3.2
> local with sink:       649k     3  433k   100  216k  75.2  18.8   6.0
> tx remote no sink:      24k     3  432k   10k   100  49.7  41.3   2.4   6.6
> tx remote with sink:   3199     3  568k  1580   100  64.3  19.6   4.0  12.2
> rx remote no sink:      20k     3    27   10k   100   0.0  46.1   0.0  53.9
> rx remote with sink:    31k     3  370k   10k   100  30.7  30.9   4.8  33.5
> %%%
>
> Bruce
>
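
For reference, the two ceilings argued about above are plain arithmetic, and a
rough sketch of them follows. It is not taken from the thread. The function
names are made up for illustration. The polling bound assumes exactly one
packet per descriptor and no polling in idle, which is the model Bruce
describes; the 784 kpps observed at hz=2000 suggests the real driver does not
follow that model exactly. The line-rate figure assumes 1 Gbit/s Ethernet with
the standard 38 bytes of per-frame overhead (14 header + 4 FCS + 8 preamble +
12 inter-frame gap).

```python
import math

# Per-frame Ethernet overhead in bytes: 14 header + 4 FCS + 8 preamble
# + 12 inter-frame gap (standard values, not taken from the thread).
ETH_OVERHEAD_BYTES = 14 + 4 + 8 + 12
ETH_MIN_PAYLOAD_BYTES = 46  # shorter payloads are padded to this size


def polling_ceiling_pps(hz, descriptors_per_poll):
    """Upper bound on packets/s if each poll reads a fixed burst,
    assuming one packet per descriptor and no polling in idle."""
    return hz * descriptors_per_poll


def min_burst_for(target_pps, hz):
    """Smallest per-poll burst whose ceiling reaches target_pps."""
    return math.ceil(target_pps / hz)


def line_rate_pps(ip_packet_bytes, link_bps=1_000_000_000):
    """Theoretical frames/s on the wire for a given IP packet size."""
    payload = max(ip_packet_bytes, ETH_MIN_PAYLOAD_BYTES)
    return link_bps / ((payload + ETH_OVERHEAD_BYTES) * 8)


print(polling_ceiling_pps(2000, 256))   # 512000
print(min_burst_for(1_000_000, 2000))   # 500
print(round(line_rate_pps(1500)))       # 81274
print(round(line_rate_pps(46)))         # 1488095
```

The 1500-byte result lines up with the 81 kpps wire-speed figure in the
benchmark notes, and the last line is the usual minimum-size-frame limit for
gigabit Ethernet. The burst of 500 is only the arithmetic floor for 1 Mpps at
hz=2000 under the stated model, not a tuning recommendation.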