Date:      Fri, 04 Jul 2008 00:54:37 -0400
From:      Paul <paul@gtcomm.net>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        FreeBSD Net <freebsd-net@freebsd.org>, Ingo Flaschberger <if@xip.at>
Subject:   Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Message-ID:  <486DAD0D.8090604@gtcomm.net>
In-Reply-To: <486D35A0.4000302@gtcomm.net>
References:  <4867420D.7090406@gtcomm.net> <20080701004346.GA3898@stlux503.dsto.defence.gov.au> <alpine.LFD.1.10.0807010257570.19444@filebunker.xip.at> <20080701010716.GF3898@stlux503.dsto.defence.gov.au> <alpine.LFD.1.10.0807010308320.19444@filebunker.xip.at> <486986D9.3000607@monkeybrains.net> <48699960.9070100@gtcomm.net> <ea7b9c170806302005n2a66f592h2127f87a0ba2c6d2@mail.gmail.com> <20080701033117.GH83626@cdnetworks.co.kr> <ea7b9c170806302050p2a3a5480t29923a4ac2d7c852@mail.gmail.com> <4869ACFC.5020205@gtcomm.net> <4869B025.9080006@gtcomm.net> <486A7E45.3030902@gtcomm.net> <486A8F24.5010000@gtcomm.net> <486A9A0E.6060308@elischer.org> <486B41D5.3060609@gtcomm.net> <alpine.LFD.1.10.0807021052041.557@filebunker.xip.at> <486B4F11.6040906@gtcomm.net> <alpine.LFD.1.10.0807021155280.557@filebunker.xip.at> <486BC7F5.5070604@gtcomm.net> <20080703160540.W6369@delplex.bde.org> <486C7F93.7010308@gtcomm.net> <20080703195521.O6973@delplex.bde.org> <486D35A0.4000302@gtcomm.net>

These numbers are maximums, measured at near 100% CPU usage with some
errors occurring; they are for testing only.
FreeBSD  7.0-STABLE FreeBSD 7.0-STABLE #6: Thu Jul  3 19:32:38 CDT 2008     root@foo:/usr/obj/usr/src/sys/ROUTER  amd64
CPU: Dual-Core AMD Opteron(tm) Processor 2222 (3015.47-MHz K8-class CPU)
Non-SMP kernel, em driver, Intel 82571EB NICs
fastforwarding on, isr.direct on, ULE, preemption (NOTE: interestingly,
without preemption I get errors similar to those with polling)

64-bit, Opteron 2222, one direction (-> em0 --> em3 ->):
  1.1 Mpps max   no routing table, no firewall
  700 kpps max   no routing table, 1 ipfw rule
  500 kpps max   no routing table, 20 ipfw rules
  750 kpps max   full BGP table (260k routes)
  400 kpps max   no routing table, 2 pf rules, no state
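
To put the peak rates listed above in perspective, here is a rough sketch
that turns each one into a per-packet CPU budget. It assumes a single core
running flat out on forwarding (the non-SMP kernel at near 100% CPU
described above) and uses the 3015.47 MHz clock from the CPU line. The
dictionary labels and the function name are just illustrative.

```python
CPU_HZ = 3015.47e6  # clock rate from the CPU line above

# Peak forwarding rates quoted above, in packets per second.
measured_pps = {
    "no firewall": 1_100_000,
    "1 ipfw rule": 700_000,
    "20 ipfw rules": 500_000,
    "full BGP table": 750_000,
    "2 pf rules, no state": 400_000,
}


def cycles_per_packet(pps, cpu_hz=CPU_HZ):
    """Cycles spent per packet if one core is fully busy forwarding."""
    return cpu_hz / pps


for label, pps in measured_pps.items():
    print(f"{label:22s} {cycles_per_packet(pps):8.0f} cycles/packet")
```

Under that assumption the plain forwarding path costs on the order of 2700
cycles per packet, and the gap between two rows is an estimate of what the
extra feature (one ipfw rule, the BGP table lookup, and so on) costs.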

Using the lagg driver in EtherChannel mode with 2 ports (em0, em1) reduces
performance by about 8%, which is strange; it shouldn't.
In SMP mode the lagg driver reduces it substantially more, even though this
is where it should increase performance greatly, because incoming
packets are load balanced over multiple NICs.. :/

A 32-bit test is coming next; then I'm going to test a high-MHz 45nm Xeon
or Core 2 Duo and post those results (using the same source tree/kernel/etc).

I tried polling, and I tried the polling patch that was posted to the
list. Both work, but both generate too many errors (missed packets).
Without polling, the packet errors ONLY occur when the CPU is near 100% usage.


Paul wrote:
> Opteron 2222 UP mode, no polling
>
>            input          (em0)           output
>   packets  errs      bytes    packets  errs      bytes colls
>   1071020     0   66403248          2     0        404     0
>   1049793     0   65087174          2     0        356     0
>   1040320     0   64499848          2     0        356     0
>   1049712     0   65082152          2     0        356     0
>   1039504     0   64449256          2     0        356     0
>    933118     0   57853324          2     0        356     0
>
> still has some cpu left and i can't generate any more packets
>
> Polling turned on provided better performance on 32 bit, but it gets 
> strange errors on 64 bit..
> Even at low pps I get small amounts of errors, and high pps same 
> thing.. you would think that if
> it got errors at low pps it would get more errors at high pps but that 
> isn't the case..
> Polling on:
> packets  errs      bytes    packets  errs      bytes colls
>    979736   963   60743636          1     0        226     0
>    991838   496   61493960          1     0        178     0
>    996125   460   61759754          1     0        178     0
>    979381   326   60721626          1     0        178     0
>   1022249   379   63379442          1     0        178     0
>    991468   557   61471020          1     0        178     0
>
> lowering pps a little.......
>           input          (em0)           output
>   packets  errs      bytes    packets  errs      bytes colls
>    818688   151   50758660          1     0        226     0
>    837920   179   51951044          1     0        178     0
>    826217   168   51225458          1     0        178     0
>    801017   100   49663058          1     0        178     0
>    761857   287   47235138          1     0        178     0
>
>
> what could cause this?
>
> If i'm going to use a uniprocessor mode system I NEED polling to work 
> because I have to have
> cpu cycles left over for userspace processes and I can't afford to 
> have it lock those out.
> SMP is no big deal if it actually worked..
>
> I'm going to do a SMP test with this cpu now with polling off/on and 
> then I'm going to apply the polling patch and try that.
>
>
>
> Bruce Evans wrote:
>> On Thu, 3 Jul 2008, Paul wrote:
>>
>>> Bruce Evans wrote:
>>>>> No polling:
>>>>>   843762 25337   52313248          1     0        178     0
>>>>>   763555     0   47340414          1     0        178     0
>>>>>   830189     0   51471722          1     0        178     0
>>>>>   838724     0   52000892          1     0        178     0
>>>>>   813594   939   50442832          1     0        178     0
>>>>>   807303   763   50052790          1     0        178     0
>>>>>   791024     0   49043492          1     0        178     0
>>>>>   768316  1106   47635596          1     0        178     0
>>>>> Machine is maxed and is unresponsive..
>>>>
>>>> That's the most interesting one.  Even 1% packet loss would probably
>>>> destroy performance, so the benchmarks that give 10-50% packet loss
>>>> are uninteresting.
>>>>
>>> But you realize that it's outputting all of these packets on em3  
>>> and I'm watching them coming out
>>> and they are consistent with the packets received on em0 that 
>>> netstat shows are 'good' packets.
>>
>> Well, output is easier.  I don't remember seeing the load on a taskq for
>> em3.  If there is a memory bottleneck, it might or might not be more
>> related to running only 1 taskq per interrupt, depending on how
>> independent the memory system is for different CPUs.  I think Opterons
>> have more independence here than most x86's.
>>
>>> I'm using a server opteron which supposedly has the best memory 
>>> performance out of any CPU right now.
>>> Plus opterons have the biggest l1 cache, but small l2 cache.  Do you 
>>> think larger l2 cache on the Xeon (6mb for 2 core) would be better?
>>> I have a 2222 opteron coming which is 1ghz faster so we will see 
>>> what happens
>>
>> I suspect lower latency memory would help more.  Big memory systems
>> have inherently higher latency.  My little old A64 workstation and
>> laptop have main memory latencies 3 times smaller than freebsd.org's
>> new Core2 servers according to lmbench2 (42 nsec for the overclocked
>> DDR PC3200 one and 55 for the DDR2 PC5400 (?) one, vs 145-155 nsec).
>> If there are a lot of cache misses, then the extra 100 nsec can be
>> important.  Profiling of sendto() using hwpmc or perfmon shows a
>> significant number of cache misses per packet (2 or 10?).
>>
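To see why an extra 100 nsec of memory latency can matter at these packet
rates, here is a back-of-the-envelope sketch. It assumes that cache misses
are fully serialized (no overlap or prefetching) and that nothing else
costs time, so the results are loose upper bounds, not predictions. The
latencies and the "2 or 10" miss counts are the figures quoted above (150
stands in for the 145-155 nsec range); the function name is illustrative.

```python
def pps_ceiling(misses_per_packet, latency_ns):
    """Upper bound on packets/s if only serialized cache misses cost time."""
    return 1e9 / (misses_per_packet * latency_ns)


for latency_ns in (42, 55, 150):
    for misses in (2, 10):
        stall_ns = misses * latency_ns
        print(f"{latency_ns:4d} ns x {misses:2d} misses = {stall_ns:5d} ns/packet"
              f" -> at most {pps_ceiling(misses, latency_ns) / 1e3:8.0f} kpps")
```

With 10 misses per packet at 150 nsec, the stall time alone caps the rate
at about 667 kpps, which is in the same range as the forwarding rates
being discussed in this thread.
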
>>>>> Polling ON:
>>>>>         input          (em0)           output
>>>>>  packets  errs      bytes    packets  errs      bytes colls
>>>>>   784138 179079   48616564          1     0        226     0
>>>>>   788815 129608   48906530          2     0        356     0
>>>>> Machine is responsive and has 40% idle cpu.. Why ALWAYS 40% ?  I'm 
>>>>> really mystified by this..
>>>>
>>>> Is this with hz=2000 and 256/256 and no polling in idle?  40% is easy
>>>> to explain (perhaps incorrectly).  Polling can then read at most 256
>>>> descriptors every 1/2000 second, giving a max throughput of 512 kpps.
>>>> Packets < descriptors in general but might be equal here (for small
>>>> packets).  You seem to actually get 784 kpps, which is too high even
>>>> in descriptors unless, but matches exactly if the errors are counted
>>>> twice (784 - 179 - 505 ~= 512).  CPU is getting short too, but 40%
>>>> still happens to be left over after giving up at 512 kpps.  Most of
>>>> the errors are probably handled by the hardware at low cost in CPU by
>>>> dropping packets.  There are other types of errors but none except
>>>> dropped packets is likely.
>>>>
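The hz * burst ceiling described above can be written out as a short
sketch. It assumes one descriptor per packet, a burst limit of 256
descriptors per poll (one reading of the "256/256" above), and no polling
in idle, so the interface is only serviced on clock ticks. The function
names are illustrative, not kernel identifiers.

```python
import math


def polling_ceiling_pps(hz, burst):
    """Most descriptors per second that tick-driven polling can read."""
    return hz * burst


def hz_needed(target_pps, burst):
    """Smallest tick rate whose ceiling reaches target_pps at this burst."""
    return math.ceil(target_pps / burst)


print(polling_ceiling_pps(2000, 256))  # 512000
print(hz_needed(784_000, 256))         # 3063
```

Under these assumptions, taking in the observed 784 kpps purely from
tick-driven polls would need hz of at least 3063 or a larger burst.
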
>>> Read above, it's actually transmitting 770kpps out of em3 so it 
>>> can't just be 512kpps.
>>
>> Transmitting is easier, but with polling it's even harder to send
>> faster than hz * queue_length than to receive.  This is without
>> polling in idle.
>>
>>> I was thinking of trying 4 or 5.. but how would that work with this 
>>> new hardware?
>>
>> Poorly, except possibly with polling in FreeBSD-4.  FreeBSD-4 generally
>> has lower overheads and latency, but is missing important improvements
>> (mainly tcp optimizations in upper layers, better DMA and/or mbuf
>> handling, and support for newer NICs).  FreeBSD-5 is also missing the
>> overhead+latency advantage.
>>
>> Here are some benchmarks.  (ttcp mainly tests sendto().  4.10 em needed a
>> 2-line change to support a not-so-new PCI em NIC.)  Summary:
>> - my bge NIC can handle about 600 kpps on my faster machine, but only
>>   achieves 300 in 4.10 unpatched.
>> - my em NIC can handle about 400 kpps on my slower machine, except in
>>   later versions it can receive at about 600 kpps.
>> - only 6.x and later can achieve near wire throughput for 1500-MTU
>>   packets (81 kpps vs 76 kpps).  This depends on better DMA or mbuf
>>   handling...  I now remember the details -- it is mainly better mbuf
>>   handling: old versions split the 1500-MTU packets into 2 mbufs and
>>   this causes 2 descriptors per packet, which causes extra software
>>   overheads and even larger overheads for the hardware.
>>
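As a cross-check on the "near wire throughput" figure in the summary
above, here is a small sketch of the theoretical packet rate for a given
UDP payload. It assumes a 1 Gbit/s link (the link speed is not stated
above) and standard Ethernet framing with no VLAN tag and no IP options.
The constant and function names are illustrative.

```python
ETH_HEADER = 14           # destination MAC, source MAC, EtherType
ETH_FCS = 4               # frame check sequence
PREAMBLE_AND_GAP = 8 + 12 # preamble/SFD plus inter-frame gap, in byte times
MIN_FRAME = 64            # minimum frame size (header + payload + FCS)
IP_UDP_HEADERS = 20 + 8   # IPv4 header without options, UDP header


def wire_pps(udp_payload, link_bps=1_000_000_000):
    """Theoretical packets/s for back-to-back frames on the link."""
    frame = ETH_HEADER + IP_UDP_HEADERS + udp_payload + ETH_FCS
    frame = max(frame, MIN_FRAME)  # short frames are padded to the minimum
    return link_bps / ((frame + PREAMBLE_AND_GAP) * 8)


print(round(wire_pps(1472)))  # 81274  (ttcp -l1472 -u, a 1500-byte IP packet)
print(round(wire_pps(5)))     # 1488095  (ttcp -l5 -u, padded to 64 bytes)
```

On those assumptions 81 kpps is wire speed for 1500-MTU packets and 76
kpps is roughly 93 to 94 percent of it, while the small-packet rates in
the 600 kpps range are well under half of the minimum-frame limit.
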
>> %%%
>> Results of benchmarks run on 23 Feb 2007:
>>
>> my~5.2 bge --> ~4.10 em
>>                              tx                      rx
>>                      kpps   load%    ips    kpps    load%    ips
>> ttcp -l5    -u -t     639     98    1660     398*     77      8k
>> ttcp -l5       -t     6.0    100    3960     6.0       6    5900
>> ttcp -l1472 -u -t      76     27     395      76      40      8k
>> ttcp -l1472    -t      51     40     11k      51      26      8k
>>
>> (*) Same as sender according to netstat -I, but systat -ip shows that
>>     almost half aren't delivered to upper layers.
>>
>> my~5.2 bge --> 4.11 em
>>                              tx                      rx
>>                      kpps   load%    ips    kpps    load%    ips
>> ttcp -l5    -u -t     635     98    1650     399*     74      8k
>> ttcp -l5       -t     5.8    100    3900     5.8       6    5800
>> ttcp -l1472 -u -t      76     27     395      76      32      8k
>> ttcp -l1472    -t      51     40     11k      51      25      8k
>>
>> (*) Same as sender according to netstat -I, but systat -ip shows that
>>     almost half aren't delivered to upper layers.
>>
>> my~5.2 bge --> my~5.2 em
>>                              tx                      rx
>>                      kpps   load%    ips    kpps    load%    ips
>> ttcp -l5    -u -t     638     98    1660     394*    100-     8k
>> ttcp -l5       -t     5.8    100    3900     5.8       9    6000
>> ttcp -l1472 -u -t      76     27     395      76      46      8k
>> ttcp -l1472    -t      51     40     11k      51      35      8k
>>
>> (*) Same as sender according to netstat -I, but systat -ip shows that
>>     almost half aren't delivered to upper layers.  With the em rate
>>     limit on ips changed from 8k to 80k, about 95% are delivered up.
>>
>> my~5.2 bge --> 6.2 em
>>                              tx                      rx
>>                      kpps   load%    ips    kpps    load%    ips
>> ttcp -l5    -u -t     637     98    1660     637     100-    15k
>> ttcp -l5       -t     5.8    100    3900     5.8       8     12k
>> ttcp -l1472 -u -t      76     27     395      76      36     16k
>> ttcp -l1472    -t      51     40     11k      51      37     16k
>>
>> my~5.2 bge --> ~current em-fastintr
>>                              tx                      rx
>>                      kpps   load%    ips    kpps    load%    ips
>> ttcp -l5    -u -t     641     98    1670     641      99      8k
>> ttcp -l5       -t     5.9    100    2670     5.9       7      6k
>> ttcp -l1472 -u -t      76     27     395      76      35      8k
>> ttcp -l1472    -t      52     43     11k      52      30      8k
>>
>> ~6.2 bge --> ~current em-fastintr
>>                              tx                      rx
>>                      kpps   load%    ips    kpps    load%    ips
>> ttcp -l5    -u -t     309     62    1600     309      64      8k
>> ttcp -l5       -t     4.9    100    3000     4.9       6      7k
>> ttcp -l1472 -u -t      76     27     395      76      34      8k
>> ttcp -l1472    -t      54     28    6800      54      30      8k
>>
>> ~current bge --> ~current em-fastintr
>>                              tx                      rx
>>                      kpps   load%    ips    kpps    load%    ips
>> ttcp -l5    -u -t     602    100    1570     602      99      8k
>> ttcp -l5       -t     5.3    100    2660     5.3       5    5300
>> ttcp -l1472 -u -t      81#    19     212      81#     38      8k
>> ttcp -l1472    -t      53     34     11k      53      30      8k
>>
>> (#) Wire speed to within 0.5%.  This is the only kpps in this set of
>>     benchmarks that is close to wire speed.  Older kernels apparently
>>     lose relative to -current because mbufs for mtu-sized packets are
>>     not contiguous in older kernels.
>>
>> Old results:
>>
>> ~4.10 bge --> my~5.2 em
>>                              tx                      rx
>>                      kpps   load%    ips    kpps    load%    ips
>> ttcp -l5    -u -t     n/a    n/a     n/a     346      79      8k
>> ttcp -l5       -t     n/a    n/a     n/a     5.4      10    6800
>> ttcp -l1472 -u -t     n/a    n/a     n/a      67      40      8k
>> ttcp -l1472    -t     n/a    n/a     n/a      51      36      8k
>>
>> ~4.10 kernel, =4 bge --> ~current em
>>                              tx                      rx
>>                      kpps   load%    ips    kpps    load%    ips
>> ttcp -l5    -u -t     n/a    n/a     n/a     347      96     14k
>> ttcp -l5       -t     n/a    n/a     n/a     5.8      10     14k
>> ttcp -l1472 -u -t     n/a    n/a     n/a      67      62     14K
>> ttcp -l1472    -t     n/a    n/a     n/a      52      40     16k
>>
>> ~4.10 kernel, =4+ bge --> ~current em
>>                              tx                      rx
>>                      kpps   load%    ips    kpps    load%    ips
>> ttcp -l5    -u -t     n/a    n/a     n/a     627     100      9k
>> ttcp -l5       -t     n/a    n/a     n/a     5.6       9     13k
>> ttcp -l1472 -u -t     n/a    n/a     n/a      68      63     14k
>> ttcp -l1472    -t     n/a    n/a     n/a      54      44     16k
>> %%%
>>
>> %%%
>> Results of benchmarks run on 28 Dec 2007:
>>
>> ~5.2 epsplex (em) ttcp:
>>                        Csw  Trp  Sys  Int  Sof      Sys  Intr  User  Idle
>> local no sink:        825k    3 206k  229 412k     52.1  45.1   2.8
>> local with sink:      659k    3 263k  231 131k     66.5  27.3   6.2
>> tx remote no sink:     35k    3 273k 8237 266k     42.0  52.1   2.3   3.6
>> tx remote with sink:   26k    3 394k 8224  100     60.0  5.41   3.4  11.2
>> rx remote no sink:     25k    4   26 8237 373k     20.6  79.4   0.0   0.0
>> rx remote with sink:   30k    3 203k 8237 398k     36.5  60.7   2.8   0.0
>>
>> 6.3-PR besplex (em) ttcp:
>>                        Csw  Trp  Sys  Int  Sof      Sys  Intr  User  Idle
>> local no sink:        417k    1 208k 418k    2     49.5  48.5   2.0
>> local with sink:      420k    1 276k 145k    2     70.0  23.6   6.4
>> tx remote no sink:     19k    2 250k 8144    2     58.5  38.7   2.8   0.0
>> tx remote with sink:   16k    2 361k 8336    2     72.9  24.0   3.1   4.4
>> rx remote no sink:     429    3   49  888    2      0.3  99.33  0.0   0.4
>> tx remote with sink:   13k    2 316k 5385    2     31.7  63.8   3.6   0.8
>>
>> 8.0-C epsplex (em-fast) ttcp:
>>                        Csw  Trp  Sys  Int  Sof      Sys  Intr  User  Idle
>> local no sink:        442k    3 221k  230 442k     47.2  49.6   2.7
>> local with sink:      394k    3 262k  228 131k     72.1  22.6   5.3
>> tx remote no sink:     17k    3 226k 7832  100     94.1   0.2   3.0   0.0
>> tx remote with sink:   17k    3 360k 7962  100     91.7   0.2   3.7   4.4
>> rx remote no sink:     saturated -- cannot update systat display
>> rx remote with sink:   15k    6 358k 8224  100     97.0   0.0   2.5   0.5
>>
>> ~4.10 besplex (bge) ttcp:
>>                        Csw  Trp  Sys  Int  Sof      Sys  Intr  User  Idle
>> local no sink:          15    0 425k  228   11     96.3   0.0   3.7
>> local with sink:        **    0 622k  229   **     94.7   0.3   5.0
>> tx remote no sink:      29    1 490k 7024   11     47.9  29.8   4.4  17.9
>> tx remote with sink:    26    1 635k 1883   11     65.7  11.4   5.6  17.3
>> rx remote no sink:       5    1   68 7025    1      0.0  47.3   0.0  52.7
>> rx remote with sink:  6679    2 365k 6899   12     19.7  29.2   2.5  48.7
>>
>> ~5.2-C besplex (bge) ttcp:
>>                        Csw  Trp  Sys  Int  Sof      Sys  Intr  User  Idle
>> local no sink:          1M    3 271k  229 543k     50.7  46.8   2.5
>> local with sink:        1M    3 406k  229 203k     67.4  28.2   4.4
>> tx remote no sink:     49k    3 474k  11k 167k     52.3  42.7   5.0   0.0
>> tx remote with sink:  6371    3 641k 1900  100     76.0  16.8   6.2   0.9
>> rx remote no sink:     34k    3   25  11k 270k      0.8  65.4   0.0  33.8
>> rx remote with sink:   41k    3 365k  10k 370k     31.5  47.1   2.3  19.0
>>
>> 6.3-PR besplex (bge) ttcp (hz = 1000 else stathz broken):
>>                        Csw  Trp  Sys  Int  Sof      Sys  Intr  User  Idle
>> local no sink:        540k    0 270k 540k    0     50.5  46.0   3.5
>> local with sink:      628k    0 417k 210k    0     68.8  27.9   3.3
>> tx remote no sink:     15k    1 222k 7190    1     28.4  29.3   1.7  40.6
>> tx remote with sink:  5947    1 315k 2825    1     39.9  14.7   2.6  42.8
>> rx remote no sink:     13k    1   23 6943    0      0.3  49.5   0.2  50.0
>> rx remote with sink:   20k    1 371k 6819    0     29.5  30.1   3.9  36.5
>>
>> 8.0-C besplex (bge) ttcp:
>>                        Csw  Trp  Sys  Int  Sof      Sys  Intr  User  Idle
>> local no sink:        649k    3 324k  100 649k     53.9  42.9   3.2
>> local with sink:      649k    3 433k  100 216k     75.2  18.8   6.0
>> tx remote no sink:     24k    3 432k  10k  100     49.7  41.3   2.4   6.6
>> tx remote with sink:  3199    3 568k 1580  100     64.3  19.6   4.0  12.2
>> rx remote no sink:     20k    3   27  10k  100      0.0  46.1   0.0  53.9
>> rx remote with sink:   31k    3 370k  10k  100     30.7  30.9   4.8  33.5
>> %%%
>>
>> Bruce
>> _______________________________________________
>> freebsd-net@freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-net
>> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>>
>
