Message-ID: <486C7F93.7010308@gtcomm.net>
Date: Thu, 03 Jul 2008 03:28:19 -0400
From: Paul
To: Bruce Evans
Cc: FreeBSD Net, Ingo Flaschberger
In-Reply-To: <20080703160540.W6369@delplex.bde.org>
Subject: Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]

Bruce Evans wrote:
> On Wed, 2 Jul 2008, Paul wrote:
>
>> ...
>> -----------Reboot with 4096/4096........(my guess is that it will be
>> a lot worse, more errors..)
>> ........
>> Without polling, 4096 is horrible, about 200kpps less ... :/
>> Turning on polling..
>> polling on, 4096 is bad,
>>            input          (em0)           output
>>   packets   errs     bytes    packets  errs      bytes colls
>>    622379 307753  38587506          1     0        178     0
>>    635689 277303  39412718          1     0        178     0
>> ...
>> ------Rebooting with 256/256 descriptors..........
>> ..........
>> No polling:
>>    843762  25337  52313248          1     0        178     0
>>    763555      0  47340414          1     0        178     0
>>    830189      0  51471722          1     0        178     0
>>    838724      0  52000892          1     0        178     0
>>    813594    939  50442832          1     0        178     0
>>    807303    763  50052790          1     0        178     0
>>    791024      0  49043492          1     0        178     0
>>    768316   1106  47635596          1     0        178     0
>> Machine is maxed and is unresponsive..
>
> That's the most interesting one.  Even 1% packet loss would probably
> destroy performance, so the benchmarks that give 10-50% packet loss
> are uninteresting.
>
But you realize that it's outputting all of these packets on em3 and I'm
watching them coming out and they are consistent with the packets
received on em0 that netstat shows are 'good' packets.

> All indications are that you are running out of CPU and memory (DMA
> and/or cache fills) throughput.  The above apparently hits both limits
> at the same time, while with more descriptors memory throughput runs
> out first.  1 CPU is apparently barely enough for 800 kpps (is this
> all with UP now?), and I think more CPUs could only be slower, as you
> saw with SMP, especially using multiple em taskqs, since memory traffic
> would be higher.  I wouldn't expect this to be fixed soon (except by
> throwing better/different hardware at it).
>
> The CPU/DMA balance can probably be investigated by slowing down the CPU/
> memory system.
>
I'm using a server opteron which supposedly has the best memory
performance out of any CPU right now.
Plus opterons have the biggest L1 cache, but small L2 cache.  Do you
think larger L2 cache on the Xeon (6 MB for 2 core) would be better?
I have a 2222 opteron coming which is 1 GHz faster so we will see what
happens :>
My NIC is PCI-E 4x so there's no bottleneck there.

> You may remember my previous mail about getting higher pps on bge.
> Again, all indications are that I'm running out of CPU, memory, and
> bus throughput too since the bus is only PCI 33MHz.  These interact
> in a complicated way which I haven't been able to untangle.  -current
> is fairly consistently slower than my ~5.2 by about 10%, apparently
> due to code bloat (extra CPU and related extra cache misses).  OTOH,
> like you I've seen huge variations for changes that should be null
> (e.g., disturbing the alignment of the text section without changing
> anything else).  My ~5.2 is very consistent since I rarely change it,
> while -current changes a lot and shows more variation, but with no
> sign of getting near the ~5.2 plateau or even its old peaks.
>
>> Polling ON:
>>            input          (em0)           output
>>   packets   errs     bytes    packets  errs      bytes colls
>>    784138 179079  48616564          1     0        226     0
>>    788815 129608  48906530          2     0        356     0
>>    755555 142997  46844426          2     0        468     0
>>    803670 144459  49827544          1     0        178     0
>>    777649 147120  48214242          1     0        178     0
>>    779539 146820  48331422          1     0        178     0
>>    786201 148215  48744478          2     0        356     0
>>    776013 101660  48112810          1     0        178     0
>>    774239 145041  48002834          2     0        356     0
>>    771774 102969  47850004          1     0        178     0
>>
>> Machine is responsive and has 40% idle cpu.. Why ALWAYS 40% ?  I'm
>> really mystified by this..
>
> Is this with hz=2000 and 256/256 and no polling in idle?  40% is easy
> to explain (perhaps incorrectly).  Polling can then read at most 256
> descriptors every 1/2000 second, giving a max throughput of 512 kpps.
> Packets < descriptors in general but might be equal here (for small
> packets).
> You seem to actually get 784 kpps, which is too high even
> in descriptors unless, but matches exactly if the errors are counted
> twice (784 - 179 - 505 ~= 512).  CPU is getting short too, but 40%
> still happens to be left over after giving up at 512 kpps.  Most of
> the errors are probably handled by the hardware at low cost in CPU by
> dropping packets.  There are other types of errors but none except
> dropped packets is likely.
>
Read above, it's actually transmitting 770kpps out of em3 so it can't
just be 512kpps.  I suppose multiple packets can fit in 1 descriptor?
I am using VERY small tcp packets..

>> Every time it maxes out and gets errors, top reports:
>> CPU:  0.0% user,  0.0% nice, 10.1% system, 45.3% interrupt, 44.6% idle
>> pretty much the same line every time
>>
>> 256/256 blows away 4096, probably fits the descriptors into the
>> cache lines on the cpu and 4096 has too many cache misses and causes
>> worse performance.
>
> Quite likely.  Maybe your systems have memory systems that are weak
> relative to other resources, so that they hit this limit sooner than
> expected.
>
> I should look at my "fixes" for bge, one that changes rxd from 256 to
> 512, and one that increases the ifq tx length from txd = 512 to about
> 20000.  Both of these might thrash caches.  The former makes little
> difference except for polling at < 4000 Hz, but I don't believe in or
> use polling.  The latter works around select() for write descriptors
> not working on sockets, so that high frequency polling from userland
> is not needed to determine a good time to retry after ENOBUFS errors.
> This is probably only important in pps benchmarks.  txd = 512 gives
> good efficiency in my version of bge, but might be too high for good
> throughput and is mostly wasted in distribution versions of FreeBSD.
>
I was thinking of trying 4 or 5.. but how would that work with this new
hardware?

Thanks
Paul
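
For anyone who wants to redo the polling arithmetic quoted above, here is a
rough sketch in Python.  It only encodes the simplified model from the
quoted text (each poll drains at most one full RX ring, one packet per
descriptor), which may well not match what the em driver really does.  The
function names are made up for illustration; the numbers are the hz=2000,
256-descriptor case and the first "Polling ON" row from netstat.

```python
def polling_ceiling_pps(rx_descriptors: int, hz: int) -> int:
    """Upper bound on packets/s under the simplified model.

    Assumption: one poll per clock tick, each poll reads at most
    rx_descriptors packets, one packet per descriptor.
    """
    return rx_descriptors * hz


def avg_packet_bytes(total_bytes: int, packets: int) -> float:
    """Average bytes per packet for one netstat sample."""
    return total_bytes / packets


if __name__ == "__main__":
    # 256 descriptors polled 2000 times a second: the 512 kpps figure.
    print("ceiling, 256 @ 2000 Hz:", polling_ceiling_pps(256, 2000), "pps")
    # First "Polling ON" sample: 48616564 bytes in 784138 packets.
    print("avg packet size:", avg_packet_bytes(48616564, 784138), "bytes")
```

The observed ~784 kpps of good packets is above the 512000 pps this model
allows, so at least one of its assumptions does not hold on this box.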