Date: Tue, 4 May 2004 18:17:39 -0400
From: Gerrit Nagelhout <gnagelhout@sandvine.com>
To: freebsd-current@freebsd.org
Subject: RE: 4.7 vs 5.2.1 SMP/UP bridging performance
Message-ID: <FE045D4D9F7AED4CBFF1B3B813C85337021AB377@mail.sandvine.com>
>>>>I would like to move to CURRENT for new hardware support, and the
>>>>ability to properly use multi-threading in user-space, but can't do
>>>>this until the performance bottlenecks are solved. I realize that
>>>>5.x is still a work in progress and hasn't been tuned as well as 4.7
>>>>yet, but are there any plans for optimizations in this area? Does
>>>>anyone have any suggestions on what else I can try?
>>>
>>>
>>>Try rwatson's netperf patches:
>>>
>>> http://www.watson.org/~robert/freebsd/netperf/
>>>
>>>There is at least one outstanding panic condition known, but more
>>>testing will be a great help.
>>>
>>>Kris
>>>
>>>P.S. You didn't mention the status of WITNESS, but I'm assuming you
>>>read the docs and disabled it since it's a huge performance killer.
>
>
>>WITNESS and INVARIANTS are turned off for the 5.2.1 release bits.
>>However, the debug.mpsafenet sysctl is also turned off. Turning this
>>on might give a significant performance boost for bridging.
>
>
>>Scott
>
>
> Thanks for all the responses so far. WITNESS is definitely disabled,
> as are the other INVARIANTS. I had a look through the netperf patches,
> but I don't think they will affect bridging very much. They seem to be
> directed more towards the socket layer and above.
>
> I still think that one of the bigger bottlenecks is the cost of all
> the mutexes in SMP mode, and some of the new bus_dma and mbuf code that
> was introduced.
>
> With previous platforms I have worked on (vxWorks), we had similar
> issues, and ended up pushing buckets of packets through the data path,
> so each mutex was only taken once for every 10-100 packets.
>
> Also, polling is currently done by only one CPU at a time. If this
> were changed to have multiple threads poll multiple devices at the
> same time, the performance should become much better.
>
> Thanks,
>
> Gerrit
>
>You are correct about the netperf patches being directed towards the
>socket layer. The IP stack and below was locked for 5.2, but the
>benefits won't be seen unless you turn on debug.mpsafenet. During
>the 5.2 development cycle I believe that benchmarking was done that
>showed that mpsafenet bridging was significantly faster than
>non-mpsafenet, and nearly as fast as 4.x if not a little faster.
>
>I'd be interested to know more about your comments about polling from
>multiple CPUs. Did you have a thread bound to each CPU, and did
>each thread poll every interface, or only an exclusive subset of the
>interfaces?
>
>Scott
>
>I tried enabling debug.mpsafenet, but it didn't make any difference.
>Which parts of the bridging path do you think should be faster with
>that enabled?
>
>I haven't actually tried implementing polling from multiple CPUs, but
>suggested it because I think it would help performance for certain
>applications (such as bridging). What I would probably do
>(without having given this a great deal of thought) is to:
>
>1) Have a variable controlling how many threads to use for polling
>2) Either lock an interface to a thread, or have interfaces switch
>   between threads depending on their load dynamically.
>
>One obvious problem with this approach will be mutex contention
>between threads. Even though the source interface would be owned
>by a thread, the destination would likely be owned by a different
>thread. I'm assuming that with the current mutex setup, only one
>thread can receive from or transmit to an interface at a time.
>Before this becomes feasible though, the cost of the mutexes should
>be addressed first (assuming that is the current bottleneck for SMP).
>
>Gerrit

I ran the following fragment of code to determine the cost of a
LOCK & UNLOCK on both UP and SMP:

    #define EM_LOCK(_sc)    mtx_lock(&(_sc)->mtx)
    #define EM_UNLOCK(_sc)  mtx_unlock(&(_sc)->mtx)

    /* TSC values are truncated to 32 bits; the unsigned subtraction
       below still gives the right delta for a short run like this. */
    unsigned int startTime, endTime, delta;
    int i;

    startTime = rdtsc();
    /* 100 uncontended lock/unlock pairs on the adapter mutex */
    for (i = 0; i < 100; i++)
    {
        EM_LOCK(adapter);
        EM_UNLOCK(adapter);
    }
    endTime = rdtsc();
    delta = endTime - startTime;
    printf("delta %u start %u end %u \n", (unsigned int)delta, startTime,
           endTime);

On a single hyperthreaded Xeon 2.8 GHz, it took ~30 cycles per
LOCK/UNLOCK pair (the measured delta divided by 100) under UP, and
~300 cycles under SMP. Assuming 10 locks for every packet (which is
conservative), at 500 Kpps, this accounts for:

    300 * 10 * 500000 = 1.5 billion cycles per second
    (out of the 2.8 billion cycles available per second)

Any comments?

Thanks,

Gerrit
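
The estimate above, and the batching idea mentioned earlier in the thread (taking each mutex once per 10-100 packets), can be put into a small model. This is only a sketch under stated assumptions: the function names and the `batch` parameter are hypothetical, it assumes the per-pair lock cost stays constant when packets are batched, and it ignores contention between CPUs entirely.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: cycles per second spent on lock/unlock pairs.
 * batch is the number of packets handled per lock acquisition;
 * batch == 1 is the per-packet locking assumed in the message.
 * Assumes the cost of one pair does not change with batching.
 */
static uint64_t
lock_cycles_per_sec(uint64_t cycles_per_pair, uint64_t locks_per_pkt,
    uint64_t pkts_per_sec, uint64_t batch)
{
	uint64_t batches_per_sec;

	assert(batch > 0);
	/* Round up: a partially filled batch still takes every lock once. */
	batches_per_sec = (pkts_per_sec + batch - 1) / batch;
	return (cycles_per_pair * locks_per_pkt * batches_per_sec);
}

/* Hypothetical helper: lock overhead as a percentage of one CPU's cycles. */
static double
lock_share_percent(uint64_t lock_cycles, uint64_t cpu_hz)
{
	assert(cpu_hz > 0);
	return (100.0 * (double)lock_cycles / (double)cpu_hz);
}
```

With the figures from the message, `lock_cycles_per_sec(300, 10, 500000, 1)` gives 1,500,000,000 cycles per second, roughly 54% of a 2.8 GHz CPU. The same model with a batch of 100 packets gives 15,000,000 cycles per second, which is why amortizing each lock over many packets looks attractive if the per-pair cost really is the bottleneck.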