Date: Fri, 7 May 2004 14:32:59 -0400
From: Gerrit Nagelhout <gnagelhout@sandvine.com>
To: 'Scott Long' <scottl@freebsd.org>, Robert Watson <rwatson@freebsd.org>, 'John Baldwin' <jhb@FreeBSD.org>, 'Bruce Evans' <bde@zeta.org.au>
Cc: freebsd-current@freebsd.org
Subject: RE: 4.7 vs 5.2.1 SMP/UP bridging performance
Message-ID: <FE045D4D9F7AED4CBFF1B3B813C85337021AB397@mail.sandvine.com>
Scott Long wrote:
> Robert Watson wrote:
> > On Fri, 7 May 2004, Brad Knowles wrote:
> >
> >> At 10:55 PM -0400 2004/05/06, Robert Watson wrote:
> >>
> >>> On occasion, I've had conversations with Peter Wemm about providing HAL
> >>> modules with optimized versions of various common routines for specific
> >>> hardware platforms.  However, that would require us to make a trade-off
> >>> between the performance benefits of inlining and the performance benefits
> >>> of a HAL module...
> >>
> >> I'm confused.  Couldn't you just do this sort of stuff as
> >> conditional macros, which would have both benefits?
> >
> > Well, the goal of introducing HAL modules would be that you don't have to
> > recompile the kernel in order to perform local hardware-specific
> > optimization of low level routines.  I.e., you could substitute faster
> > implementations of zeroing, synchronization, certain math routines, etc
> > based on the CPU discovered at run-time.  While you can have switch
> > statements, etc, it's faster just to relink the kernel to use the better
> > implementation for the available CPU.  However, if you do that, you still
> > end up with the function call cost, which might well outweigh the
> > benefits of specialization.
> >
> > Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
> > robert@fledge.watson.org      Senior Research Scientist, McAfee Research
>
> It really depends on how you link the HAL module in.  Calling indirectly
> through function pointers is pretty darn slow, and I suspect that the
> long pipeline of a P4 makes this even worse.  Switching to a better
> instruction might save you 20 cycles, but the indirect call to do it
> might cost you 30, and that assumes that the branched instruction stream
> is still in the L1 cache and that twiddling %esp and %ebp gives no
> pipeline stalls themselves.  Even without the indirect jump, all of the
> housekeeping that goes into making a function call might drown out most
> of the benefits.  The only way that this might make sense is if you move
> the abstraction upwards and have it encompass more common code, or do
> some sort of self-modifying code scheme early in the boot.  The
> alternative might be to have the HAL be a compile-time option like Brad
> hinted at.
>
> Scott

The biggest problem I still see with all of this is that even if I could
compile the kernel for the P4, under SMP there is still no fast locking
mechanism in place (that I am aware of, although I am researching that).

I ran a few more tests and did some more calculations to determine the
impact of (removing) mutexes, and here is what I found:

For UP, I was able to get 850 kpps, which is 3294 cycles/packet (at 2.8 GHz).
For SMP, it was 500 kpps, which is 5600 cycles/packet, or an additional
2306 cycles/packet, which presumably goes mostly towards the atomic locked
operations.  At ~120 extra cycles per lock for SMP, this means that there
should be around 19 atomic operations per packet.

After getting rid of one mutex (from IF_DEQUEUE; this is not safe, but fun
to try), the performance went to 530 kpps, or 5283 cycles/packet.  This is
a savings of ~317 cycles per packet.
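(Just to spell out the arithmetic: a throwaway user-space program like the
one below, with the clock rate, the measured packet rates and the assumed
~120 cycles per locked operation plugged in, reproduces the numbers above.
The program itself is purely illustrative.)

/*
 * Back-of-the-envelope check of the cycles/packet figures above.  The
 * 2.8 GHz clock, the packet rates and the ~120 cycles per locked
 * operation are the measured/assumed numbers quoted in this mail.
 */
#include <stdio.h>

int
main(void)
{
	const double hz = 2.8e9;		/* 2.8 GHz Xeon */
	const double up_pps = 850e3;		/* UP bridging rate */
	const double smp_pps = 500e3;		/* SMP bridging rate */
	const double cycles_per_lock = 120.0;	/* assumed cost of a locked op */
	double up_cycles, smp_cycles, extra;

	up_cycles = hz / up_pps;		/* ~3294 cycles/packet */
	smp_cycles = hz / smp_pps;		/* ~5600 cycles/packet */
	extra = smp_cycles - up_cycles;		/* ~2306 cycles/packet */

	printf("UP:  %.0f cycles/packet\n", up_cycles);
	printf("SMP: %.0f cycles/packet (%.0f extra)\n", smp_cycles, extra);
	printf("=> roughly %.0f atomic ops per packet\n",
	    extra / cycles_per_lock);
	return (0);
}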
After a quick look through the bridge code path, I found the following
atomic operations (I probably missed some, and might have some that don't
always lock, but the total seems about right):

em_process_receive_interrupts (EM_LOCK)
bus_dma ?
mb_alloc ? (MBP_PERSISTENT flag is set, where is this first locked?)
bridge_in (BDG_LOCK)
if_handoff (IF_LOCK)
em_start (EM_LOCK)
IF_DEQUEUE (IF_LOCK)
m_free (atomic_cmpset_int)
m_free (atomic_subtract_int)

At 2 locks/mutex this adds up to about 16 atomic operations per packet.
I think that some of the changes Robert mentioned before about putting
mbufs in a list before releasing the lock should help a lot for the
Xeons (a rough sketch of what I have in mind is in the PS below).

I am willing to try out some of these changes (both testing for
performance and making the actual code changes) because we can't switch
over to 5.x until the performance is back up to where 4.7 was.  Most of
my experience with FreeBSD (from the last half year or so of reading, and
changing a few things in, the code) is in the area of the low-level
network drivers (em) and some of the lower stack layers.  This is why I
have focused on the bridging data path to compare the performance.

I must admit that I don't know exactly what code changes are going on in
the stack, but if fine-grained locking means a (large) increase in the
number of mutexes throughout the stack, I am quite concerned about the
performance of the whole system on P4/Xeons.  With fine-grained locking I
think that the cost of individual functions will go up (a lot on the
Xeons :( ), but the overall performance may still be better because
multiple threads can do work simultaneously, if there is nothing else for
the other processors to do.  What I am concerned about is this: if you
have a dual-Xeon system with enough kernel (stack) work to keep one
processor busy, and enough user-space work to keep the other three
processors busy on 4.7, what will happen on 5.x?

Thanks,
Gerrit
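PS: To make the "collect mbufs under one lock acquisition" idea a bit more
concrete, a batched dequeue might look roughly like the sketch below.  This
is only an illustration: example_drain_batch and example_transmit are
made-up names, and it assumes the unlocked _IF_DEQUEUE variant of the queue
macro is available; it is not a patch against the real em/if code.

/*
 * Sketch: drain up to BATCH packets from an ifqueue with a single
 * lock/unlock pair instead of paying for one IF_LOCK/IF_UNLOCK per
 * packet.  Function names are made up for illustration only.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_var.h>

#define	BATCH	32

static void	example_transmit(struct mbuf *);	/* placeholder for the driver's TX path */

static void
example_drain_batch(struct ifqueue *ifq)
{
	struct mbuf *batch[BATCH];
	int i, n;

	IF_LOCK(ifq);
	for (n = 0; n < BATCH; n++) {
		_IF_DEQUEUE(ifq, batch[n]);	/* unlocked dequeue; lock already held */
		if (batch[n] == NULL)
			break;
	}
	IF_UNLOCK(ifq);

	/* Hand the packets to the hardware without holding the queue lock. */
	for (i = 0; i < n; i++)
		example_transmit(batch[i]);
}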