Date: Mon, 6 Sep 2004 16:15:38 -0400
From: "Gerrit Nagelhout" <gnagelhout@sandvine.com>
To: <current@freebsd.org>, "Scott Long" <scottl@freebsd.org>, "Robert Watson" <rwatson@freebsd.org>
Cc: Alex Hoff <ahoff@sandvine.com>
Subject: FreeBSD 5.3 Bridge performance take II
Message-ID: <A8535F8D62F3644997E91F4F66E341FC1F1CA5@exchange.sandvine.com>
Resent-Message-ID: <20040908112837.A59291@pooker.samsco.org>
Hi,
I have just finished some profiling and analysis of the FREEBSD_5_BP code
running a standard 4-port ethernet bridge (not netgraph). On the upside,
some of the features such as the netperf stuff, MUTEX_PROFILING and UMA
are very cool, and (I think) give the potential for a really fast bridge
(or similar application). However, the current performance is still rather
poor compared to 4.x, but I think that with the groundwork now in place,
some minor changes, and a couple of new features, it can be made much,
much faster.

I would like to discuss some possible optimizations (I will suggest some
below); we are then willing to take on some of them and give the code back
to FreeBSD. Hopefully these changes can be made on RELENG_5 in time to be
used by 5.4.
The tests that I have run so far have focused on the difference between
running in polling mode (dual 2.8GHz Xeon, two 2-port em NICs) and
interrupt mode (with debug.mpsafenet=1, and no INVARIANTS/WITNESS or
anything like that). In both setups I actually get similar throughput
(300kpps total in and out, divided evenly over the 4 ports). I think it
should be possible to get >> 1Mpps bridging on this platform.
In the polling case, there is still only one active thread, and the
limiting factor seems to be simply the number of mutex operations (11 per
packet according to MUTEX_PROFILING), plus overhead from UMA, bus_dma,
etc. With polling disabled, I think the problems come from the fact that
PREEMPTION was disabled (I can't even boot with it on), and from some
sub-optimal mutex usage resulting in a lot of collisions, even though in
theory all 4 cores should be able to run simultaneously.
Here is a sample profile (while in polling mode). The cpu_idle, cpu_halt,
etc. entries simply indicate that 3 of the cores have nothing to do, but
the profile still gives a pretty good sense of where all the time is being
spent. There are definitely a lot of cycles going to UMA, mutexes, etc.
(This profile only shows the top functions, and has the call tree
disabled, i.e. interrupt-based profiling only, because the test slows
down too much otherwise.)
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 18.4      10.25    10.25                             cpu_idle_default [1]
 13.8      17.94     7.69                             cpu_idle [2]
  6.5      21.57     3.63                             critical_exit [3]
  6.5      25.17     3.61                             _mtx_lock_spin [4]
  5.0      27.95     2.78                             uma_zalloc_arg [5]
  4.6      30.52     2.56                             cpu_halt [6]
  4.4      32.94     2.43                             uma_zfree_arg [7]
  3.9      35.12     2.18                             maybe_preempt [8]
  3.2      36.91     1.79                             bridge_in [9]
  2.8      38.46     1.55                             em_process_receive_interrupts [10]
  2.6      39.89     1.43                             _bus_dmamap_load_buffer [11]
  2.3      41.19     1.30                             bdg_forward [12]
  2.3      42.48     1.29                             mb_free_ext [13]
  1.8      43.49     1.01                             malloc_type_freed [14]
  1.7      44.44     0.95                             ether_input [15]
  1.7      45.39     0.94                             em_start [16]
  1.7      46.33     0.94                             _bus_dmamap_sync [17]
  1.5      47.18     0.84                             em_start_locked [18]
  1.2      47.85     0.68                             malloc_type_zone_allocated [19]
  1.2      48.52     0.67                             __mcount [20]
  1.2      49.17     0.65                             mb_ctor_pack [21]
  1.1      49.80     0.63                             em_encap [22]
  1.1      50.39     0.59                             free [23]
  1.0      50.94     0.56                             bus_dmamap_load_mbuf [24]
  0.9      51.46     0.51                             generic_bzero [25]
  0.9      51.96     0.50                             m_freem [26]
  0.8      52.42     0.46                             generic_bcopy [27]
  0.7      52.79     0.38                             em_get_buf [28]
  0.6      53.13     0.34                             em_clean_transmit_interrupts [29]
  0.5      53.42     0.29                             bus_dmamap_load [30]
  0.4      53.66     0.24                             m_adj [31]
  0.4      53.90     0.23                             malloc [32]
  0.4      54.11     0.22                             bus_dmamap_create [33]
  0.2      54.24     0.12                             bus_dmamem_free [35]
  0.2      54.35     0.11                             mb_dtor_pack [36]
  0.2      54.45     0.10                             em_tx_cb [37]
  0.2      54.54     0.09                             em_receive_checksum [38]
  0.1      54.61     0.08                             em_dmamap_cb [39]
  0.1      54.69     0.07                             m_tag_delete_chain [40]
  0.1      54.75     0.07                             _bus_dmamap_unload [41]
  0.1      54.82     0.06                             em_poll [42]
  0.1      54.88     0.06                             em_transmit_checksum_setup [43]
  0.1      54.93     0.05                             bus_dmamap_destroy [44]
  0.1      54.97     0.04                             _mtx_lock_sleep [47]
  0.1      55.00     0.03                             if_start [49]
  0.1      55.03     0.03                             bus_dmamap_load_uio [50]
  0.1      55.07     0.03    75189     0.00     0.00  netisr_poll [51]
  0.1      55.10     0.03                             em_smartspeed [52]
  0.1      55.13     0.03                             ithread_loop [34]
Here are the (top) results of the mutex profiling (these are basically
all the locks that get acquired once or twice per packet):
   max     total   count  avg  cnt_hold  cnt_lock  name
 24344  37552473  309134  121    151712    101781  if_em.c:956 (em5) (1)
 31578  10548396  309131   34     44233     81751  if_em.c:3432 (em4) (2)
   460   5813698  620705    9        16        79  uma_core.c:1800 (UMA pcpu) (3)
   428   4304975  619846    6        26        24  uma_core.c:2206 (UMA pcpu) (4)
   445   3129168  309127   10     30828     28115  bridge.c:1201 (em5) (5)
   462   3125131  309127   10    125294    122560  bridge.c:816 (bridge) (6)
   489   2815715  309134    9     14610     20050  if_em.c:926 (em5) (7)
   450   2573019  309170    8     94471    101577  kern_malloc.c:185 (devbuf) (8)
   419   2113089  309275    6     67982     65871  kern_malloc.c:210 (devbuf) (9)
The line numbers will be close to the RELENG_5_BP code but not exactly
the same because of some local modifications, so here are descriptions of
the mutexes involved:
1) em_start (used for transmit)
2) em_process_receive_interrupts (re-lock just after if_input)
3) uma_zalloc_arg (per CPU lock)
4) uma_zfree_arg (per CPU lock)
5) bdg_forward (IFQ_HANDOFF)
6) bridge_in (global bridge lock)
7) em_start_locked (IF_DEQUEUE)
8) malloc_type_zone_allocated
9) malloc_type_freed
From these numbers, the UMA locks seem to get taken twice for every
packet, but have no collisions. All the other locks have significant
collision problems, resulting in a lot of overhead.
Based on these stats, I have come up with the following
observations/suggestions that I would like to discuss.

As discussed before, there is a significant cost associated with every
mutex operation. I'd like to get down to less than 1 mutex acquisition
per packet (on average) through this path. Some of the possibilities for
doing this are:
- Implement workQs of packets (also suggested by Robert Watson in the
past). This will reduce the mutex traffic for numbers 1, 2, 5, 6 & 7
above, because it should be possible to take the lock once for a whole
queue of packets instead of once per packet. (See the first sketch after
this list.)
- Implement device-level caching for the UMA mbuf zones. If a driver
could allocate one bucket of mbufs at a time, no locking would be
required per allocation. The same goes for the free side: if you can
allocate an empty bucket, fill it up, and then return it, only a couple
of mutex operations are required per bucket. This would also reduce the
function call overhead for every packet. This change should actually get
rid of most of the remaining mutex overhead. (See the second sketch
below.)
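To make the workQ idea concrete, here is a rough sketch (untested;
em_xmit_one() is a made-up placeholder for the existing per-packet
transmit work). The point is that the ifqueue mutex is taken once per
burst rather than once per packet:

	#define EM_TX_BATCH	32

	static void
	em_start_batch(struct ifnet *ifp)
	{
		struct mbuf *batch[EM_TX_BATCH];
		int i, n = 0;

		/* One lock/unlock pair covers the whole burst. */
		IF_LOCK(&ifp->if_snd);
		while (n < EM_TX_BATCH) {
			_IF_DEQUEUE(&ifp->if_snd, batch[n]);
			if (batch[n] == NULL)
				break;
			n++;
		}
		IF_UNLOCK(&ifp->if_snd);

		/* The per-packet work now runs with no queue lock held. */
		for (i = 0; i < n; i++)
			em_xmit_one(ifp, batch[i]);
	}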
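For the device-level caching, no such interface exists in UMA today;
uma_zalloc_bucket() and MBUF_BUCKET_SIZE below are hypothetical names for
the proposed API, just to show its shape:

	static struct mbuf	*em_cache[MBUF_BUCKET_SIZE];
	static int		 em_ncached;

	static struct mbuf *
	em_mbuf_get(void)
	{
		if (em_ncached == 0) {
			/* One zone-lock round trip refills the whole bucket. */
			em_ncached = uma_zalloc_bucket(zone_mbuf,
			    (void **)em_cache, MBUF_BUCKET_SIZE, M_DONTWAIT);
			if (em_ncached == 0)
				return (NULL);
		}
		return (em_cache[--em_ncached]);	/* no lock taken here */
	}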
I think that one of the major reasons that polling with one thread had
about the same performance as interrupts with 4 threads/cores is that
some of the mutexes are held far too long, reducing parallelism. The
biggest culprit here is the em driver. First of all, there is only one
global lock for the driver, but there should be no reason that the rx &
tx paths couldn't run simultaneously. If we set up something like:
	EM_TX_LOCK()
	EM_TX_UNLOCK()
	EM_RX_LOCK()
	EM_RX_UNLOCK()
	EM_LOCK()	{ EM_TX_LOCK(); EM_RX_LOCK(); }
	EM_UNLOCK()	{ EM_TX_UNLOCK(); EM_RX_UNLOCK(); }
this driver will run much faster. Even within the receive and transmit
functions, the mutexes are held for a long time. It should be possible to
code these paths so that the mutex is released before trying to free or
allocate an mbuf. This should reduce the hold times, and thus the
collisions, a lot.
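A minimal sketch of what the split locks might look like, assuming the
adapter softc grows separate tx/rx mutexes (the field names here are made
up; the 5.x driver has a single per-adapter lock):

	#define EM_TX_LOCK(sc)		mtx_lock(&(sc)->tx_mtx)
	#define EM_TX_UNLOCK(sc)	mtx_unlock(&(sc)->tx_mtx)
	#define EM_RX_LOCK(sc)		mtx_lock(&(sc)->rx_mtx)
	#define EM_RX_UNLOCK(sc)	mtx_unlock(&(sc)->rx_mtx)
	/* Always acquire tx before rx so the combined lock cannot deadlock. */
	#define EM_LOCK(sc)		do { EM_TX_LOCK(sc); EM_RX_LOCK(sc); } while (0)
	#define EM_UNLOCK(sc)		do { EM_RX_UNLOCK(sc); EM_TX_UNLOCK(sc); } while (0)

and, for the hold-time problem, dropping the lock across mbuf operations,
e.g. in the rx cleanup path:

	EM_RX_UNLOCK(sc);
	m_freem(m);	/* mbuf is already off the ring, so this is safe unlocked */
	EM_RX_LOCK(sc);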
When overloading the bridge in interrupt mode, the system becomes
completely unresponsive (I can't even get into ddb) until the packet
source is removed. This is highly undesirable behaviour, but interrupt
mode is currently the only way to use multiple kernel threads to handle
the workload.
Extending polling to use multiple threads instead of one should work
around this problem. This is a bit of a design task in itself, and
probably worthy of a separate discussion. We are certainly willing to
give this a shot (hopefully with some external input).
The latest generation Xeons (Nocona) have a couple of new features that
are very useful for optimizing code. One of them is the ability to
prefetch a cache line for which the page is not yet in the TLB. It should
be possible to strategically sprinkle a few prefetches in the code and
get a big performance boost. This is probably pretty platform specific,
though, so I don't know how to do this in general: it will only benefit
some platforms (I don't know about AMD/alpha), and may slightly hurt some
others.
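For what it's worth, the wrapper for this can be tiny (a sketch; gcc's
__builtin_prefetch emits an SSE prefetch instruction on x86 and degrades
to a no-op on targets without prefetch support):

	static __inline void
	prefetch_ro(const void *p)
	{
		/* Read-only, no temporal locality: avoid polluting the caches. */
		__builtin_prefetch(p, 0, 0);
	}

For example, while refilling the rx ring one could call prefetch_ro() on
the next descriptor's buffer before touching the current one.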
In terms of cache efficiency, I am not sure that using the UMA mbuf
packet zone is the best way to go. To be able to put a cluster on a DMA
descriptor, you currently need to read the mbuf header to get the cluster
pointer. It may be more efficient to keep local caches of just clusters
and just mbufs. To allocate a cluster you then only need to read the
bucket array, and can add the cluster to the descriptor without having
anything but the array itself in cache. Once the packet has been filled
in, the cluster can be coupled to an mbuf header. The other advantage of
this is that, since the pointers for both are always readily available in
an array, they lend themselves well to s/w prefetching.
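As a data-structure sketch (all names hypothetical), the rx fill path
would then only ever touch the pointer arrays until completion time:

	struct em_rx_cache {
		void		*clusters[64];	/* bare 2K buffers to put on the ring */
		struct mbuf	*headers[64];	/* married to a cluster at rx completion */
		int		 ncl, nhdr;
	};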
The choice of scheduler, and the use of PREEMPTION, will probably make a
bit of a difference for these tests too, but I did not do much
experimentation because I couldn't even boot with the ULE scheduler &
PREEMPTION enabled. I suspect that preemption will help quite a bit when
there are mutex collisions.
This is all I have for now. As I mentioned previously, I'd like to
generate some discussion on these points, as well as hear ideas for
additional optimizations. We will definitely implement some of these
features ourselves, but would much rather give the code back and make
this a "cooperative effort".
Also, I haven't done any testing on the netgraph side of things yet, but
that will probably be next on the list.
Comments?
Thanks,
Gerrit Nagelhout
