Date: Wed, 11 Jun 2014 01:46:31 +0400
From: "Alexander V. Chernikov" <melifaro@FreeBSD.org>
To: Bryan Venteicher <bryanv@daemoninthecloset.org>, current@freebsd.org, net@freebsd.org
Subject: Re: dhclient sucks cpu usage...
Message-ID: <53977CB7.9010904@FreeBSD.org>
In-Reply-To: <20140610185626.GK31367@funkthat.com>
References: <20140610000246.GW31367@funkthat.com> <100488220.4292.1402369436876.JavaMail.root@daemoninthecloset.org> <5396CD41.2080300@FreeBSD.org> <20140610162443.GD31367@funkthat.com> <5397415B.5070409@FreeBSD.org> <20140610185626.GK31367@funkthat.com>
On 10.06.2014 22:56, John-Mark Gurney wrote:
> Alexander V. Chernikov wrote this message on Tue, Jun 10, 2014 at 21:33 +0400:
>> On 10.06.2014 20:24, John-Mark Gurney wrote:
>>> Alexander V. Chernikov wrote this message on Tue, Jun 10, 2014 at 13:17 +0400:
>>>> On 10.06.2014 07:03, Bryan Venteicher wrote:
>>>>> Hi,
>>>>>
>>>>> ----- Original Message -----
>>>>>> So, after finding out that nc has a stupidly small buffer size (2k
>>>>>> even though there is space for 16k), I was still not getting as good
>>>>>> performance using nc between machines, so I decided to generate some
>>>>>> flame graphs to try to identify issues... (Thanks to whoever included
>>>>>> a full set of modules, including dtraceall, on the memstick!)
>>>>>>
>>>>>> So, the first one is:
>>>>>> https://www.funkthat.com/~jmg/em.stack.svg
>>>>>>
>>>>>> As I was browsing around, em_handle_que was consuming quite a bit
>>>>>> of CPU for only doing ~50MB/sec over gige.. Running top -SH shows
>>>>>> me that the taskqueue for em was consuming about 50% CPU... Also pretty
>>>>>> high for only 50MB/sec... Looking closer, you'll see that bpf_mtap is
>>>>>> consuming ~3.18% (under ether_nh_input).. I know I'm not running
>>>>>> tcpdump or anything, but I think dhclient uses bpf to be able to
>>>>>> inject packets and listen in on them, so I killed off dhclient, and
>>>>>> instantly the taskqueue thread for em dropped down to 40% CPU...
>>>>>> (transfer rate only marginally improves, if it does)
>>>>>>
>>>>>> I decided to run another flame graph w/o dhclient running:
>>>>>> https://www.funkthat.com/~jmg/em.stack.nodhclient.svg
>>>>>>
>>>>>> and now _rxeof drops from 17.22% to 11.94%, pretty significant...
>>>>>>
>>>>>> So, if you care about performance, don't run dhclient...
>>>>>>
>>>>> Yes, I've noticed the same issue. It can absolutely kill performance
>>>>> in a VM guest. It is much more pronounced on only some of my systems,
>>>>> and I hadn't tracked it down yet. I wonder if this is fallout from
>>>>> the callout work, or if there was some bpf change.
>>>>>
>>>>> I've been using the kludgey workaround patch below.
>>>> Hm, pretty interesting.
>>>> dhclient should set up a proper filter (and it looks like it does so:
>>>> 13:10 [0] m@ptichko s netstat -B
>>>>  Pid Netif   Flags      Recv Drop Match Sblen Hblen Command
>>>> 1224   em0 -ifs--l  41225922    0    11     0     0 dhclient
>>>> )
>>>> see the "Match" count.
>>>> And BPF itself adds the cost of a read rwlock (+ bpf_filter() calls
>>>> for each consumer on the interface).
>>>> It should not introduce significant performance penalties.
>>> Don't forget that it has to process the returning ack's... So, you're
>> Well, it can still be captured with a proper filter like "ip && udp &&
>> port 67 or port 68".
>> We're using tcpdump at high packet rates (>1M pps) and it does not
>> influence processing _much_.
>> We should probably convert its rwlock to an rmlock and use per-cpu
>> counters for statistics, but that's a different story.
>>> looking at around 10k+ pps that you have to handle and pass through the
>>> filter... That's a lot of packets to process...
>>>
>>> Just for a bit more "double check", instead of using the HD as a
>>> source, I used /dev/zero... I ran netstat -w 1 -I em0 while
>>> running the test, and I was getting ~50.7MiB/s w/ dhclient running and
>>> then I killed dhclient and it instantly jumped up to ~57.1MiB/s..
>>> So I launched dhclient again, and it dropped back to ~50MiB/s...
>> dhclient uses different BPF sockets for reading and writing (and it
>> moves the write socket to a privileged child process via fork()).
>> The problem we're facing is that dhclient does not set _any_ read
>> filter on the write socket:
>> 21:27 [0] zfscurr0# netstat -B
>>  Pid Netif   Flags   Recv  Drop Match Sblen Hblen Command
>> 1529   em0 --fs--l  86774 86769 86784  4044  3180 dhclient
>> --------------------------------------- ^^^^^ --------------------------
>> 1526   em0 -ifs--l  86789     0     1     0     0 dhclient
>>
>> so all traffic is pushed down, introducing contention on the BPF
>> descriptor mutex.
>>
>> (That's why I've asked for netstat -B output.)
>>
>> Please try the attached patch to fix this. This is not the right way to
>> fix it; we'd better change BPF behavior so that write-only consumers are
>> not attached to the interface reader list at all.
>> This has been partially implemented as the net.bpf.optimize_writers hack,
>> but it does not work for all direct BPF consumers (those not using the
>> pcap(3) API).
>
> Ok, looks like this patch helps the issue...
>
> netstat -B; sleep 5; netstat -B:
>  Pid Netif   Flags     Recv Drop Match Sblen Hblen Command
>  958   em0 --fs--l  3880000   14    35  3868  2236 dhclient
>  976   em0 -ifs--l  3880014    0     1     0     0 dhclient
>  Pid Netif   Flags     Recv Drop Match Sblen Hblen Command
>  958   em0 --fs--l  4178525   14    35  3868  2236 dhclient
>  976   em0 -ifs--l  4178539    0     1     0     0 dhclient
>
> and now the rate only drops from ~66MiB/s to ~63MiB/s when dhclient is
> running... Still a significant drop (~5%), but better than before...

Interesting. Can you provide some traces (pmc or dtrace ones)?

I'm unsure if this will help, but it's worth trying: please revert my
previous patch, apply the attached kernel patch, reboot, set
net.bpf.optimize_writers to 1, and try again?
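To make the contention described above concrete: a write-only BPF consumer
can keep tapped packets from ever matching (and ever queuing on its
descriptor) by installing a reject-all read filter. A minimal userland
sketch, not dhclient's actual code; bpf_set_reject_all() is a hypothetical
helper and fd is an already-opened /dev/bpf descriptor bound to the
interface:

#include <sys/types.h>
#include <sys/time.h>
#include <sys/ioctl.h>
#include <net/bpf.h>

/*
 * Install a reject-all read filter on a write-only BPF fd.
 * A single BPF_RET|BPF_K instruction with k == 0 accepts zero bytes
 * of every packet, so nothing matches and nothing is queued on the
 * descriptor (no buffer work, no descriptor mutex contention).
 */
static int
bpf_set_reject_all(int fd)
{
	static struct bpf_insn insns[] = {
		BPF_STMT(BPF_RET | BPF_K, 0),
	};
	struct bpf_program prog;

	prog.bf_len = 1;
	prog.bf_insns = insns;
	return (ioctl(fd, BIOCSETF, &prog));
}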
[-- Attachment: bpf_optimize.diff (text/x-patch) --]

Index: sys/net/bpf.c
===================================================================
--- sys/net/bpf.c	(revision 266306)
+++ sys/net/bpf.c	(working copy)
@@ -44,7 +44,7 @@ __FBSDID("$FreeBSD$");
 #include <sys/types.h>
 #include <sys/param.h>
 #include <sys/lock.h>
-#include <sys/rwlock.h>
+#include <sys/rmlock.h>
 #include <sys/systm.h>
 #include <sys/conf.h>
 #include <sys/fcntl.h>
@@ -643,6 +643,20 @@ bpf_attachd(struct bpf_d *d, struct bpf_if *bp)
 }
 
 /*
+ * Check if filter looks like 'set snaplen'
+ * program used by pcap-bpf.c:pcap_activate_bpf()
+ */
+static void
+bpf_check_upgrade(u_long cmd, struct bpf_insn *fcode, u_int flen, int *is_snap)
+{
+
+	if (cmd == BIOCSETF && flen == 1 && fcode[0].code == (BPF_RET | BPF_K))
+		*is_snap = 1;
+	else
+		*is_snap = 0;
+}
+
+/*
  * Add d to the list of active bp filters.
  * Reuqires bpf_attachd() to be called before
  */
@@ -1728,7 +1742,7 @@ bpf_setf(struct bpf_d *d, struct bpf_program *fp,
 #endif
 	size_t size;
 	u_int flen;
-	int need_upgrade;
+	int is_snaplen, need_upgrade;
 
 #ifdef COMPAT_FREEBSD32
 	switch (cmd) {
@@ -1755,6 +1769,7 @@ bpf_setf(struct bpf_d *d, struct bpf_program *fp,
 #ifdef BPF_JITTER
 	jfunc = ofunc = NULL;
 #endif
+	is_snaplen = 0;
 	need_upgrade = 0;
 
 	/*
@@ -1773,6 +1788,8 @@ bpf_setf(struct bpf_d *d, struct bpf_program *fp,
 			free(fcode, M_BPF);
 			return (EINVAL);
 		}
+		/* Try to guess if this is snaplen cmd */
+		bpf_check_upgrade(cmd, fcode, flen, &is_snaplen);
 #ifdef BPF_JITTER
 		/* Filter is copied inside fcode and is perfectly valid. */
 		jfunc = bpf_jitter(fcode, flen);
@@ -1807,11 +1824,27 @@ bpf_setf(struct bpf_d *d, struct bpf_program *fp,
 			 * Do not require upgrade by first BIOCSETF
 			 * (used to set snaplen) by pcap_open_live().
 			 */
-			if (d->bd_writer != 0 && --d->bd_writer == 0)
-				need_upgrade = 1;
-			CTR4(KTR_NET, "%s: filter function set by pid %d, "
-			    "bd_writer counter %d, need_upgrade %d",
-			    __func__, d->bd_pid, d->bd_writer, need_upgrade);
+			if (d->bd_writer != 0) {
+				if (is_snaplen == 0) {
+					/*
+					 * We're probably using bpf directly.
+					 * Upgrade immediately.
+					 */
+					need_upgrade = 1;
+				} else if (--d->bd_writer == 0) {
+					/*
+					 * First snaplen filter has already
+					 * been set. This is probably catch-all
+					 * filter
+					 */
+					need_upgrade = 1;
+				}
+				CTR5(KTR_NET,
+				    "%s: filter function set by pid %d, "
+				    "bd_writer counter %d, snap %d upgrade %d",
+				    __func__, d->bd_pid, d->bd_writer,
+				    is_snaplen, need_upgrade);
+			}
 		}
 	}
 	BPFD_UNLOCK(d);
@@ -1842,6 +1875,7 @@ bpf_setif(struct bpf_d *d, struct ifreq *ifr)
 {
 	struct bpf_if *bp;
 	struct ifnet *theywant;
+	BPFIF_TRACKER;
 
 	BPF_LOCK_ASSERT();
 
@@ -2038,6 +2072,7 @@ bpf_tap(struct bpf_if *bp, u_char *pkt, u_int pktl
 #endif
 	u_int slen;
 	int gottime;
+	BPFIF_TRACKER;
 
 	gottime = BPF_TSTAMP_NONE;
 
@@ -2105,6 +2140,7 @@ bpf_mtap(struct bpf_if *bp, struct mbuf *m)
 #endif
 	u_int pktlen, slen;
 	int gottime;
+	BPFIF_TRACKER;
 
 	/* Skip outgoing duplicate packets. */
 	if ((m->m_flags & M_PROMISC) != 0 && m->m_pkthdr.rcvif == NULL) {
@@ -2158,6 +2194,7 @@ bpf_mtap2(struct bpf_if *bp, void *data, u_int dle
 	struct bpf_d *d;
 	u_int pktlen, slen;
 	int gottime;
+	BPFIF_TRACKER;
 
 	/* Skip outgoing duplicate packets. */
 	if ((m->m_flags & M_PROMISC) != 0 && m->m_pkthdr.rcvif == NULL) {
@@ -2477,7 +2514,7 @@ bpfattach2(struct ifnet *ifp, u_int dlt, u_int hdr
 	LIST_INIT(&bp->bif_wlist);
 	bp->bif_ifp = ifp;
 	bp->bif_dlt = dlt;
-	rw_init(&bp->bif_lock, "bpf interface lock");
+	rm_init(&bp->bif_lock, "bpf interface lock");
 	KASSERT(*driverp == NULL, ("bpfattach2: driverp already initialized"));
 	*driverp = bp;
 
@@ -2582,7 +2619,7 @@ bpf_ifdetach(void *arg __unused, struct ifnet *ifp
 
 		LIST_REMOVE(bp, bif_next);
 
-		rw_destroy(&bp->bif_lock);
+		rm_destroy(&bp->bif_lock);
 		free(bp, M_BPF);
 
 		nmatched++;
@@ -2696,6 +2733,7 @@ bpf_zero_counters(void)
 {
 	struct bpf_if *bp;
 	struct bpf_d *bd;
+	BPFIF_TRACKER;
 
 	BPF_LOCK();
 	LIST_FOREACH(bp, &bpf_iflist, bif_next) {
@@ -2760,6 +2798,7 @@ bpf_stats_sysctl(SYSCTL_HANDLER_ARGS)
 	int index, error;
 	struct bpf_if *bp;
 	struct bpf_d *bd;
+	BPFIF_TRACKER;
 
 	/*
 	 * XXX This is not technically correct. It is possible for non
Index: sys/net/bpf.h
===================================================================
--- sys/net/bpf.h	(revision 266306)
+++ sys/net/bpf.h	(working copy)
@@ -1260,7 +1260,7 @@ struct bpf_if {
 	u_int bif_dlt;		/* link layer type */
 	u_int bif_hdrlen;	/* length of link header */
 	struct ifnet *bif_ifp;	/* corresponding interface */
-	struct rwlock bif_lock;	/* interface lock */
+	struct rmlock bif_lock;	/* interface lock */
 	LIST_HEAD(, bpf_d) bif_wlist;	/* writer-only list */
 	int flags;		/* Interface flags */
 #endif
Index: sys/net/bpfdesc.h
===================================================================
--- sys/net/bpfdesc.h	(revision 266306)
+++ sys/net/bpfdesc.h	(working copy)
@@ -152,10 +152,11 @@ struct xbpf_d {
 	u_int64_t	bd_spare[4];
 };
 
-#define BPFIF_RLOCK(bif)	rw_rlock(&(bif)->bif_lock)
-#define BPFIF_RUNLOCK(bif)	rw_runlock(&(bif)->bif_lock)
-#define BPFIF_WLOCK(bif)	rw_wlock(&(bif)->bif_lock)
-#define BPFIF_WUNLOCK(bif)	rw_wunlock(&(bif)->bif_lock)
+#define BPFIF_TRACKER		struct rm_priotracker tracker
+#define BPFIF_RLOCK(bif)	rm_rlock(&(bif)->bif_lock, &tracker)
+#define BPFIF_RUNLOCK(bif)	rm_runlock(&(bif)->bif_lock, &tracker)
+#define BPFIF_WLOCK(bif)	rm_wlock(&(bif)->bif_lock)
+#define BPFIF_WUNLOCK(bif)	rm_wunlock(&(bif)->bif_lock)
 
 #define BPFIF_FLAG_DYING	1	/* Reject new bpf consumers */
 
Index: sys/kern/subr_witness.c
===================================================================
--- sys/kern/subr_witness.c	(revision 266306)
+++ sys/kern/subr_witness.c	(working copy)
@@ -550,7 +550,7 @@ static struct witness_order_list_entry order_lists
 	 * BPF
 	 */
 	{ "bpf global lock", &lock_class_mtx_sleep },
-	{ "bpf interface lock", &lock_class_rw },
+	{ "bpf interface lock", &lock_class_rm },
 	{ "bpf cdev lock", &lock_class_mtx_sleep },
 	{ NULL, NULL },
 	/*
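For reference, the "set snaplen" program that bpf_check_upgrade() in the
patch tries to recognize is the single-instruction filter libpcap installs
on a freshly activated descriptor (pcap-bpf.c:pcap_activate_bpf()). A
rough sketch; set_snaplen_filter() is a hypothetical wrapper and fd is an
open /dev/bpf descriptor:

#include <sys/types.h>
#include <sys/time.h>
#include <sys/ioctl.h>
#include <net/bpf.h>

/*
 * One BPF_RET|BPF_K instruction: "return snaplen bytes of every
 * packet".  This is exactly the flen == 1 shape the patch treats as
 * a snaplen setter rather than a real read filter, so it does not
 * trigger the writer-to-reader upgrade by itself.
 */
static int
set_snaplen_filter(int fd, u_int snaplen)
{
	struct bpf_insn total_insn;
	struct bpf_program total_prog;

	total_insn.code = (u_short)(BPF_RET | BPF_K);
	total_insn.jt = 0;
	total_insn.jf = 0;
	total_insn.k = snaplen;

	total_prog.bf_len = 1;
	total_prog.bf_insns = &total_insn;
	return (ioctl(fd, BIOCSETF, &total_prog));
}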
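And a sketch of the rm(9) read-mostly lock pattern the BPFIF_* macro
change above relies on; example_lock and example_tap_path() are
hypothetical names, not kernel code:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rmlock.h>

/*
 * Each reader declares a stack-local rm_priotracker (this is what
 * BPFIF_TRACKER expands to) and takes a nearly free read lock on the
 * per-packet path; the rare writers (interface attach/detach) pay the
 * synchronization cost instead, which suits BPF's read-mostly taps.
 */
static struct rmlock example_lock;	/* rm_init(&example_lock, "example") at attach */

static void
example_tap_path(void)
{
	struct rm_priotracker tracker;

	rm_rlock(&example_lock, &tracker);
	/* ... per-packet work, e.g. running bpf_filter() per descriptor ... */
	rm_runlock(&example_lock, &tracker);
}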