Date: Thu, 29 Aug 2013 04:49:31 -0700
From: Adrian Chadd <adrian@freebsd.org>
To: "Alexander V. Chernikov" <melifaro@yandex-team.ru>
Cc: Luigi Rizzo <luigi@freebsd.org>, Andre Oppermann <andre@freebsd.org>, "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>, FreeBSD Net <net@freebsd.org>, "Andrey V. Elsukov" <ae@freebsd.org>, Gleb Smirnoff <glebius@freebsd.org>, "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>
Subject: Re: Network stack changes
Message-ID: <CAJ-Vmo=N=HnZVCD41ZmDg2GwNnoa-tD0J0QLH80x=f7KA5d%2BUg@mail.gmail.com>
In-Reply-To: <521E41CB.30700@yandex-team.ru>
References: <521E41CB.30700@yandex-team.ru>
Hi,

There's a lot of good stuff to review here, thanks!

Yes, the ixgbe RX lock needs to die in a fire. It's kinda pointless to keep locking things like that on a per-packet basis. We should be able to do this in a cleaner way - we can defer RX into a CPU-pinned taskqueue and convert the interrupt handler to a fast handler that just schedules that taskqueue. We can ignore the ithread entirely here.

What do you think?

Totally pie in the sky handwaving at this point:

* create an array of mbuf pointers for completed mbufs;
* populate the mbuf array;
* pass the array up to ether_demux().

For vlan handling, it may end up populating its own list of mbufs to push up to ether_demux(). So maybe we should extend the API to have a bitmap of packets to actually handle from the array, so we can pass up a larger array of mbufs, note which ones are for the destination, and then the upcall can mark which frames it has consumed.

I specifically wonder how much work/benefit we may see by doing:

* batching packets into lists so various steps can batch process things rather than run to completion;
* batching the processing of a list of frames under a single lock instance - eg, if the forwarding code could do the forwarding lookup for 'n' packets under a single lock, then pass that list of frames up to inet_pfil_hook() to do the work under one lock, etc, etc.

Here, the processing would look less like "grab lock and process to completion" and more like "mark and sweep" - ie, we have a list of frames that we mark as needing processing and mark as having been processed at each layer, so we know where to dispatch them next.

I still have some tool coding to do with PMC before I even think about tinkering with this, as I'd like to measure stuff like per-packet latency as well as top-level processing overhead (ie, CPU_CLK_UNHALTED.THREAD_P / lagg0 TX bytes/pkts, RX bytes/pkts, NIC interrupts on that core, etc.).

Thanks,

-adrian
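To make the bitmap / "mark and sweep" idea above a bit more concrete, here is a rough userland sketch in C. It is not kernel code and does not use any real FreeBSD API: `struct pkt` is a hypothetical stand-in for an mbuf, and `pkt_batch`, `batch_init()`, `batch_mark_vlan()` and `batch_sweep()` are names made up for illustration. The 64-frame limit is simply an assumption so that one `uint64_t` can act as the bitmap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH_MAX 64    /* assumption: one bit per frame in a 64-bit bitmap */

/* Hypothetical stand-in for struct mbuf; only what the sketch needs. */
struct pkt {
    uint16_t vlan_tag;      /* 0 means untagged */
    int      handled_by;    /* id of the layer that consumed it, -1 if none */
};

/* An array of frames plus a bitmap of the ones still needing work. */
struct pkt_batch {
    struct pkt *pkts[BATCH_MAX];
    size_t      count;
    uint64_t    pending;    /* bit i set: pkts[i] has not been consumed yet */
};

/* Fill the batch and mark every frame as pending. */
static void
batch_init(struct pkt_batch *b, struct pkt **pkts, size_t n)
{
    assert(n <= BATCH_MAX);
    for (size_t i = 0; i < n; i++)
        b->pkts[i] = pkts[i];
    b->count = n;
    b->pending = (n == BATCH_MAX) ? UINT64_MAX : ((UINT64_C(1) << n) - 1);
}

/* "Mark": build a bitmap of the pending frames that carry the given tag. */
static uint64_t
batch_mark_vlan(const struct pkt_batch *b, uint16_t tag)
{
    uint64_t mask = 0;

    for (size_t i = 0; i < b->count; i++) {
        if (((b->pending >> i) & 1) && b->pkts[i]->vlan_tag == tag)
            mask |= UINT64_C(1) << i;
    }
    return mask;
}

/*
 * "Sweep": a layer consumes the frames selected by mask in one pass
 * (in a real stack, this is where a single lock could cover the whole
 * pass), then clears their pending bits. Returns the number consumed.
 */
static size_t
batch_sweep(struct pkt_batch *b, uint64_t mask, int layer_id)
{
    size_t consumed = 0;

    mask &= b->pending;     /* never re-handle an already consumed frame */
    for (size_t i = 0; i < b->count; i++) {
        if ((mask >> i) & 1) {
            b->pkts[i]->handled_by = layer_id;
            consumed++;
        }
    }
    b->pending &= ~mask;
    return consumed;
}
```

A caller would build the batch once per RX pass, let each layer mark the frames it is interested in and sweep them, and then hand whatever is still set in `pending` to the next layer.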