From owner-freebsd-hackers@FreeBSD.ORG Tue Sep 24 07:58:31 2013
Delivered-To: freebsd-hackers@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 2E99BDED;
 Tue, 24 Sep 2013 07:58:31 +0000 (UTC) (envelope-from zec@fer.hr)
Received: from mail.fer.hr (mail.fer.hr [161.53.72.233])
 (using TLSv1 with cipher AES128-SHA (128/128 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 5411428EC;
 Tue, 24 Sep 2013 07:58:29 +0000 (UTC)
Received: from dyn10.nxlab.fer.hr (161.53.63.210) by MAIL.fer.hr (161.53.72.233)
 with Microsoft SMTP Server (TLS) id 14.2.342.3;
 Tue, 24 Sep 2013 09:57:14 +0200
From: Marko Zec <zec@fer.hr>
Subject: Re: Network stack changes
Date: Tue, 24 Sep 2013 09:58:05 +0200
User-Agent: KMail/1.9.10
References: <521E41CB.30700@yandex-team.ru> <523F4F14.9090404@yandex-team.ru>
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-ID: <201309240958.06172.zec@fer.hr>
X-Originating-IP: [161.53.63.210]
Cc: "Alexander V. Chernikov", Adrian Chadd, Andre Oppermann, FreeBSD Net,
 Luigi Rizzo, "Andrey V. Elsukov", "freebsd-arch@freebsd.org", Sami Halabi
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
X-List-Received-Date: Tue, 24 Sep 2013 07:58:31 -0000

On Tuesday 24 September 2013 00:46:46 Sami Halabi wrote:
> Hi,
>
> http://info.iet.unipi.it/~luigi/papers/20120601-dxr.pdf
> http://www.nxlab.fer.hr/dxr/stable_8_20120824.diff
>
> I've tried the diff in 10-CURRENT; it applied cleanly, but I had errors
> compiling the new kernel... Is there any work to make it work? I'd love
> to test it.

Even if you got it to compile on current, you could only run synthetic
tests measuring lookup performance using streams of random keys, as
outlined in the paper.  (BTW, the paper at Luigi's site is an older
draft; the final version with slightly revised benchmarks is available
here:
http://www.sigcomm.org/sites/default/files/ccr/papers/2012/October/2378956-2378961.pdf)
I.e., the code only hooks into the routing API for testing purposes,
but is completely disconnected from the forwarding path.

We have a prototype in the works which combines DXR with netmap in
userspace and is capable of sustaining forwarding at well above line
rate with full-sized BGP views using Intel 10G cards on commodity
multicore machines.  The work somewhat stalled during the summer, but I
plan to wrap it up and release the code by the end of this year.  With
recent advances in netmap it might also be feasible to merge DXR and
netmap entirely inside the kernel, but I've not explored that path
yet...

Marko
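
For readers who haven't seen the paper: DXR's lookup stage is a
direct-indexed table over the upper bits of the destination address (16
bits in the D16R variant), falling back to a binary search over a
per-chunk range table whenever a chunk spans more than one next hop.  A
minimal C sketch of that idea follows; the structure layouts and field
names are made up for illustration and are not the actual code from the
diff above.

#include <stdint.h>

/*
 * Toy sketch of a D16R-style lookup: the upper 16 bits of the
 * destination address select a direct-table entry; if that chunk is
 * not resolved to a single next hop, binary-search the chunk's range
 * table, keyed on the lower 16 bits.  Layouts and names are
 * illustrative only.
 */

struct range_entry {
	uint16_t start;		/* lowest low-16-bit value in this range */
	uint16_t nexthop;	/* next-hop index for the whole range */
};

struct direct_entry {
	uint32_t base;		/* index of the chunk's first range */
	uint16_t fragments;	/* number of ranges; 0 = resolved */
	uint16_t nexthop;	/* valid only when fragments == 0 */
};

static struct direct_entry direct_tbl[1 << 16];
static struct range_entry range_tbl[1 << 20];	/* size is arbitrary here */

static uint16_t
dxr_lookup(uint32_t dst)
{
	struct direct_entry *de = &direct_tbl[dst >> 16];
	uint16_t key = dst & 0xffff;
	uint32_t lo, hi, mid;

	if (de->fragments == 0)
		return (de->nexthop);	/* whole chunk -> one next hop */

	/*
	 * Find the last range with start <= key.  The first range of
	 * every chunk starts at 0, so 'lo' always stays valid.
	 */
	lo = de->base;
	hi = de->base + de->fragments;
	while (hi - lo > 1) {
		mid = (lo + hi) / 2;
		if (range_tbl[mid].start <= key)
			lo = mid;
		else
			hi = mid;
	}
	return (range_tbl[lo].nexthop);
}

The real encodings are far more compact - the point of DXR is that the
whole structure stays small enough to remain cache-resident even for a
full BGP view - so treat this only as a statement of the lookup logic;
see the paper for the exact layouts.
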
> Sami
>
> On Sun, Sep 22, 2013 at 11:12 PM, Alexander V. Chernikov <
> melifaro@yandex-team.ru> wrote:
> > On 29.08.2013 15:49, Adrian Chadd wrote:
> >> Hi,
> >
> > Hello Adrian!
> > I'm very sorry for the looong reply.
> >
> >> There's a lot of good stuff to review here, thanks!
> >>
> >> Yes, the ixgbe RX lock needs to die in a fire.  It's kinda pointless
> >> to keep locking things like that on a per-packet basis.  We should
> >> be able to do this in a cleaner way - we can defer RX into a
> >> CPU-pinned taskqueue and convert the interrupt handler to a fast
> >> handler that just schedules that taskqueue.  We can ignore the
> >> ithread entirely here.
> >>
> >> What do you think?
> >
> > Well, it sounds good :) But performance numbers and Jack's opinion
> > are more important :)
> >
> > Are you going to Malta?
> >
> >> Totally pie-in-the-sky handwaving at this point:
> >>
> >> * create an array of mbuf pointers for completed mbufs;
> >> * populate the mbuf array;
> >> * pass the array up to ether_demux().
> >>
> >> For vlan handling, it may end up populating its own list of mbufs
> >> to push up to ether_demux().  So maybe we should extend the API to
> >> have a bitmap of packets to actually handle from the array, so we
> >> can pass up a larger array of mbufs, note which ones are for the
> >> destination, and then the upcall can mark which frames it has
> >> consumed.
> >>
> >> I specifically wonder how much work/benefit we may see by doing:
> >>
> >> * batching packets into lists so various steps can batch-process
> >> things rather than run to completion;
> >> * batching the processing of a list of frames under a single lock
> >> instance - e.g., if the forwarding code could do the forwarding
> >> lookup for 'n' packets under a single lock, then pass that list of
> >> frames up to inet_pfil_hook() to do the work under one lock, etc,
> >> etc.
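
To make the two proposals above concrete, here is a rough C sketch of
how they could fit together: a filter-only interrupt handler that
merely schedules a taskqueue, and a taskqueue handler that harvests
completed descriptors into an mbuf array before handing packets up.
This is hypothetical glue, not code from ixgbe or any other driver -
rxq_harvest() stands in for the device-specific ring walk, and CPU
pinning of the taskqueue thread is elided.

#include <sys/param.h>
#include <sys/bus.h>
#include <sys/kernel.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>
#include <sys/priority.h>
#include <sys/taskqueue.h>
#include <net/if.h>
#include <net/if_var.h>

#define	RX_BATCH	32

struct rxq {
	struct ifnet		*ifp;
	struct taskqueue	*tq;
	struct task		 rx_task;
	void			*intr_cookie;
	/* device-specific RX ring state would live here */
};

/*
 * Device-specific ring walk, not shown: fills 'batch' with up to
 * 'limit' completed packets and returns how many it collected.
 */
static int	rxq_harvest(struct rxq *, struct mbuf **batch, int limit);

/*
 * Filter-only interrupt handler: no locks, no packet work.  It would
 * mask the queue interrupt (device-specific, elided) and schedule the
 * taskqueue; the ithread machinery is bypassed entirely.
 */
static int
rxq_intr_filter(void *arg)
{
	struct rxq *rxq = arg;

	taskqueue_enqueue(rxq->tq, &rxq->rx_task);
	return (FILTER_HANDLED);
}

/*
 * Taskqueue handler: pull completed packets into a local mbuf array,
 * then hand them up.
 */
static void
rxq_rx_task(void *arg, int pending)
{
	struct rxq *rxq = arg;
	struct mbuf *batch[RX_BATCH];
	int i, n;

	while ((n = rxq_harvest(rxq, batch, RX_BATCH)) > 0) {
		/*
		 * Each packet still goes up individually here; the
		 * array-plus-bitmap API discussed above would turn
		 * this loop into a single batched call.
		 */
		for (i = 0; i < n; i++)
			(*rxq->ifp->if_input)(rxq->ifp, batch[i]);
	}
	/* re-enable the queue interrupt here (device-specific) */
}

static int
rxq_setup(device_t dev, struct rxq *rxq, struct resource *irq_res)
{

	TASK_INIT(&rxq->rx_task, 0, rxq_rx_task, rxq);
	rxq->tq = taskqueue_create_fast("rxq", M_NOWAIT,
	    taskqueue_thread_enqueue, &rxq->tq);
	taskqueue_start_threads(&rxq->tq, 1, PI_NET, "%s rxq",
	    device_get_nameunit(dev));

	/* No ithread handler: rxq_intr_filter runs in filter context. */
	return (bus_setup_intr(dev, irq_res, INTR_TYPE_NET | INTR_MPSAFE,
	    rxq_intr_filter, NULL, rxq, &rxq->intr_cookie));
}

Until a batched if_input()/ether_demux() variant actually exists, the
harvest loop still hands packets up one at a time, so the batching in
this sketch stops at the driver boundary; the array-plus-bitmap API
described above is what would remove the remaining per-packet upcalls.
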
> > I'm thinking the same way, but we're stuck with the 'forwarding
> > lookup' due to the problem with the egress interface pointer, as I
> > mentioned earlier.  However, it is interesting to see how much it
> > helps, regardless of locking.
> >
> > Currently I'm thinking that we should try to change the radix code
> > to something different (it seems this can be tested quickly) and see
> > what happens.  Luigi's performance numbers for our radix are awful,
> > and there is a patch implementing an alternative trie:
> > http://info.iet.unipi.it/~luigi/papers/20120601-dxr.pdf
> > http://www.nxlab.fer.hr/dxr/stable_8_20120824.diff
> >
> >> Here, the processing would look less like "grab lock and process to
> >> completion" and more like "mark and sweep" - i.e., we have a list
> >> of frames that we mark as needing processing and mark as having
> >> been processed at each layer, so we know where to next dispatch
> >> them.
> >>
> >> I still have some tool coding to do with PMC before I even think
> >> about tinkering with this, as I'd like to measure stuff like
> >> per-packet latency as well as top-level processing overhead (i.e.,
> >> CPU_CLK_UNHALTED.THREAD_P / lagg0 TX bytes/pkts, RX bytes/pkts, NIC
> >> interrupts on that core, etc.)
> >
> > That will be great to see!
> >
> >> Thanks,
> >>
> >> -adrian
> >
> > _______________________________________________
> > freebsd-net@freebsd.org mailing list
> > http://lists.freebsd.org/mailman/listinfo/freebsd-net
> > To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>
> --
> Sami Halabi
> Information Systems Engineer
> NMS Projects Expert
> FreeBSD SysAdmin Expert