From owner-freebsd-net@FreeBSD.ORG Tue Sep 24 08:47:29 2013
Date: Tue, 24 Sep 2013 09:47:24 +0100
From: Joe Holden
To: freebsd-net@freebsd.org
Subject: Re: Network stack changes
Message-ID: <5241519C.9040908@rewt.org.uk>
In-Reply-To: <201309240958.06172.zec@fer.hr>
References: <521E41CB.30700@yandex-team.ru> <523F4F14.9090404@yandex-team.ru>
 <201309240958.06172.zec@fer.hr>
List-Id: Networking and TCP/IP with FreeBSD

On 24/09/2013 08:58, Marko Zec wrote:
> On Tuesday 24 September 2013 00:46:46 Sami Halabi wrote:
>> Hi,
>>
>>> http://info.iet.unipi.it/~luigi/papers/20120601-dxr.pdf
>>> http://www.nxlab.fer.hr/dxr/stable_8_20120824.diff
>>
>> I've tried the diff on 10-CURRENT; it applied cleanly, but I got errors
>> compiling the new kernel... Is there any work underway to make it build?
>> I'd love to test it.
>
> Even if you made it compile on current, you could only run synthetic tests
> measuring lookup performance using streams of random keys, as outlined in
> the paper. (By the way, the paper at Luigi's site is an older draft; the
> final version with slightly revised benchmarks is available here:
> http://www.sigcomm.org/sites/default/files/ccr/papers/2012/October/2378956-2378961.pdf)
>
> That is, the code only hooks into the routing API for testing purposes,
> but is completely disconnected from the forwarding path.

Aha! How much work would it take to make it usable for actual forwarding?

> We have a prototype in the works which combines DXR with netmap in
> userspace and is capable of sustaining well above line-rate forwarding
> with full-sized BGP views using Intel 10G cards on commodity multicore
> machines. The work stalled somewhat during the summer, but I plan to wrap
> it up and release the code by the end of this year. With recent advances
> in netmap it might also be feasible to merge DXR and netmap entirely
> inside the kernel, but I've not explored that path yet...

Mmm, forwarding using netmap would be pretty awesome...

> Marko
>
>> Sami
>>
>> On Sun, Sep 22, 2013 at 11:12 PM, Alexander V. Chernikov
>> <melifaro@yandex-team.ru> wrote:
>>> On 29.08.2013 15:49, Adrian Chadd wrote:
>>>> Hi,
>>>
>>> Hello Adrian!
>>> I'm very sorry for the looong reply.
>>>
>>>> There's a lot of good stuff to review here, thanks!
>>>>
>>>> Yes, the ixgbe RX lock needs to die in a fire. It's kinda pointless to
>>>> keep locking things like that on a per-packet basis. We should be able
>>>> to do this in a cleaner way - we can defer RX into a CPU-pinned
>>>> taskqueue and convert the interrupt handler to a fast handler that
>>>> just schedules that taskqueue. We can ignore the ithread entirely
>>>> here.
>>>>
>>>> What do you think?
>>>
>>> Well, it sounds good :) But performance numbers and Jack's opinion are
>>> more important :)
>>>
>>> Are you going to Malta?
>>>
>>>> Totally pie-in-the-sky handwaving at this point:
>>>>
>>>> * create an array of mbuf pointers for completed mbufs;
>>>> * populate the mbuf array;
>>>> * pass the array up to ether_demux().
>>>>
>>>> For vlan handling, it may end up populating its own list of mbufs to
>>>> push up to ether_demux(). So maybe we should extend the API to have a
>>>> bitmap of packets to actually handle from the array, so we can pass up
>>>> a larger array of mbufs, note which ones are for this destination, and
>>>> then the upcall can mark which frames it has consumed.
>>>>
>>>> I specifically wonder how much work/benefit we may see by doing:
>>>>
>>>> * batching packets into lists so various steps can batch-process
>>>> things rather than run to completion;
>>>> * batching the processing of a list of frames under a single lock
>>>> instance - eg, if the forwarding code could do the forwarding lookup
>>>> for 'n' packets under a single lock, then pass that list of frames up
>>>> to inet_pfil_hook() to do the work under one lock, etc, etc.
>>>
>>> I'm thinking the same way, but we're stuck with the 'forwarding lookup'
>>> because of the problem with the egress interface pointer, as I mentioned
>>> earlier. Still, it would be interesting to see how much batching helps,
>>> regardless of locking.
>>>
>>> Currently I'm thinking that we should try to replace the radix lookup
>>> with something different (it seems this can be evaluated quickly) and
>>> see what happens. Luigi's performance numbers for our radix code are
>>> simply awful, and there is a patch implementing an alternative trie:
>>> http://info.iet.unipi.it/~luigi/papers/20120601-dxr.pdf
>>> http://www.nxlab.fer.hr/dxr/stable_8_20120824.diff
>>>
>>>> Here, the processing would look less like "grab lock and process to
>>>> completion" and more like "mark and sweep" - ie, we have a list of
>>>> frames that we mark as needing processing and mark as having been
>>>> processed at each layer, so we know where to next dispatch them.
>>>>
>>>> I still have some tool coding to do with PMC before I even think about
>>>> tinkering with this, as I'd like to measure stuff like per-packet
>>>> latency as well as top-level processing overhead (ie,
>>>> CPU_CLK_UNHALTED.THREAD_P / lagg0 TX bytes/pkts, RX bytes/pkts, NIC
>>>> interrupts on that core, etc.)
>>>
>>> That will be great to see!
>>>
>>>> Thanks,
>>>>
>>>> -adrian
>>
>> --
>> Sami Halabi
>> Information Systems Engineer
>> NMS Projects Expert
>> FreeBSD SysAdmin Expert
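
Editor's note: the RX-deferral idea Adrian sketches above maps onto stock
FreeBSD driver primitives: register only a fast interrupt filter, have it do
nothing except enqueue a task, and drain the RX ring from a dedicated
taskqueue thread. The fragment below is a minimal, hypothetical sketch of
that pattern, not ixgbe code; my_softc, my_rxeof() and the attach glue are
invented names, and pinning the taskqueue thread to the interrupt's CPU is
left as a further step.

/*
 * Editor's sketch (hypothetical, not ixgbe code): a fast interrupt filter
 * that only schedules a taskqueue, with all RX work deferred to that queue.
 */
#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/bus.h>
#include <sys/malloc.h>
#include <sys/priority.h>
#include <sys/taskqueue.h>

struct my_softc {
        device_t         dev;
        struct resource *irq_res;
        void            *irq_cookie;
        struct task      rx_task;
        struct taskqueue *rx_tq;
};

/* Hypothetical stand-in for the real descriptor-ring drain loop. */
static void
my_rxeof(struct my_softc *sc)
{
        (void)sc;
}

/* Deferred work: drain the RX ring outside interrupt context. */
static void
my_rx_task(void *arg, int pending)
{
        struct my_softc *sc = arg;

        (void)pending;
        my_rxeof(sc);
}

/* Fast filter: runs in interrupt context and only schedules the task. */
static int
my_intr_filter(void *arg)
{
        struct my_softc *sc = arg;

        taskqueue_enqueue(sc->rx_tq, &sc->rx_task);
        return (FILTER_HANDLED);
}

static int
my_attach_intr(struct my_softc *sc)
{
        TASK_INIT(&sc->rx_task, 0, my_rx_task, sc);
        sc->rx_tq = taskqueue_create_fast("my_rxq", M_NOWAIT,
            taskqueue_thread_enqueue, &sc->rx_tq);
        taskqueue_start_threads(&sc->rx_tq, 1, PI_NET, "%s rxq",
            device_get_nameunit(sc->dev));

        /* Filter only, no ithread handler: all RX work runs in the taskqueue. */
        return (bus_setup_intr(sc->dev, sc->irq_res,
            INTR_TYPE_NET | INTR_MPSAFE, my_intr_filter, NULL, sc,
            &sc->irq_cookie));
}

The point of the split is that the filter must stay tiny, while the taskqueue
thread can batch an arbitrary number of descriptors per wakeup without taking
a per-packet lock.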
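
Editor's note: Marko's DXR + netmap userspace forwarder is unreleased, so the
following is only a rough sketch of what the netmap side of such a tool can
look like, written against the public nm_open()/ring helpers in
<net/netmap_user.h>. The FIB lookup (lookup_nexthop()) is a hypothetical stub
standing in for DXR, only the first ring pair of each port is serviced, and
traffic flows in one direction; none of this is the prototype code.

/*
 * Editor's sketch (hypothetical): bare-bones one-way netmap forwarder.
 */
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <stdint.h>
#include <poll.h>
#include <stdio.h>

/* Hypothetical stand-in for a real route lookup (e.g. DXR). */
static int
lookup_nexthop(const char *frame, unsigned int len)
{
        (void)frame;
        (void)len;
        return (0);             /* always "send out the other port" */
}

/* Move as many frames as possible from an RX ring to a TX ring, zero-copy. */
static void
forward_ring(struct netmap_ring *rxring, struct netmap_ring *txring)
{
        unsigned int rxcur = rxring->cur, txcur = txring->cur;
        unsigned int n = nm_ring_space(rxring);

        if (n > nm_ring_space(txring))
                n = nm_ring_space(txring);

        while (n-- > 0) {
                struct netmap_slot *rs = &rxring->slot[rxcur];
                struct netmap_slot *ts = &txring->slot[txcur];
                uint32_t tmp;

                /* A real forwarder would choose the egress port here. */
                (void)lookup_nexthop(NETMAP_BUF(rxring, rs->buf_idx), rs->len);

                /* Swap buffer indices instead of copying the payload. */
                tmp = ts->buf_idx;
                ts->buf_idx = rs->buf_idx;
                rs->buf_idx = tmp;
                ts->len = rs->len;
                ts->flags |= NS_BUF_CHANGED;
                rs->flags |= NS_BUF_CHANGED;

                rxcur = nm_ring_next(rxring, rxcur);
                txcur = nm_ring_next(txring, txcur);
        }
        rxring->head = rxring->cur = rxcur;
        txring->head = txring->cur = txcur;
}

int
main(int argc, char **argv)
{
        struct nm_desc *in, *out;
        struct pollfd pfd[2];

        if (argc != 3) {
                fprintf(stderr, "usage: %s netmap:ifA netmap:ifB\n", argv[0]);
                return (1);
        }
        in = nm_open(argv[1], NULL, 0, NULL);
        out = nm_open(argv[2], NULL, 0, NULL);
        if (in == NULL || out == NULL) {
                fprintf(stderr, "nm_open failed\n");
                return (1);
        }
        pfd[0].fd = in->fd;
        pfd[0].events = POLLIN;
        pfd[1].fd = out->fd;
        pfd[1].events = POLLOUT;

        for (;;) {
                if (poll(pfd, 2, 1000) <= 0)
                        continue;
                forward_ring(NETMAP_RXRING(in->nifp, in->first_rx_ring),
                    NETMAP_TXRING(out->nifp, out->first_tx_ring));
        }
}

The buf_idx swap plus NS_BUF_CHANGED is what keeps the loop zero-copy: the RX
buffer is handed to the TX ring and the stale TX buffer is recycled for RX,
with netmap picking up the new bindings on the next sync.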