From owner-freebsd-net@FreeBSD.ORG Tue Sep 24 08:47:29 2013
Date: Tue, 24 Sep 2013 09:47:24 +0100
From: Joe Holden
To: freebsd-net@freebsd.org
Subject: Re: Network stack changes
Message-ID: <5241519C.9040908@rewt.org.uk>
In-Reply-To: <201309240958.06172.zec@fer.hr>
References: <521E41CB.30700@yandex-team.ru> <523F4F14.9090404@yandex-team.ru>
 <201309240958.06172.zec@fer.hr>
List-Id: Networking and TCP/IP with FreeBSD

On 24/09/2013 08:58, Marko Zec wrote:
> On Tuesday 24 September 2013 00:46:46 Sami Halabi wrote:
>> Hi,
>>
>>> http://info.iet.unipi.it/~luigi/papers/20120601-dxr.pdf
>>> http://www.nxlab.fer.hr/dxr/stable_8_20120824.diff
>>
>> I've tried the diff on 10-CURRENT; it applied cleanly, but I got errors
>> compiling the new kernel... Is there any work underway to make it build?
>> I'd love to test it.
>
> Even if you made it compile on current, you could only run synthetic tests
> measuring lookup performance using streams of random keys, as outlined in
> the paper. (By the way, the paper at Luigi's site is an older draft; the
> final version with slightly revised benchmarks is available here:
> http://www.sigcomm.org/sites/default/files/ccr/papers/2012/October/2378956-2378961.pdf)
>
> That is, the code only hooks into the routing API for testing purposes,
> but is completely disconnected from the forwarding path.

Aha! How much work would it take to make it usable for actual forwarding?

> We have a prototype in the works which combines DXR with netmap in
> userspace and is capable of sustaining well above line-rate forwarding
> with full-sized BGP views using Intel 10G cards on commodity multicore
> machines. The work stalled somewhat during the summer, but I plan to wrap
> it up and release the code by the end of this year. With recent advances
> in netmap it might also be feasible to merge DXR and netmap entirely
> inside the kernel, but I've not explored that path yet...

Mmm, forwarding using netmap would be pretty awesome...

> Marko
>
>> Sami
>>
>> On Sun, Sep 22, 2013 at 11:12 PM, Alexander V. Chernikov
>> <melifaro@yandex-team.ru> wrote:
>>> On 29.08.2013 15:49, Adrian Chadd wrote:
>>>> Hi,
>>>
>>> Hello Adrian!
>>> I'm very sorry for the looong reply.
>>>
>>>> There's a lot of good stuff to review here, thanks!
>>>>
>>>> Yes, the ixgbe RX lock needs to die in a fire. It's kinda pointless to
>>>> keep locking things like that on a per-packet basis. We should be able
>>>> to do this in a cleaner way - we can defer RX into a CPU-pinned
>>>> taskqueue and convert the interrupt handler to a fast handler that
>>>> just schedules that taskqueue. We can ignore the ithread entirely
>>>> here.
>>>>
>>>> What do you think?
>>>
>>> Well, it sounds good :) But performance numbers and Jack's opinion are
>>> more important :)
>>>
>>> Are you going to Malta?
>>>
>>>> Totally pie-in-the-sky handwaving at this point:
>>>>
>>>> * create an array of mbuf pointers for completed mbufs;
>>>> * populate the mbuf array;
>>>> * pass the array up to ether_demux().
>>>>
>>>> For vlan handling, it may end up populating its own list of mbufs to
>>>> push up to ether_demux(). So maybe we should extend the API to have a
>>>> bitmap of packets to actually handle from the array, so we can pass up
>>>> a larger array of mbufs, note which ones are for this destination, and
>>>> then the upcall can mark which frames it has consumed.
>>>>
>>>> I specifically wonder how much work/benefit we may see by doing:
>>>>
>>>> * batching packets into lists so various steps can batch-process
>>>> things rather than run to completion;
>>>> * batching the processing of a list of frames under a single lock
>>>> instance - eg, if the forwarding code could do the forwarding lookup
>>>> for 'n' packets under a single lock, then pass that list of frames up
>>>> to inet_pfil_hook() to do the work under one lock, etc, etc.
>>>
>>> I'm thinking the same way, but we're stuck with the 'forwarding lookup'
>>> because of the problem with the egress interface pointer, as I mentioned
>>> earlier. Still, it would be interesting to see how much batching helps,
>>> regardless of locking.
>>>
>>> Currently I'm thinking that we should try to replace the radix lookup
>>> with something different (it seems this can be evaluated quickly) and
>>> see what happens. Luigi's performance numbers for our radix code are
>>> simply awful, and there is a patch implementing an alternative trie:
>>> http://info.iet.unipi.it/~luigi/papers/20120601-dxr.pdf
>>> http://www.nxlab.fer.hr/dxr/stable_8_20120824.diff
>>>
>>>> Here, the processing would look less like "grab lock and process to
>>>> completion" and more like "mark and sweep" - ie, we have a list of
>>>> frames that we mark as needing processing and mark as having been
>>>> processed at each layer, so we know where to next dispatch them.
>>>>
>>>> I still have some tool coding to do with PMC before I even think about
>>>> tinkering with this, as I'd like to measure stuff like per-packet
>>>> latency as well as top-level processing overhead (ie,
>>>> CPU_CLK_UNHALTED.THREAD_P / lagg0 TX bytes/pkts, RX bytes/pkts, NIC
>>>> interrupts on that core, etc.)
>>>
>>> That will be great to see!
>>>
>>>> Thanks,
>>>>
>>>> -adrian
>>
>> --
>> Sami Halabi
>> Information Systems Engineer
>> NMS Projects Expert
>> FreeBSD SysAdmin Expert
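
Editor's note: the RX-deferral idea Adrian sketches above maps onto stock
FreeBSD driver primitives: register only a fast interrupt filter, have it do
nothing except enqueue a task, and drain the RX ring from a dedicated
taskqueue thread. The fragment below is a minimal, hypothetical sketch of
that pattern, not ixgbe code; my_softc, my_rxeof() and the attach glue are
invented names, and pinning the taskqueue thread to the interrupt's CPU is
left as a further step.

/*
 * Editor's sketch (hypothetical, not ixgbe code): a fast interrupt filter
 * that only schedules a taskqueue, with all RX work deferred to that queue.
 */
#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/bus.h>
#include <sys/malloc.h>
#include <sys/priority.h>
#include <sys/taskqueue.h>

struct my_softc {
        device_t         dev;
        struct resource *irq_res;
        void            *irq_cookie;
        struct task      rx_task;
        struct taskqueue *rx_tq;
};

/* Hypothetical stand-in for the real descriptor-ring drain loop. */
static void
my_rxeof(struct my_softc *sc)
{
        (void)sc;
}

/* Deferred work: drain the RX ring outside interrupt context. */
static void
my_rx_task(void *arg, int pending)
{
        struct my_softc *sc = arg;

        (void)pending;
        my_rxeof(sc);
}

/* Fast filter: runs in interrupt context and only schedules the task. */
static int
my_intr_filter(void *arg)
{
        struct my_softc *sc = arg;

        taskqueue_enqueue(sc->rx_tq, &sc->rx_task);
        return (FILTER_HANDLED);
}

static int
my_attach_intr(struct my_softc *sc)
{
        TASK_INIT(&sc->rx_task, 0, my_rx_task, sc);
        sc->rx_tq = taskqueue_create_fast("my_rxq", M_NOWAIT,
            taskqueue_thread_enqueue, &sc->rx_tq);
        taskqueue_start_threads(&sc->rx_tq, 1, PI_NET, "%s rxq",
            device_get_nameunit(sc->dev));

        /* Filter only, no ithread handler: all RX work runs in the taskqueue. */
        return (bus_setup_intr(sc->dev, sc->irq_res,
            INTR_TYPE_NET | INTR_MPSAFE, my_intr_filter, NULL, sc,
            &sc->irq_cookie));
}

The point of the split is that the filter must stay tiny, while the taskqueue
thread can batch an arbitrary number of descriptors per wakeup without taking
a per-packet lock.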
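
Editor's note: Marko's DXR + netmap userspace forwarder is unreleased, so the
following is only a rough sketch of what the netmap side of such a tool can
look like, written against the public nm_open()/ring helpers in
<net/netmap_user.h>. The FIB lookup (lookup_nexthop()) is a hypothetical stub
standing in for DXR, only the first ring pair of each port is serviced, and
traffic flows in one direction; none of this is the prototype code.

/*
 * Editor's sketch (hypothetical): bare-bones one-way netmap forwarder.
 */
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <stdint.h>
#include <poll.h>
#include <stdio.h>

/* Hypothetical stand-in for a real route lookup (e.g. DXR). */
static int
lookup_nexthop(const char *frame, unsigned int len)
{
        (void)frame;
        (void)len;
        return (0);             /* always "send out the other port" */
}

/* Move as many frames as possible from an RX ring to a TX ring, zero-copy. */
static void
forward_ring(struct netmap_ring *rxring, struct netmap_ring *txring)
{
        unsigned int rxcur = rxring->cur, txcur = txring->cur;
        unsigned int n = nm_ring_space(rxring);

        if (n > nm_ring_space(txring))
                n = nm_ring_space(txring);

        while (n-- > 0) {
                struct netmap_slot *rs = &rxring->slot[rxcur];
                struct netmap_slot *ts = &txring->slot[txcur];
                uint32_t tmp;

                /* A real forwarder would choose the egress port here. */
                (void)lookup_nexthop(NETMAP_BUF(rxring, rs->buf_idx), rs->len);

                /* Swap buffer indices instead of copying the payload. */
                tmp = ts->buf_idx;
                ts->buf_idx = rs->buf_idx;
                rs->buf_idx = tmp;
                ts->len = rs->len;
                ts->flags |= NS_BUF_CHANGED;
                rs->flags |= NS_BUF_CHANGED;

                rxcur = nm_ring_next(rxring, rxcur);
                txcur = nm_ring_next(txring, txcur);
        }
        rxring->head = rxring->cur = rxcur;
        txring->head = txring->cur = txcur;
}

int
main(int argc, char **argv)
{
        struct nm_desc *in, *out;
        struct pollfd pfd[2];

        if (argc != 3) {
                fprintf(stderr, "usage: %s netmap:ifA netmap:ifB\n", argv[0]);
                return (1);
        }
        in = nm_open(argv[1], NULL, 0, NULL);
        out = nm_open(argv[2], NULL, 0, NULL);
        if (in == NULL || out == NULL) {
                fprintf(stderr, "nm_open failed\n");
                return (1);
        }
        pfd[0].fd = in->fd;
        pfd[0].events = POLLIN;
        pfd[1].fd = out->fd;
        pfd[1].events = POLLOUT;

        for (;;) {
                if (poll(pfd, 2, 1000) <= 0)
                        continue;
                forward_ring(NETMAP_RXRING(in->nifp, in->first_rx_ring),
                    NETMAP_TXRING(out->nifp, out->first_tx_ring));
        }
}

The buf_idx swap plus NS_BUF_CHANGED is what keeps the loop zero-copy: the RX
buffer is handed to the TX ring and the stale TX buffer is recycled for RX,
with netmap picking up the new bindings on the next sync.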