From: Andre Oppermann <andre@freebsd.org>
Date: Mon, 05 Aug 2013 23:04:44 +0200
To: Luigi Rizzo
Cc: Adrian Chadd, current@freebsd.org, Bryan Venteicher, Navdeep Parhar,
 net@freebsd.org, Giuseppe Lettieri
Subject: Re: [net] protecting interfaces from races between control and data ?

On 05.08.2013 19:36, Luigi Rizzo wrote:
> On Mon, Aug 5, 2013 at 7:17 PM, Adrian Chadd wrote:
>
>> I'm travelling back to San Jose today; poke me tomorrow and I'll brain
>> dump what I did in ath(4) and the lessons learnt.
>>
>> The TL;DR version - you don't want to grab an extra lock in the
>> read/write paths as that slows things down. Reuse the same per-queue
>> TX/RX lock and have:
>>
>> * a reset flag that is set when something is resetting; that says to
>> the queue "don't bother processing anything, just dive out";
>> * 'I am doing TX/RX' flags per queue that are set at the start of
>> TX/RX servicing and cleared at the end; that way the reset code knows
>> if there's something pending;
>> * have the reset path grab each lock, set the 'reset' flag on each,
>> then walk each queue again and make sure they're all marked as 'not
>> doing TX/RX'. At that point the reset can occur, then the flag can be
>> cleared, then TX/RX can resume.
>
> [picking a post at random to reply in this thread]
>
> So this is slightly different from what Bryan suggested (and you endorsed)
> before, as in that case there was a single 'reset' flag IFF_DRV_RUNNING
> protected by the 'core' lock, then a nested round on all tx and rx locks
> to make sure that all customers have seen it.
> In both cases the tx and rx paths only need the per-queue lock.
>
> As I see it, having a per-queue reset flag removes the need for nesting
> core + queue locks, but since this is only in the control path perhaps
> it is not a big deal (and it is better to have a single place to look at
> to tell whether or not we should bail out).
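To have something concrete to compare against, the per-queue flag scheme
described above would look roughly like the sketch below.  This is only an
illustration; the structure, flag and helper names are invented and are not
what ath(4) actually uses.

/*
 * Sketch only: per-queue reset/busy flags reusing the existing
 * per-queue lock.  All names here are invented for illustration.
 */
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

struct drv_queue {
        struct mtx      q_mtx;          /* existing per-queue TX/RX lock */
        int             q_reset;        /* control path wants a reset */
        int             q_busy;         /* TX/RX servicing in progress */
        /* ... descriptor ring, stats, ... */
};

/* Data path: bail out early if a reset is pending. */
static void
drv_queue_service(struct drv_queue *q)
{
        mtx_lock(&q->q_mtx);
        if (q->q_reset) {
                mtx_unlock(&q->q_mtx);
                return;                 /* don't touch the ring */
        }
        q->q_busy = 1;
        mtx_unlock(&q->q_mtx);

        /* ... process the ring without holding the lock ... */

        mtx_lock(&q->q_mtx);
        q->q_busy = 0;
        mtx_unlock(&q->q_mtx);
}

/* Reset path: flag every queue, then wait until none is busy. */
static void
drv_reset(struct drv_queue *qs, int nq)
{
        int busy, i;

        for (i = 0; i < nq; i++) {
                mtx_lock(&qs[i].q_mtx);
                qs[i].q_reset = 1;
                mtx_unlock(&qs[i].q_mtx);
        }
        do {
                busy = 0;
                for (i = 0; i < nq; i++) {
                        mtx_lock(&qs[i].q_mtx);
                        busy |= qs[i].q_busy;
                        mtx_unlock(&qs[i].q_mtx);
                }
        } while (busy);

        /* ... do the reset, clear q_reset on each queue, resume ... */
}

The point is that the data path only ever touches its own queue lock; the
reset path pays the cost of walking all queues.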
Ideally we don't want to have any locks in the RX and TX path at all.
For the TX path this is not easy to achieve, but for RX it isn't difficult
at all provided the correct model is used.  My upcoming stack/driver
proposal is along these lines:

RX with MSI-X:
Every queue has its own separate RX interrupt vector which is serviced by a
single dedicated ithread (or taskqueue).  It runs a while(1)-style loop over
the RX ring and only goes to sleep when it reaches the end of the work on
the ring.  It is then woken up by an interrupt to resume work.  To prevent a
live-lock on the CPU it periodically yields to allow other kernel and
user-space threads to run in between.  Each queue's ithread (taskqueue) can
be pinned to a particular core or float with a certain stickiness.  To
reconfigure the card (when the RX ring is affected) the ithread (taskqueue)
is terminated and restarted after the changes.  This is done by the control
path and allows us to run RX completely lock free: there is only ever one
ithread accessing the RX queue, doing away with the need for any further
locking or synchronization.  (A sketch of such an RX loop follows below.)

RX with MSI or legacy interrupts:
Here multi-queue is impeded by some synchronization issues.  Depending on
the hardware synchronization model the ithreads again loop over their rings
but have to be made runnable by the interrupt handler when new work has been
indicated.

TX in general:
This is a bit more tricky as we have the hard requirement of packet ordering
for individual sessions (TCP and others).  That means we must use the same
queue for all packets of the same session.  This makes running TX without
locks tricky: if we pin a TX queue to a core we must switch to that core
first before adding anything to the queue.  If the packet was generated on
that core there is no overhead; if not, there is the scheduler overhead and
some cache losses.  Ensuring that the entire TX path, possibly including
user-space (write et al.), happens on the same core is difficult and may
have its own inefficiencies.  The number of TX queues vs. the number of
cores is another point of contention; to make the 1:1 scheme work well, one
must have as many queues as cores.

A more scalable and generic approach doesn't bind TX queues to any
particular core and instead covers each queue with its own lock protecting
the TX ring enqueue.  To prevent false cache-line sharing of the locks, each
TX queue structure should sit on its own cache line.  As each session has an
affinity hash, the sessions become distributed across the TX queues,
significantly reducing contention.  (See the TX enqueue sketch below.)  TX
completion handling, which frees the mbuf(s), is done the same way as for
the RX ring, with a dedicated ithread (taskqueue) for each queue.  Also, bpf
should hook in here (when the packet has actually been sent) instead of at
TX enqueue time.

The whole concept of IFF_DRV_RUNNING will go away, along with the IFQ
macros.  Each driver then provides a direct TX function pointer which is put
into a decoupled ifnet TX field for use by the stack.  Under normal
operation all interface TX goes directly onto the TX DMA ring; the
particular multi-queue and locking model is decided by the driver.  The
kernel provides a couple of common, optimized abstractions for all drivers
that want or need them, to avoid duplication of code and functionality.
When something like ALTQ is activated on an interface, the ifnet TX function
pointer is replaced with the appropriate intermediate function pointer,
which eventually calls the driver's TX function.  The intermediate TX packet
handler (ALTQ) can do its own additional locking on top of the driver's TX
locking.
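Roughly, the per-queue RX ithread described above could look like the
sketch below.  This only illustrates the model; drv_rxq_harvest(),
drv_rxq_intr_enable()/disable() and friends are invented names, and a real
driver has to be careful about lost wakeups when it re-enables the
interrupt before sleeping.

/*
 * Sketch only: one dedicated RX thread per MSI-X queue, no RX lock.
 * rxq_stop is only changed by the control path while the thread is
 * parked; all helper functions are invented for illustration.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/proc.h>
#include <sys/kthread.h>

struct drv_rxq {
        volatile int    rxq_stop;       /* control path asks us to exit */
        /* ... RX descriptor ring, ifnet pointer, ... */
};

static void
drv_rxq_thread(void *arg)
{
        struct drv_rxq *rxq = arg;

        while (!rxq->rxq_stop) {
                /*
                 * Drain completed descriptors.  No lock is needed as
                 * this thread is the only consumer of the ring.  Yield
                 * periodically in here to avoid live-locking the CPU.
                 */
                while (drv_rxq_harvest(rxq) > 0)
                        ;
                /*
                 * Re-enable the queue's interrupt and re-check the ring
                 * before sleeping so a wakeup is not lost.
                 */
                drv_rxq_intr_enable(rxq);
                if (!drv_rxq_more_work(rxq))
                        tsleep(rxq, PWAIT, "rxwait", 0);
        }
        kthread_exit();
}

/* MSI-X handler for this queue: just kick the thread. */
static void
drv_rxq_intr(void *arg)
{
        struct drv_rxq *rxq = arg;

        drv_rxq_intr_disable(rxq);
        wakeup(rxq);
}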
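The non-pinned TX enqueue side might then look roughly like this.  Again
just a sketch: DRV_NTXQ, drv_txq_encap() and the softc layout are made up,
but the idea is that the mbuf's flow hash picks the queue, each queue has
its own enqueue lock, and each queue structure sits on its own cache line.

/*
 * Sketch only: flow-hash based TX queue selection with a per-queue
 * enqueue lock.  Names are invented for illustration.
 */
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>
#include <net/if.h>
#include <net/if_var.h>

#define DRV_NTXQ        8               /* example queue count */

struct drv_txq {
        struct mtx      txq_mtx;        /* protects the ring enqueue */
        /* ... TX descriptor ring state ... */
} __aligned(CACHE_LINE_SIZE);           /* avoid false sharing */

struct drv_softc {
        struct drv_txq  sc_txq[DRV_NTXQ];
};

/* The direct TX function the driver would hang off the ifnet. */
static int
drv_transmit(struct ifnet *ifp, struct mbuf *m)
{
        struct drv_softc *sc = ifp->if_softc;
        struct drv_txq *txq;
        int error;

        /* Same flow hash -> same queue, preserving per-session order. */
        txq = &sc->sc_txq[m->m_pkthdr.flowid % DRV_NTXQ];

        mtx_lock(&txq->txq_mtx);
        error = drv_txq_encap(txq, m);  /* place the packet on the DMA ring */
        mtx_unlock(&txq->txq_mtx);

        return (error);
}

Something like ALTQ would then simply substitute its own function for
drv_transmit() in the ifnet and call into it after doing its own queueing.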
Control path:
The control path is separate from the high speed TX and RX data paths.  It
has its own overall lock; multiple locks typically don't make sense here.
If it needs to modify, reconfigure, or newly set up the TX or RX rings, it
has to stop the RX ithreads (taskqueues) and take the TX queue locks before
it can do that.  After the changes are made and stable, packet processing
can continue.

I've adjusted / heavily modified the fxp, bge and igb drivers in my
tcp_workqueue branch to test these concepts.  Not all of it is committed or
fully up to date.

--
Andre