From owner-freebsd-current@FreeBSD.ORG Tue Aug 6 22:55:43 2013 Return-Path: Delivered-To: current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 90825B9F; Tue, 6 Aug 2013 22:55:43 +0000 (UTC) (envelope-from rizzo.unipi@gmail.com) Received: from mail-la0-x22e.google.com (mail-la0-x22e.google.com [IPv6:2a00:1450:4010:c03::22e]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 002AE2957; Tue, 6 Aug 2013 22:55:41 +0000 (UTC) Received: by mail-la0-f46.google.com with SMTP id eh20so742602lab.19 for ; Tue, 06 Aug 2013 15:55:39 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=qS39UwrZbO6GBzSgTi/tEiqoEzpCy/+oszbaWnu6jlM=; b=D0G/WC1+bKvlGwGryXpLFFoX7FSVEaSimjckU1TV0zjA4GPaIiWQHOlFZf/aRjDzqq vyM+ayqJzGFg2Qoks4a5it3K+sAJajJqiQzToUuGC2Ax3xYdTF+8Qzd/ALkp2Ntmqh3w 6Nkp7NIWOPAIIDL7ZKoiZ6yTAEUxp0ro9Kvyr6PRq+2ei4xmZOdzV/uFHSs2QbBOwzbL FG7j7eoT63n9nBm2Qlf3UwKIBGE/y8XUcwf8oVPMRt7+K2UXWDblR3bA8HN3PxWHYAkh ipknG4ApY6io8XXHIaujZI1dUrL0971lw4DUyxQeJaVuTx0TBz9IxA8Wt8cXO+HO/mOn PZ2w== MIME-Version: 1.0 X-Received: by 10.112.134.202 with SMTP id pm10mr423807lbb.79.1375829739818; Tue, 06 Aug 2013 15:55:39 -0700 (PDT) Sender: rizzo.unipi@gmail.com Received: by 10.114.200.165 with HTTP; Tue, 6 Aug 2013 15:55:39 -0700 (PDT) In-Reply-To: <520127C3.3020101@freebsd.org> References: <20130805082307.GA35162@onelab2.iet.unipi.it> <2034715395.855.1375714772487.JavaMail.root@daemoninthecloset.org> <51FFDD1E.1000206@FreeBSD.org> <5200136C.8000201@freebsd.org> <20130805215319.GA43271@onelab2.iet.unipi.it> <520127C3.3020101@freebsd.org> Date: Wed, 7 Aug 2013 00:55:39 +0200 X-Google-Sender-Auth: jbbZQ-QJDDC23Bo6q3t8fnY-dD0 Message-ID: Subject: Re: [net] protecting interfaces from races between control and data ? From: Luigi Rizzo To: Andre Oppermann Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: Adrian Chadd , current@freebsd.org, Bryan Venteicher , Navdeep Parhar , net@freebsd.org, Giuseppe Lettieri X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 06 Aug 2013 22:55:43 -0000 thanks for the explanations and for experimenting with the various alternatives. I started this thread just to understand whether something was already in place, and to make sure that what I do with netmap is not worse than the situation we have now. I guess that while the best solution comes out, a quick and non intrusive workaround is at least follow the approach that Bryan suggested. cheers luigi On Tue, Aug 6, 2013 at 6:43 PM, Andre Oppermann wrote: > On 05.08.2013 23:53, Luigi Rizzo wrote: > >> On Mon, Aug 05, 2013 at 11:04:44PM +0200, Andre Oppermann wrote: >> >>> On 05.08.2013 19:36, Luigi Rizzo wrote: >>> >> ... >> >>> >>> [picking a post at random to reply in this thread] >>> >>>> tell whether or not we should bail out). >>>> >>> >>> Ideally we don't want to have any locks in the RX and TX path at all. >>> >> >> Ok i have read your proposal below but there are a few things >> I do not completely understand, below -- essentially I do not >> understand whether the removal of IFF_DRV_RUNNING (or equivalent) >> forces you to replace the ifp->new_tx_function() with something >> else when you want to do an "ifconfig down" >> > > Sorry, it's IFF_DRV_OACTIVE that'll go away, not IFF_DRV_RUNNING. > It's of no use with multi-queue anyways. Though we may get more > differentiated states for interface availability. > > > Specifically, here are my doubts: >> >> - Considering that the rxq lock is rarely contended >> (only when the control path is active, which in your approach >> below requires killing and restarting the ithread), >> and is acquired/released only once per interrupt/batch, >> i am under the impression that the per-queue RX lock is >> not a performance problem and makes it easier to reason >> on the code (and does not require different approach >> for MSI-x and other cases). >> > > There are two important things here: 1) there must only be one > (i)thread per RX queue to prevent lock contention; 2) even with > a non-contended lock there are effects felt by the other cores > or CPUs like bus lock cycles which are considerable. Stopping > and restarting the ithread/taskqueue in those cases that require > more involved (hardware) reconfiguration isn't much of an overhead > compared to per-packet or per-packet-batch locking in the hot path. > > > - in the tx datapath, do you want to acquire the txq lock >> before or after calling ifp->new_tx_function() ? >> (btw how it compares to ifp->if_start or ifp->if_transmit ?) >> > > Struct ifnet is going to be split into two parts, the stack owned > part and the driver owned part. The stack will never fumble the > driver owned parts and the driver must only change the stack owned > parts through accessor functions. (This split has also been proposed > by Juniper quite some time ago but for different reasons). > > The driver supplies a TX frame transmit function (mostly like if_transmit > today) which does all locking and multi-queue handling internally (driver > owned. This gives driver writers the freedom to better adjust to different > hardware communication methods as they become available, like we witnessed > a couple of times in the past. > > If the driver is multi-queue capable it takes the rss-hash from the packet > header > to select an appropriate queue which may have its own dedicated lock. If > there > is only one queue then it will obtain that lock which may see some > contention > when multiple cores try to send at the same time. The driver may do more > extensive locking, however that likely comes at the expense of performance. > > Typically on modern NICs the TX function will be a thin locked wrapper > around > the DMA enqueue function. For or older or other kinds of interfaces it may > implement full internal soft queuing (as the IFQ did). An efficient > generic > implementation of those will be provided for driver to make use of. > > > > If you do it within the tx handler then you lose the ability > > of replacing the handler when you do a reconfiguration, > > because grabbing the tx lock in the control path does not tell > > you whether anyone is in the handler. > > If you do it outside, then the driver should also expose the locks, > > or some locking function. > > The stack calls the transmit function without any driver-specific locks > held. > It'll make sure though that the stack-ifnet doesn't go away in between > probably > by using cheap rmlocks. > > The drivers control path gets a function to ensure that the stack has > drained > all writers and it is safe to reconfigure (as in callout_drain()). Not all > hardware and control path changes necessarily require a reinitialization. > > The stack-ifnet shadows some information, like interface state, and may do > its > own independent SMP optimizations to avoid contention. > > > Overall, it seems to me that keeping the IFF_DRV_RUNNING >> flag is a lot more practical, not to mention fewer modifications >> to the code. >> > > Generally we want to optimize for the fast packet path, reduce any > friction we > can, and take a hit on reconfig being more "expensive" as it is much less > frequent. > > Mind you not all of this is worked out in detail yet and may be subject to > further changes and refinements as more benchmarking of different > approaches > is performed. > > -- > Andre > > [PS: I'm currently on summer vacation and write this while having access] > > > [description of Andre's proposal below, for reference] >> >> cheers >> luigi >> >> Every queue has its own separate RX interrupt vector which is serviced by >>> a single dedicated ithread (or taskqueue) which does while(1) for work on >>> the RX ring only going to sleep when it reaches the end of work on the >>> ring. >>> It is then woken up by an interrupt to resume work. To prevent a >>> live-lock >>> on the CPU it would periodically yield to allow other kernel and >>> user-space >>> threads to run in between. Each queue's ithread (taskqueue) can be >>> pinned >>> to a particular core or float with a certain stickiness. To reconfigure >>> the >>> card (when the RX ring is affected) the ithread (taskqueue) is terminated >>> and after the changes restarted again. This is done by the control path >>> and allows us to run RX completely lock free because there is only ever >>> one >>> ithread accessing the RX queue doing away with the need for further >>> locking >>> synchronization. >>> >> >> ok I admit i did not think of killing and restarting the ithread, >> but i wonder how >> >>> RX with MSI or legacy interrupts: >>> >>> Here multi-queue is impeded because of some synchronization issues. >>> Depending >>> on the hardware synchronization model the ithreads again do while(1) but >>> have >>> to be made runnable by the interrupt handler when new work has been >>> indicated. >>> >>> TX in general: >>> >>> This is a bit more tricky as we have the hard requirement of packet >>> ordering >>> for individual sessions (tcp and others). That means we must use the >>> same >>> queue for all packets of the same session. This makes running TX >>> non-locked >>> a bit tricky because when we pin a TX queue to a core we must switch to >>> that >>> core first before adding anything to the queue. If the packet was >>> generated >>> on that core there is no overhead, OTOH if not there is the scheduler >>> over- >>> head and some cache losses. Ensuring that a the entire TX path, possibly >>> including user-space (write et al) happens on the same core is difficult >>> and >>> may have its own inefficiencies. The number of TX queue vs. number of >>> cores >>> is another point of contention. To make the 1:1 scheme work well, one >>> must >>> have as many queues as cores. >>> >>> A more scalable and generic approach doesn't bind TX queues to any >>> particular >>> core and is covers each by its own lock to protect the TX ring enqueue. >>> To >>> prevent false lock cache line sharing each TX structure should be on its >>> own >>> cache line. As each session has an affinity hash they become distributed >>> across the TX queues significantly reducing contention. >>> >>> The TX complete handling freeing the mbuf(s) is done the same way as for >>> the >>> RX ring with a dedicated ithread (taskqueue) for each. Also bpf should >>> hook >>> in here (the packet has been sent) instead of at the TX enqueue time. >>> >>> The whole concept of IFF_DRV_RUNNING will go away along with the IFQ >>> macros. >>> Each driver then provides a direct TX function pointer which is put into >>> a >>> decoupled ifnet TX field for use by the stack. Under normal operation >>> all >>> interface TX goes directly into TX DMA ring. The particular multi-queue >>> and locking model is decided by the driver. The kernel provides a couple >>> of common optimized abstractions for use by all drivers that want/need it >>> to avoid code and functionality duplication. When things like ALTQ are >>> activated on an interface the ifnet TX function pointer is replaced with >>> the appropriate intermediate function pointer which eventually will call >>> the drivers TX function. The intermediate TX packet handler (ALTQ) can >>> do its own additional locking on top of the drivers TX locking. >>> >>> Control path: >>> >>> The control path is separate from the high speed TX and RX data paths. >>> It has its own overall lock. Multiple locks typically don't make sense >>> here. If it needs to modify, reconfigure, or newly set up the TX or RX >>> rings then it has to stop the RX ithreads (taskqueues) and to lock the >>> TX queue locks before it can do that. After the changes are made and >>> stable packet processing can continue. >>> >>> I've adjusted / heavily modified fxp, bge and igb drivers in my >>> tcp_workqueue >>> branch to test these concepts. Not all of it is committed or fully up >>> to date. >>> >>> -- >>> Andre >>> >>> >> >> > -- -----------------------------------------+------------------------------- Prof. Luigi RIZZO, rizzo@iet.unipi.it . Dip. di Ing. dell'Informazione http://www.iet.unipi.it/~luigi/ . Universita` di Pisa TEL +39-050-2211611 . via Diotisalvi 2 Mobile +39-338-6809875 . 56122 PISA (Italy) -----------------------------------------+-------------------------------