Date: Tue, 4 Dec 2012 11:08:17 -0500
From: John Baldwin <jhb@freebsd.org>
To: Barney Cordoba <barney_cordoba@yahoo.com>
Cc: freebsd-net@freebsd.org
Subject: Re: Latency issues with buf_ring
Message-ID: <201212041108.17645.jhb@freebsd.org>
In-Reply-To: <1353259441.19423.YahooMailClassic@web121605.mail.ne1.yahoo.com>
References: <1353259441.19423.YahooMailClassic@web121605.mail.ne1.yahoo.com>
On Sunday, November 18, 2012 12:24:01 pm Barney Cordoba wrote:
> --- On Thu, 1/19/12, John Baldwin <jhb@freebsd.org> wrote:
> 
> > From: John Baldwin <jhb@freebsd.org>
> > Subject: Latency issues with buf_ring
> > To: net@freebsd.org
> > Cc: "Ed Maste" <emaste@freebsd.org>, "Navdeep Parhar" <np@freebsd.org>
> > Date: Thursday, January 19, 2012, 11:41 AM
> > 
> > The current buf_ring usage in various NIC drivers has a race that can
> > result in really high packet latencies in some cases.  Specifically,
> > the common pattern in an if_transmit routine is to use a try-lock on
> > the queue and if that fails enqueue the packet in the buf_ring and
> > return.  The race, of course, is that the thread holding the lock
> > might have just finished checking the buf_ring and found it empty and
> > be in the process of releasing the lock when the original thread fails
> > the try lock.  If this happens, then the packet queued by the first
> > thread will be stalled until another thread tries to transmit packets
> > for that queue.  Some drivers attempt to handle this race (igb(4)
> > schedules a task to kick the transmit queue if the try lock fails) and
> > others don't (cxgb(4) doesn't handle it at all).  At work this race
> > was triggered very often after upgrading from 7 to 8 with bursty
> > traffic and caused numerous problems, so it is not a rare occurrence
> > and needs to be addressed.
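
To make the race concrete, here is a minimal sketch of the try-lock
if_transmit pattern being described above.  struct my_txq,
my_txq_select(), and my_txq_drain() are hypothetical stand-ins; only
drbr_enqueue() and the mtx_trylock()/mtx_unlock() calls reflect the
primitives such drivers actually use.

/* (standard kernel headers omitted for brevity) */

struct my_txq {
	struct mtx	 mtx;	/* per-queue transmit lock */
	struct buf_ring	*br;	/* per-queue buf_ring */
};

static int
my_if_transmit(struct ifnet *ifp, struct mbuf *m)
{
	struct my_txq *txq = my_txq_select(ifp, m);	/* pick a queue */
	int error;

	if (mtx_trylock(&txq->mtx)) {
		/* We own the queue: enqueue and drain it now. */
		error = drbr_enqueue(ifp, txq->br, m);
		if (error == 0)
			my_txq_drain(ifp, txq);
		mtx_unlock(&txq->mtx);
		return (error);
	}

	/*
	 * Try-lock failed: just enqueue and return.  The race window is
	 * here: the lock holder may already have seen an empty ring and
	 * be releasing the lock, so this packet can sit in the buf_ring
	 * until a later transmit (or, in igb(4), a scheduled task) drains
	 * the queue.
	 */
	return (drbr_enqueue(ifp, txq->br, m));
}
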
> > 
> > (Note, all patches mentioned are against 8)
> > 
> > The first hack I tried to use was to simply always lock the queue
> > after the drbr enqueue if the try lock failed and then drain the
> > queue if needed (www.freebsd.org/~jhb/patches/try_fail.patch).  While
> > this fixed my latency problems, it would seem that this breaks other
> > workloads that the drbr design is trying to optimize.
> > 
> > After further hacking what I came up with was a variant of
> > drbr_enqueue() that would atomically set a 'pending' flag during the
> > enqueue operation.  The first thread to fail the try lock sets this
> > flag (it is told that it set the flag by a new return value
> > (EINPROGRESS) from the enqueue call).  The pending thread then
> > explicitly clears the flag once it acquires the queue lock.  This
> > should prevent multiple threads from stacking up on the queue lock so
> > that if multiple threads are dumping packets into the ring
> > concurrently, all but two (the one currently draining the queue and
> > the one waiting for the lock) can continue to drain the queue.  One
> > downside of this approach though is that each driver has to be
> > changed to make an explicit call to clear the pending flag after
> > grabbing the queue lock if the try lock fails.  This is what I am
> > currently running in production
> > (www.freebsd.org/~jhb/patches/try_fail3.patch).
> > 
> > However, this still results in a lot of duplicated code in each
> > driver that wants to support multiq.  Several folks have expressed a
> > desire to move in a direction where the stack has explicit knowledge
> > of transmit queues, allowing us to hoist some of this duplicated code
> > out of the drivers and up into the calling layer.  After discussing
> > this a bit with Navdeep (np@), the approach I am looking at is to
> > alter the buf_ring code flow a bit to more closely model the older
> > code flow with IFQ and if_start methods.  That is, have the
> > if_transmit methods always enqueue each packet that arrives to the
> > buf_ring and then call an if_start-like method that drains a specific
> > transmit queue.  This approach simplifies a fair bit of driver code
> > and means we can potentially move the enqueue, etc. bits up into the
> > calling layer and instead have drivers provide the per-transmit-queue
> > start routine as the direct function pointer to the upper layers ala
> > if_start.
> > 
> > However, we would still need a way to close the latency race.  I've
> > attempted to do that by inverting my previous 'thread pending' flag.
> > Instead, I make the buf_ring store a 'busy' flag.  This flag is
> > managed by the single-consumer buf_ring dequeue method (that
> > drbr_dequeue() uses).  It is set to true when a packet is removed
> > from the queue while there are more packets pending.  Conversely, if
> > there are no other pending packets then it is set to false.  The
> > assumption is that once a thread starts draining the queue, it will
> > not stop until the queue is empty (or if it has to stop for some
> > other reason such as the transmit ring being full, the driver will
> > restart draining of the queue until it is empty, e.g. after it
> > receives a transmit completion interrupt).  Now when the if_transmit
> > routine enqueues the packet, it will get either a real error, 0 if
> > the packet was enqueued and the queue was not busy, or EINPROGRESS if
> > the packet was enqueued and the queue was busy.  For the EINPROGRESS
> > case the if_transmit routine just returns success.  For the 0 case it
> > does a blocking lock on the queue lock and calls the queue's start
> > routine (note that this means that the busy flag is similar to the
> > old OACTIVE interface flag).  This does mean that in some cases you
> > may have one thread that is sending what was the last packet in the
> > buf_ring holding the lock when another thread blocks, and that the
> > first thread will see the new packet when it loops back around so
> > that the second thread is wasting its time spinning, but in the
> > common case I believe it will give the same parallelism as the
> > current code.  OTOH, there is nothing to prevent multiple threads
> > from "stacking up" in the new approach.  At least the try_fail3 patch
> > ensured only one thread at a time would ever potentially block on the
> > queue lock.
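
To illustrate the 'busy' flag idea, here is a minimal sketch of what an
if_transmit routine might look like under the proposed scheme.  The
EINPROGRESS return comes from the patched drbr enqueue described above
(it is not the stock drbr_enqueue() behavior), and my_txq_select() and
my_txq_start() are the same hypothetical stand-ins used in the earlier
sketch.

static int
my_if_transmit(struct ifnet *ifp, struct mbuf *m)
{
	struct my_txq *txq = my_txq_select(ifp, m);
	int error;

	/* Always enqueue; the patched enqueue reports the queue's state. */
	error = drbr_enqueue(ifp, txq->br, m);
	if (error == EINPROGRESS) {
		/* Enqueued; another thread is already draining the queue. */
		return (0);
	}
	if (error != 0) {
		/* Real error (e.g. the ring was full). */
		return (error);
	}

	/*
	 * Enqueued and the queue was not busy: take the queue lock (a
	 * blocking lock, not a try-lock) and drain the queue ourselves,
	 * much like the old IFQ/if_start flow.
	 */
	mtx_lock(&txq->mtx);
	my_txq_start(ifp, txq);
	mtx_unlock(&txq->mtx);
	return (0);
}
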
> > 
> > Another approach might be to replace the 'busy' flag with the
> > 'thread pending' flag from try_fail3.patch, but to clear the 'thread
> > pending' flag anytime the dequeue method is called rather than using
> > an explicit 'clear pending' method.  (Hadn't thought of that until
> > writing this e-mail.)  That would perhaps prevent multiple threads
> > from waiting on the queue lock.
> > 
> > Note that the 'busy' approach (or the modification I mentioned
> > above) does rely on the assumption I stated above, i.e. once a
> > driver starts draining a queue, it will drain it until empty unless
> > it hits an "error" condition (link went down, transmit ring full,
> > etc.).  If it hits an "error" condition, the driver is responsible
> > for restarting transmit when the condition clears.  I believe our
> > drivers already work this way now.
> > 
> > The 'busy' patch is at http://www.freebsd.org/~jhb/patches/drbr.patch
> > 
> > -- 
> > John Baldwin
> 
> Q1: Has this been corrected?

No.  I have not yet been able to raise a meaningful discussion about
possible solutions to this.

> Q2: Are there any case studies or benchmarks for buf_ring, or is it
> just blindly being used because someone claimed it was better and
> offered it for free?  One of the points of locking is to avoid race
> conditions, so the fact that you have races in a supposed lock-less
> scheme seems more than just ironic.

The buf_ring author claims it has benefits in high pps workloads.  I am
not aware of any benchmarks, etc.

-- 
John Baldwin