From: Andre Oppermann <andre@freebsd.org>
Date: Sun, 18 Nov 2012 22:49:39 +0100
To: Barney Cordoba
Cc: freebsd-net@freebsd.org, John Baldwin
Subject: Re: Latency issues with buf_ring
Message-ID: <50A957F3.7020003@freebsd.org>
In-Reply-To: <1353259441.19423.YahooMailClassic@web121605.mail.ne1.yahoo.com>

On 18.11.2012 18:24, Barney Cordoba wrote:
>
> --- On Thu, 1/19/12, John Baldwin wrote:
>
>> From: John Baldwin
>> Subject: Latency issues with buf_ring
>> To: net@freebsd.org
>> Cc: "Ed Maste" , "Navdeep Parhar"
>> Date: Thursday, January 19, 2012, 11:41 AM
>>
>> The current buf_ring usage in various NIC drivers has a race that can result in really high packet latencies in some cases. Specifically, the common pattern in an if_transmit routine is to use a try-lock on the queue and if that fails enqueue the packet in the buf_ring and return. The race, of course, is that the thread holding the lock might have just finished checking the buf_ring and found it empty and be in the process of releasing the lock when the original thread fails the try lock. If this happens, then the packet queued by the first thread will be stalled until another thread tries to transmit packets for that queue. Some drivers attempt to handle this race (igb(4) schedules a task to kick the transmit queue if the try lock fails) and others don't (cxgb(4) doesn't handle it at all). At work this race was triggered very often after upgrading from 7 to 8 with bursty traffic and caused numerous problems, so it is not a rare occurrence and needs to be addressed.
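[Editor's illustration: a minimal sketch of the try-lock if_transmit pattern and the race window described above. The foo_* driver names, softc layout, and the foo_txq_drain() helper are hypothetical and not taken from any of the patches mentioned in this thread.]

/* Common (racy) multiqueue transmit pattern, simplified. */
static int
foo_transmit(struct ifnet *ifp, struct mbuf *m)
{
    struct foo_softc *sc = ifp->if_softc;
    struct foo_txq *txq = &sc->txq[0];  /* queue selection omitted */
    int error;

    if (mtx_trylock(&txq->mtx)) {
        /* We own the queue lock: enqueue and drain to the hardware. */
        error = drbr_enqueue(ifp, txq->br, m);
        if (error == 0)
            foo_txq_drain(txq);         /* hypothetical drain helper */
        mtx_unlock(&txq->mtx);
        return (error);
    }

    /*
     * Try-lock failed: enqueue and return.  The race window is here --
     * the lock holder may already have found the buf_ring empty and be
     * releasing the lock, so this mbuf can sit in the ring until some
     * later transmit (or a kick task, as igb(4) schedules) drains it.
     */
    return (drbr_enqueue(ifp, txq->br, m));
}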
>> (Note, all patches mentioned are against 8)
>>
>> The first hack I tried was to simply always lock the queue after the drbr enqueue if the try lock failed and then drain the queue if needed (www.freebsd.org/~jhb/patches/try_fail.patch). While this fixed my latency problems, it would seem that this breaks other workloads that the drbr design is trying to optimize.
>>
>> After further hacking, what I came up with was a variant of drbr_enqueue() that would atomically set a 'pending' flag during the enqueue operation. The first thread to fail the try lock sets this flag (it is told that it set the flag by a new return value (EINPROGRESS) from the enqueue call). The pending thread then explicitly clears the flag once it acquires the queue lock. This should prevent multiple threads from stacking up on the queue lock, so that if multiple threads are dumping packets into the ring concurrently, all but two (the one currently draining the queue and the one waiting for the lock) can continue without blocking. One downside of this approach, though, is that each driver has to be changed to make an explicit call to clear the pending flag after grabbing the queue lock if the try lock fails. This is what I am currently running in production (www.freebsd.org/~jhb/patches/try_fail3.patch).
>>
>> However, this still results in a lot of duplicated code in each driver that wants to support multiq. Several folks have expressed a desire to move in a direction where the stack has explicit knowledge of transmit queues, allowing us to hoist some of this duplicated code out of the drivers and up into the calling layer. After discussing this a bit with Navdeep (np@), the approach I am looking at is to alter the buf_ring code flow a bit to more closely model the older code-flow with IFQ and if_start methods. That is, have the if_transmit methods always enqueue each packet that arrives to the buf_ring and then call an if_start-like method that drains a specific transmit queue. This approach simplifies a fair bit of driver code and means we can potentially move the enqueue, etc. bits up into the calling layer and instead have drivers provide the per-transmit-queue start routine as the direct function pointer to the upper layers, ala if_start.
>>
>> However, we would still need a way to close the latency race. I've attempted to do that by inverting my previous 'thread pending' flag. Instead, I make the buf_ring store a 'busy' flag. This flag is managed by the single-consumer buf_ring dequeue method (that drbr_dequeue() uses). It is set to true when a packet is removed from the queue while there are more packets pending. Conversely, if there are no other pending packets then it is set to false. The assumption is that once a thread starts draining the queue, it will not stop until the queue is empty (or, if it has to stop for some other reason such as the transmit ring being full, the driver will restart draining of the queue until it is empty, e.g. after it receives a transmit completion interrupt). Now when the if_transmit routine enqueues the packet, it will get either a real error, 0 if the packet was enqueued and the queue was not busy, or EINPROGRESS if the packet was enqueued and the queue was busy. For the EINPROGRESS case the if_transmit routine just returns success. For the 0 case it does a blocking lock on the queue lock and calls the queue's start routine (note that this means that the busy flag is similar to the old OACTIVE interface flag).
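[Editor's illustration: a rough sketch of how the 'busy' scheme described above could look from a driver's if_transmit routine, together with the dequeue-side bookkeeping. The drbr_enqueue_busy() name, the txq->busy field, and the foo_* helpers are assumptions made for this sketch; the real interface is in the drbr.patch linked below.]

/*
 * Hypothetical enqueue variant: enqueues the mbuf and reports the state
 * of the ring's 'busy' flag.  Returns 0 if the queue was idle (caller
 * must start draining), EINPROGRESS if a drain is already in progress,
 * or a real error such as ENOBUFS.
 */
static int
foo_transmit(struct ifnet *ifp, struct mbuf *m)
{
    struct foo_softc *sc = ifp->if_softc;
    struct foo_txq *txq = &sc->txq[0];  /* queue selection omitted */
    int error;

    error = drbr_enqueue_busy(ifp, txq->br, m);  /* hypothetical name */
    if (error == EINPROGRESS)
        return (0);     /* someone else is draining; they will see our mbuf */
    if (error != 0)
        return (error); /* real error, e.g. ENOBUFS */

    /* Queue was idle: take the lock (blocking) and kick the drain. */
    mtx_lock(&txq->mtx);
    foo_txq_start(txq); /* drains until empty or the HW ring fills */
    mtx_unlock(&txq->mtx);
    return (0);
}

/*
 * Dequeue side: the single-consumer dequeue keeps the 'busy' flag in
 * sync -- busy while packets remain after a dequeue, idle once the ring
 * is empty.
 */
static struct mbuf *
foo_txq_dequeue(struct foo_txq *txq)
{
    struct mbuf *m;

    m = buf_ring_dequeue_sc(txq->br);
    atomic_store_rel_int(&txq->busy, !buf_ring_empty(txq->br));
    return (m);
}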
>> This does mean that in some cases you may have one thread that is sending what was the last packet in the buf_ring holding the lock when another thread blocks, and that the first thread will see the new packet when it loops back around so that the second thread is wasting its time spinning, but in the common case I believe it will give the same parallelism as the current code. OTOH, there is nothing to prevent multiple threads from "stacking up" in the new approach. At least the try_fail3 patch ensured only one thread at a time would ever potentially block on the queue lock.
>>
>> Another approach might be to replace the 'busy' flag with the 'thread pending' flag from try_fail3.patch, but to clear the 'thread pending' flag anytime the dequeue method is called rather than using an explicit 'clear pending' method. (Hadn't thought of that until writing this e-mail.) That would perhaps prevent multiple threads from waiting on the queue lock.
>>
>> Note that the 'busy' approach (or the modification I mentioned above) does rely on the assumption I stated above, i.e. once a driver starts draining a queue, it will drain it until empty unless it hits an "error" condition (link went down, transmit ring full, etc.). If it hits an "error" condition, the driver is responsible for restarting transmit when the condition clears. I believe our drivers already work this way now.
>>
>> The 'busy' patch is at http://www.freebsd.org/~jhb/patches/drbr.patch
>>
>> --
>> John Baldwin

> Q1: Has this been corrected?

I don't know.

> Q2: Are there any case studies or benchmarks for buf_ring, or is it just blindly being used because someone claimed it was better and offered it for free? One of the points of locking is to avoid race conditions, so the fact that you have races in a supposed lock-less scheme seems more than just ironic.

At the moment our entire ifnet/driver handoff is not optimal. In particular, the double buffering between the DMA rings and the IFQ (be it the old one or buf_ring) is troubling. See "Buffer Bloat" for one aspect of it.

While hacking up a couple of drivers to experiment with automatic irq/polling I've come up with a couple of ideas in this area that I'm implementing right now. When it's ready, meaningful benchmarks will be run to assess the impact.

-- 
Andre
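[Editor's illustration: a rough sketch of the 'thread pending' variant from John's quoted mail, with the flag cleared on dequeue as suggested there instead of via an explicit 'clear pending' call. The drbr_enqueue_pending() name and the foo_* driver bits are assumptions for this sketch, not the try_fail3.patch code.]

static int
foo_transmit(struct ifnet *ifp, struct mbuf *m)
{
    struct foo_softc *sc = ifp->if_softc;
    struct foo_txq *txq = &sc->txq[0];  /* queue selection omitted */
    int error;

    if (mtx_trylock(&txq->mtx)) {
        /* We own the queue lock: enqueue and drain as usual. */
        error = drbr_enqueue(ifp, txq->br, m);
        if (error == 0)
            foo_txq_drain(txq);
        mtx_unlock(&txq->mtx);
        return (error);
    }

    /*
     * Hypothetical variant of drbr_enqueue() that atomically sets a
     * 'pending' flag; EINPROGRESS tells us we were the thread that set
     * it, so only we go on to block on the queue lock.  Later threads
     * see the flag already set, get 0 back, and return immediately.
     */
    error = drbr_enqueue_pending(ifp, txq->br, m);
    if (error != EINPROGRESS)
        return (error);     /* enqueued (0) or a real error */

    mtx_lock(&txq->mtx);
    /* With clear-on-dequeue, no explicit 'clear pending' call is needed here. */
    foo_txq_drain(txq);
    mtx_unlock(&txq->mtx);
    return (0);
}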