From: John Baldwin <jhb@freebsd.org>
To: net@freebsd.org
Cc: Ed Maste, Navdeep Parhar
Date: Thu, 19 Jan 2012 11:41:25 -0500
Subject: Latency issues with buf_ring

The current buf_ring usage in various NIC drivers has a race that can
result in really high packet latencies in some cases.  Specifically, the
common pattern in an if_transmit routine is to use a try-lock on the
queue; if that fails, enqueue the packet in the buf_ring and return.
The race, of course, is that the thread holding the lock might have just
finished checking the buf_ring, found it empty, and be in the process of
releasing the lock when the original thread fails the try-lock.  If this
happens, the packet queued by the first thread is stalled until another
thread tries to transmit packets for that queue.  Some drivers attempt
to handle this race (igb(4) schedules a task to kick the transmit queue
if the try-lock fails) and others don't (cxgb(4) doesn't handle it at
all).  At work this race was triggered very often after upgrading from 7
to 8 with bursty traffic and caused numerous problems, so it is not a
rare occurrence and needs to be addressed.  (Note: all patches mentioned
are against 8.)

The first hack I tried was to simply always lock the queue after the
drbr enqueue if the try-lock failed and then drain the queue if needed
(www.freebsd.org/~jhb/patches/try_fail.patch).  While this fixed my
latency problems, it would seem that this breaks other workloads that
the drbr design is trying to optimize.

After further hacking, what I came up with was a variant of
drbr_enqueue() that atomically sets a 'pending' flag during the enqueue
operation.  The first thread to fail the try-lock sets this flag (it is
told that it set the flag by a new return value (EINPROGRESS) from the
enqueue call).  The pending thread then explicitly clears the flag once
it acquires the queue lock.  This should prevent multiple threads from
stacking up on the queue lock: if multiple threads are dumping packets
into the ring concurrently, all but two (the one currently draining the
queue and the one waiting for the lock) can enqueue their packets and
return without blocking.  A sketch of the resulting flow follows.
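To make the flow concrete, here is a minimal sketch of a
try_fail3-style if_transmit.  This is a sketch under stated
assumptions, not the patch itself: the example_* structures and
routines and drbr_clear_pending() are hypothetical stand-ins, and the
real interface is whatever try_fail3.patch provides; only the mtx(9)
calls and the stock drbr_enqueue() signature are the in-tree APIs.

/*
 * Kernel context: sys/param.h, sys/mbuf.h, sys/mutex.h, sys/buf_ring.h,
 * net/if.h, net/if_var.h.  All example_* names are hypothetical.
 */
struct example_txq {
	struct mtx	 txq_mtx;
	struct buf_ring	*txq_br;
};

struct example_softc {
	struct example_txq sc_txq[1];
};

static void example_txq_drain(struct example_softc *, struct example_txq *);

static int
example_if_transmit(struct ifnet *ifp, struct mbuf *m)
{
	struct example_softc *sc = ifp->if_softc;
	struct example_txq *txq = &sc->sc_txq[0];	/* queue selection elided */
	int error;

	if (mtx_trylock(&txq->txq_mtx)) {
		/* Uncontended path: enqueue and drain under the lock. */
		error = drbr_enqueue(ifp, txq->txq_br, m);
		if (error == 0)
			example_txq_drain(sc, txq);
		mtx_unlock(&txq->txq_mtx);
		return (error);
	}

	/*
	 * The try-lock failed.  The unpatched pattern would just enqueue
	 * and return here, racing with a lock holder that may already be
	 * past its final "ring empty?" check.  With the patched enqueue,
	 * the first thread to hit this path is told via EINPROGRESS that
	 * it set the 'pending' flag and must block on the lock and drain.
	 */
	error = drbr_enqueue(ifp, txq->txq_br, m);
	if (error != EINPROGRESS)
		return (error);	/* 0: another thread already has pending duty */

	mtx_lock(&txq->txq_mtx);
	drbr_clear_pending(txq->txq_br);	/* hypothetical clear call */
	example_txq_drain(sc, txq);
	mtx_unlock(&txq->txq_mtx);
	return (0);
}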
One downside of this approach, though, is that each driver has to be
changed to make an explicit call to clear the pending flag after
grabbing the queue lock if the try-lock fails.  This is what I am
currently running in production
(www.freebsd.org/~jhb/patches/try_fail3.patch).  However, this still
results in a lot of duplicated code in each driver that wants to
support multiq.

Several folks have expressed a desire to move in a direction where the
stack has explicit knowledge of transmit queues, allowing us to hoist
some of this duplicated code out of the drivers and up into the calling
layer.  After discussing this a bit with Navdeep (np@), the approach I
am looking at is to alter the buf_ring code flow a bit to more closely
model the older code flow with IFQ and if_start methods.  That is, have
the if_transmit methods always enqueue each packet that arrives to the
buf_ring and then call an if_start-like method that drains a specific
transmit queue.  This simplifies a fair bit of driver code and means we
can potentially move the enqueue, etc. bits up into the calling layer
and instead have drivers provide the per-transmit-queue start routine
as a direct function pointer to the upper layers, a la if_start.

However, we would still need a way to close the latency race.  I've
attempted to do that by inverting my previous 'thread pending' flag.
Instead, I make the buf_ring store a 'busy' flag.  This flag is managed
by the single-consumer buf_ring dequeue method (which drbr_dequeue()
uses).  It is set to true when a packet is removed from the queue while
there are more packets pending.  Conversely, if there are no other
pending packets, it is set to false.  The assumption is that once a
thread starts draining the queue, it will not stop until the queue is
empty (or, if it has to stop for some other reason such as the transmit
ring being full, the driver will restart draining of the queue until it
is empty, e.g. after it receives a transmit completion interrupt).

Now when the if_transmit routine enqueues the packet, it will get
either a real error, 0 if the packet was enqueued and the queue was
idle, or EINPROGRESS if the packet was enqueued and the queue was busy.
For the EINPROGRESS case the if_transmit routine just returns success.
For the 0 case it does a blocking lock on the queue lock and calls the
queue's start routine (note that this makes the busy flag similar to
the old OACTIVE interface flag).  This does mean that in some cases one
thread may be sending what was the last packet in the buf_ring while
holding the lock when another thread blocks; the first thread will see
the new packet when it loops back around, so the second thread is
wasting its time spinning.  But in the common case I believe it will
give the same parallelism as the current code.  A sketch of this flow
is below.

OTOH, there is nothing to prevent multiple threads from "stacking up"
in the new approach.  At least the try_fail3 patch ensured that only
one thread at a time would ever potentially block on the queue lock.
Another approach might be to replace the 'busy' flag with the 'thread
pending' flag from try_fail3.patch, but to clear the 'thread pending'
flag any time the dequeue method is called rather than using an
explicit 'clear pending' method.  (I hadn't thought of that until
writing this e-mail.)  That would perhaps prevent multiple threads from
waiting on the queue lock.
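For concreteness, here is a sketch of an if_transmit under the proposed
'busy' scheme, reusing the example structures from the earlier sketch.
Again the example_* names are hypothetical and this is not the contents
of drbr.patch; only the return-value convention just described
(0 = enqueued, queue was idle; EINPROGRESS = enqueued, queue busy;
anything else = real error) is taken from the text above.

static int
example_if_transmit(struct ifnet *ifp, struct mbuf *m)
{
	struct example_softc *sc = ifp->if_softc;
	struct example_txq *txq = &sc->sc_txq[0];	/* queue selection elided */
	int error;

	error = drbr_enqueue(ifp, txq->txq_br, m);
	if (error == EINPROGRESS)
		return (0);	/* queue busy: the current drainer will see our packet */
	if (error != 0)
		return (error);	/* real error, e.g. ENOBUFS when the ring is full */

	/* Queue was idle: block on the lock and start it ourselves. */
	mtx_lock(&txq->txq_mtx);
	example_txq_start(sc, txq);	/* the per-queue if_start-like routine */
	mtx_unlock(&txq->txq_mtx);
	return (0);
}

Here example_txq_start() would loop over drbr_dequeue() until the ring
is empty or the hardware ring fills; since the 'busy' flag is
maintained inside that single-consumer dequeue path, the driver never
touches it explicitly (unlike the 'clear pending' call in try_fail3).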
Note that the 'busy' approach (or the modification I mentioned above)
does rely on the assumption I stated earlier: once a driver starts
draining a queue, it will drain it until empty unless it hits an
"error" condition (link went down, transmit ring full, etc.).  If it
hits an "error" condition, the driver is responsible for restarting
transmit when the condition clears.  I believe our drivers already work
this way now.

The 'busy' patch is at http://www.freebsd.org/~jhb/patches/drbr.patch

-- 
John Baldwin