From owner-freebsd-net@FreeBSD.ORG Thu Dec 26 17:54:19 2013
Date: Thu, 26 Dec 2013 21:54:10 +0400
From: Oleg Bulyzhin <oleg@rinet.ru>
To: Ryan Stone
Cc: freebsd-net <freebsd-net@freebsd.org>
Subject: Re: buf_ring in HEAD is racy
Message-ID: <20131226175410.GA15332@lath.rinet.ru>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="vkogqOf2sHV7VnPd"
User-Agent: Mutt/1.5.21 (2010-09-15)
List-Id: Networking and TCP/IP with FreeBSD

--vkogqOf2sHV7VnPd
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Sat, Dec 14, 2013 at 12:04:57AM -0500, Ryan Stone wrote:
> I am seeing spurious output packet drops that appear to be due to
> insufficient memory barriers in buf_ring.  I believe that this is the
> scenario that I am seeing:
>
> 1) The buf_ring is empty, br_prod_head = br_cons_head = 0
> 2) Thread 1 attempts to enqueue an mbuf on the buf_ring.  It fetches
>    br_prod_head (0) into a local variable called prod_head
> 3) Thread 2 enqueues an mbuf on the buf_ring.
> The sequence of events is essentially:
>
>    Thread 2 claims an index in the ring and atomically sets br_prod_head (say to 1)
>    Thread 2 sets br_ring[1] = mbuf;
>    Thread 2 does a full memory barrier
>    Thread 2 updates br_prod_tail to 1
>
> 4) Thread 2 dequeues the packet from the buf_ring using the
>    single-consumer interface.  The sequence of events is essentially:
>
>    Thread 2 checks whether the queue is empty (br_cons_head == br_prod_tail);
>    this is false
>    Thread 2 sets br_cons_head to 1
>    Thread 2 grabs the mbuf from br_ring[1]
>    Thread 2 sets br_cons_tail to 1
>
> 5) Thread 1, which is still attempting to enqueue an mbuf on the ring,
>    fetches br_cons_tail (1) into a local variable called cons_tail.  It
>    sees cons_tail == 1 but prod_head == 0 and concludes that the ring is
>    full and drops the packet (incrementing br_drops non-atomically, I
>    might add).
>
> I can reproduce several drops per minute by configuring the ixgbe
> driver to use only 1 queue and then sending traffic from 8 concurrent
> iperf processes.  (You will need this hacky patch to even see the
> drops with netstat, though:
> http://people.freebsd.org/~rstone/patches/ixgbe_br_drops.diff)
>
> I am investigating fixing buf_ring by using acquire/release semantics
> rather than load/store barriers.  However, I note that this will
> apparently be the second attempt to fix buf_ring, and I'm seriously
> questioning whether this is worth the effort compared to the
> simplicity of using a mutex.  I'm not even convinced that a correct
> lockless implementation will be a performance win, given the number
> of memory barriers that will apparently be necessary.

I was able to reproduce this bug.  I used a 2-headed Intel 10G card with
a twinax loopback and a 4-core AMD Phenom.
Here are some results:

- netperf exposes the bug better for me than iperf: running 4 netperf
  clients simultaneously (UDP_STREAM test), I was able to see a steady
  drop rate of ~20 packets per second.
- I think you are right about the cause of those drops: there is a path
  in buf_ring_enqueue() where a packet can be dropped without proper
  synchronization with another instance of buf_ring_enqueue().
- Having looked through buf_ring.h, my opinion is that a race is also
  possible between buf_ring_enqueue() and buf_ring_dequeue_mc() which
  can lead to memory corruption.  (Though I didn't find any
  buf_ring_dequeue_mc() users in the kernel, and this race is not
  possible on x86/amd64 due to their memory model, it should be possible
  on arm/powerpc.)

I've attached a patch which should fix all these problems.  If there is
no objection I can commit it.  Unfortunately I have no non-x86 hardware
to play with, so I can't test the enqueue()/dequeue_mc() race myself.

With this patch output drops are rare: in 10 tests (4 netperfs running
the UDP_STREAM test for 60 seconds) 8 tests had no drops, one had 2
drops and another had 39 drops.  These drops happen (I guess) when the
buf_ring is _really_ full.  Without the patch the same test gives me
about 20 drops _per second_.

-- 
Oleg.
================================================================
=== Oleg Bulyzhin -- OBUL-RIPN -- OBUL-RIPE -- oleg@rinet.ru ===
================================================================

--vkogqOf2sHV7VnPd
Content-Type: text/x-diff; charset=us-ascii
Content-Disposition: attachment; filename="buf_ring.h.diff"

--- sys/sys/buf_ring.h.orig	2013-03-21 13:08:09.000000000 +0400
+++ sys/sys/buf_ring.h	2013-12-26 21:34:57.000000000 +0400
@@ -64,8 +64,7 @@
 static __inline int
 buf_ring_enqueue(struct buf_ring *br, void *buf)
 {
-	uint32_t prod_head, prod_next;
-	uint32_t cons_tail;
+	uint32_t prod_head, prod_next, cons_tail;
 #ifdef DEBUG_BUFRING
 	int i;
 	for (i = br->br_cons_head; i != br->br_prod_head;
@@ -77,16 +76,20 @@
 	critical_enter();
 	do {
 		prod_head = br->br_prod_head;
+		prod_next = (prod_head + 1) & br->br_prod_mask;
 		cons_tail = br->br_cons_tail;
 
-		prod_next = (prod_head + 1) & br->br_prod_mask;
-
 		if (prod_next == cons_tail) {
-			br->br_drops++;
-			critical_exit();
-			return (ENOBUFS);
+			rmb();
+			if (prod_head == br->br_prod_head &&
+			    cons_tail == br->br_cons_tail) {
+				br->br_drops++;
+				critical_exit();
+				return (ENOBUFS);
+			}
+			continue;
 		}
-	} while (!atomic_cmpset_int(&br->br_prod_head, prod_head, prod_next));
+	} while (!atomic_cmpset_acq_int(&br->br_prod_head, prod_head, prod_next));
 #ifdef DEBUG_BUFRING
 	if (br->br_ring[prod_head] != NULL)
 		panic("dangling value in enqueue");
@@ -94,19 +97,13 @@
 	br->br_ring[prod_head] = buf;
 
 	/*
-	 * The full memory barrier also avoids that br_prod_tail store
-	 * is reordered before the br_ring[prod_head] is full setup.
-	 */
-	mb();
-
-	/*
 	 * If there are other enqueues in progress
 	 * that preceeded us, we need to wait for them
 	 * to complete
 	 */
 	while (br->br_prod_tail != prod_head)
 		cpu_spinwait();
-	br->br_prod_tail = prod_next;
+	atomic_store_rel_int(&br->br_prod_tail, prod_next);
 	critical_exit();
 	return (0);
 }
@@ -119,37 +116,23 @@
 buf_ring_dequeue_mc(struct buf_ring *br)
 {
 	uint32_t cons_head, cons_next;
-	uint32_t prod_tail;
 	void *buf;
-	int success;
 
 	critical_enter();
 	do {
 		cons_head = br->br_cons_head;
-		prod_tail = br->br_prod_tail;
-
 		cons_next = (cons_head + 1) & br->br_cons_mask;
-
-		if (cons_head == prod_tail) {
+
+		if (cons_head == br->br_prod_tail) {
			critical_exit();
			return (NULL);
		}
-
-		success = atomic_cmpset_int(&br->br_cons_head, cons_head,
-		    cons_next);
-	} while (success == 0);
+	} while (!atomic_cmpset_acq_int(&br->br_cons_head, cons_head, cons_next));
 
 	buf = br->br_ring[cons_head];
 #ifdef DEBUG_BUFRING
 	br->br_ring[cons_head] = NULL;
 #endif
-
-	/*
-	 * The full memory barrier also avoids that br_ring[cons_read]
-	 * load is reordered after br_cons_tail is set.
-	 */
-	mb();
-
 	/*
 	 * If there are other dequeues in progress
 	 * that preceeded us, we need to wait for them
@@ -158,7 +141,7 @@
 	while (br->br_cons_tail != cons_head)
 		cpu_spinwait();
 
-	br->br_cons_tail = cons_next;
+	atomic_store_rel_int(&br->br_cons_tail, cons_next);
 	critical_exit();
 
 	return (buf);

--vkogqOf2sHV7VnPd--