From owner-freebsd-net@FreeBSD.ORG Thu Dec 26 17:54:19 2013
Date: Thu, 26 Dec 2013 21:54:10 +0400
From: Oleg Bulyzhin <oleg@rinet.ru>
To: Ryan Stone
Cc: freebsd-net <freebsd-net@freebsd.org>
Subject: Re: buf_ring in HEAD is racy
Message-ID: <20131226175410.GA15332@lath.rinet.ru>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="vkogqOf2sHV7VnPd"
User-Agent: Mutt/1.5.21 (2010-09-15)
List-Id: Networking and TCP/IP with FreeBSD

--vkogqOf2sHV7VnPd
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Sat, Dec 14, 2013 at 12:04:57AM -0500, Ryan Stone wrote:
> I am seeing spurious output packet drops that appear to be due to
> insufficient memory barriers in buf_ring.  I believe that this is the
> scenario that I am seeing:
>
> 1) The buf_ring is empty, br_prod_head = br_cons_head = 0
> 2) Thread 1 attempts to enqueue an mbuf on the buf_ring.  It fetches
>    br_prod_head (0) into a local variable called prod_head
> 3) Thread 2 enqueues an mbuf on the buf_ring.
> The sequence of events is essentially:
>
>    Thread 2 claims an index in the ring and atomically sets br_prod_head (say to 1)
>    Thread 2 sets br_ring[1] = mbuf;
>    Thread 2 does a full memory barrier
>    Thread 2 updates br_prod_tail to 1
>
> 4) Thread 2 dequeues the packet from the buf_ring using the
>    single-consumer interface.  The sequence of events is essentially:
>
>    Thread 2 checks whether the queue is empty (br_cons_head == br_prod_tail);
>    this is false
>    Thread 2 sets br_cons_head to 1
>    Thread 2 grabs the mbuf from br_ring[1]
>    Thread 2 sets br_cons_tail to 1
>
> 5) Thread 1, which is still attempting to enqueue an mbuf on the ring,
>    fetches br_cons_tail (1) into a local variable called cons_tail.  It
>    sees cons_tail == 1 but prod_head == 0 and concludes that the ring is
>    full and drops the packet (incrementing br_drops non-atomically, I
>    might add).
>
> I can reproduce several drops per minute by configuring the ixgbe
> driver to use only 1 queue and then sending traffic from 8 concurrent
> iperf processes.  (You will need this hacky patch to even see the
> drops with netstat, though:
> http://people.freebsd.org/~rstone/patches/ixgbe_br_drops.diff)
>
> I am investigating fixing buf_ring by using acquire/release semantics
> rather than load/store barriers.  However, I note that this will
> apparently be the second attempt to fix buf_ring, and I'm seriously
> questioning whether this is worth the effort compared to the
> simplicity of using a mutex.  I'm not even convinced that a correct
> lockless implementation will be a performance win, given the number
> of memory barriers that will apparently be necessary.

I was able to reproduce this bug.  I used a 2-headed Intel 10G card with
a twinax loopback and a 4-core AMD Phenom.
Here are some results:

- netperf exposes the bug better for me than iperf: running 4 netperf
  clients simultaneously (UDP_STREAM test), I was able to see a steady
  drop rate of ~20 packets per second.
- I think you are right about the cause of those drops: there is a path
  in buf_ring_enqueue() where a packet can be dropped without proper
  synchronization with another instance of buf_ring_enqueue().
- Having looked through buf_ring.h, my opinion is that a race is also
  possible between buf_ring_enqueue() and buf_ring_dequeue_mc() which
  can lead to memory corruption.  (Though I didn't find any
  buf_ring_dequeue_mc() users in the kernel, and this race is not
  possible on x86/amd64 due to their memory model, it should be possible
  on arm/powerpc.)

I've attached a patch which should fix all these problems.  If there is
no objection I can commit it.  Unfortunately I have no non-x86 hardware
to play with, so I can't test the enqueue()/dequeue_mc() race myself.

With this patch output drops are rare: in 10 tests (4 netperfs running
the UDP_STREAM test for 60 seconds) 8 tests had no drops, one had 2
drops and another had 39 drops.  These drops happen (I guess) when the
buf_ring is _really_ full.  Without the patch the same test gives me
about 20 drops _per second_.

-- 
Oleg.
================================================================
=== Oleg Bulyzhin -- OBUL-RIPN -- OBUL-RIPE -- oleg@rinet.ru ===
================================================================

--vkogqOf2sHV7VnPd
Content-Type: text/x-diff; charset=us-ascii
Content-Disposition: attachment; filename="buf_ring.h.diff"

--- sys/sys/buf_ring.h.orig	2013-03-21 13:08:09.000000000 +0400
+++ sys/sys/buf_ring.h	2013-12-26 21:34:57.000000000 +0400
@@ -64,8 +64,7 @@
 static __inline int
 buf_ring_enqueue(struct buf_ring *br, void *buf)
 {
-	uint32_t prod_head, prod_next;
-	uint32_t cons_tail;
+	uint32_t prod_head, prod_next, cons_tail;
 #ifdef DEBUG_BUFRING
 	int i;
 	for (i = br->br_cons_head; i != br->br_prod_head;
@@ -77,16 +76,20 @@
 	critical_enter();
 	do {
 		prod_head = br->br_prod_head;
+		prod_next = (prod_head + 1) & br->br_prod_mask;
 		cons_tail = br->br_cons_tail;
 
-		prod_next = (prod_head + 1) & br->br_prod_mask;
-
 		if (prod_next == cons_tail) {
-			br->br_drops++;
-			critical_exit();
-			return (ENOBUFS);
+			rmb();
+			if (prod_head == br->br_prod_head &&
+			    cons_tail == br->br_cons_tail) {
+				br->br_drops++;
+				critical_exit();
+				return (ENOBUFS);
+			}
+			continue;
 		}
-	} while (!atomic_cmpset_int(&br->br_prod_head, prod_head, prod_next));
+	} while (!atomic_cmpset_acq_int(&br->br_prod_head, prod_head, prod_next));
 #ifdef DEBUG_BUFRING
 	if (br->br_ring[prod_head] != NULL)
 		panic("dangling value in enqueue");
@@ -94,19 +97,13 @@
 	br->br_ring[prod_head] = buf;
 
 	/*
-	 * The full memory barrier also avoids that br_prod_tail store
-	 * is reordered before the br_ring[prod_head] is full setup.
-	 */
-	mb();
-
-	/*
 	 * If there are other enqueues in progress
 	 * that preceeded us, we need to wait for them
 	 * to complete
 	 */
 	while (br->br_prod_tail != prod_head)
 		cpu_spinwait();
-	br->br_prod_tail = prod_next;
+	atomic_store_rel_int(&br->br_prod_tail, prod_next);
 	critical_exit();
 	return (0);
 }
@@ -119,37 +116,23 @@
 buf_ring_dequeue_mc(struct buf_ring *br)
 {
 	uint32_t cons_head, cons_next;
-	uint32_t prod_tail;
 	void *buf;
-	int success;
 
 	critical_enter();
 	do {
 		cons_head = br->br_cons_head;
-		prod_tail = br->br_prod_tail;
-
 		cons_next = (cons_head + 1) & br->br_cons_mask;
-
-		if (cons_head == prod_tail) {
+
+		if (cons_head == br->br_prod_tail) {
			critical_exit();
			return (NULL);
		}
-
-		success = atomic_cmpset_int(&br->br_cons_head, cons_head,
-		    cons_next);
-	} while (success == 0);
+	} while (!atomic_cmpset_acq_int(&br->br_cons_head, cons_head, cons_next));
 
 	buf = br->br_ring[cons_head];
 #ifdef DEBUG_BUFRING
 	br->br_ring[cons_head] = NULL;
 #endif
-
-	/*
-	 * The full memory barrier also avoids that br_ring[cons_read]
-	 * load is reordered after br_cons_tail is set.
-	 */
-	mb();
-
 	/*
 	 * If there are other dequeues in progress
 	 * that preceeded us, we need to wait for them
@@ -158,7 +141,7 @@
 	while (br->br_cons_tail != cons_head)
 		cpu_spinwait();
 
-	br->br_cons_tail = cons_next;
+	atomic_store_rel_int(&br->br_cons_tail, cons_next);
 	critical_exit();
 
 	return (buf);

--vkogqOf2sHV7VnPd--