From: "De La Gueronniere, Marc"
To: Oleg Bulyzhin, Ryan Stone
Cc: freebsd-net, "Charbon, Julien"
Subject: Re: buf_ring in HEAD is racy
Date: Wed, 22 Oct 2014 17:44:22 +0000
List-Id: Networking and TCP/IP with FreeBSD

On 12/26/13, 6:54 PM, "Oleg Bulyzhin" wrote:

>On Sat, Dec 14, 2013 at 12:04:57AM -0500, Ryan Stone wrote:
>> I am seeing spurious output packet drops that appear to be due to
>> insufficient memory barriers in buf_ring.  I believe that this is the
>> scenario I am seeing:
>>
>> 1) The buf_ring is empty, br_prod_head = br_cons_head = 0.
>> 2) Thread 1 attempts to enqueue an mbuf on the buf_ring.  It fetches
>>    br_prod_head (0) into a local variable called prod_head.
>> 3) Thread 2 enqueues an mbuf on the buf_ring.  The sequence of events
>>    is essentially:
>>
>>    Thread 2 claims an index in the ring and atomically sets br_prod_head
>>    (say to 1)
>>    Thread 2 sets br_ring[1] = mbuf;
>>    Thread 2 does a full memory barrier
>>    Thread 2 updates br_prod_tail to 1
>>
>> 4) Thread 2 dequeues the packet from the buf_ring using the
>>    single-consumer interface.  The sequence of events is essentially:
>>
>>    Thread 2 checks whether the queue is empty (br_cons_head == br_prod_tail);
>>    this is false
>>    Thread 2 sets br_cons_head to 1
>>    Thread 2 grabs the mbuf from br_ring[1]
>>    Thread 2 sets br_cons_tail to 1
>>
>> 5) Thread 1, which is still attempting to enqueue an mbuf on the ring,
>>    fetches br_cons_tail (1) into a local variable called cons_tail.  It
>>    sees cons_tail == 1 but prod_head == 0 and concludes that the ring is
>>    full, and drops the packet (incrementing br_drops non-atomically, I
>>    might add).
>>
>> I can reproduce several drops per minute by configuring the ixgbe
>> driver to use only 1 queue and then sending traffic from 8 concurrent
>> iperf processes.  (You will need this hacky patch to even see the
>> drops with netstat, though:
>> http://people.freebsd.org/~rstone/patches/ixgbe_br_drops.diff)
>>
>> I am investigating fixing buf_ring by using acquire/release semantics
>> rather than load/store barriers.  However, I note that this will
>> apparently be the second attempt to fix buf_ring, and I'm seriously
>> questioning whether this is worth the effort compared to the
>> simplicity of using a mutex.  I'm not even convinced that a correct
>> lockless implementation will be a performance win, given the number
>> of memory barriers that will apparently be necessary.
>
>I was able to reproduce this bug.  I used a dual-port Intel 10G card with
>a twinax loopback and a 4-core AMD Phenom.  Here are some results:
>
>- netperf exposes the bug better for me than iperf; I was able to see a
>  steady packet drop (~20 packets per second) when running 4 netperf
>  clients simultaneously (UDP_STREAM test).
>
>- I think you are right about the cause of those drops: there is a path
>  in buf_ring_enqueue() where a packet can be dropped without proper
>  synchronization with another instance of buf_ring_enqueue().
>
>- I've looked through buf_ring.h, and in my opinion a race is also
>  possible between buf_ring_enqueue() and buf_ring_dequeue_mc() which
>  can lead to memory corruption.  (I didn't find any buf_ring_dequeue_mc()
>  users in the kernel, and this race is not possible on x86/amd64 due to
>  their memory model, but it should be possible on arm/powerpc.)
>
>I've attached a patch which should fix all these problems.  If there is
>no objection I can commit it.  Unfortunately I have no non-x86 hardware
>to play with, so I can't test the enqueue()/dequeue_mc() race myself.

Hi Oleg and Ryan,

We have run into the spurious drop issue too.  I could not make sense of
seeing a single drop at a time every few seconds: if a queue of 4k
elements fills, more than one packet is likely to be dropped once the
queue-full condition is reached.  So we investigated and came to the same
conclusion, that buf_ring_enqueue() is broken, at which point I found
this thread.

Are there any pending patches or investigations besides Oleg's patch?
We are definitely interested in helping with this; I just want to make
sure that we are not duplicating effort.

I also suspect there are further problems with buf_ring.  A full
wrap-around of the atomically swapped value is possible, i.e. the code
thinks it just atomically updated a head/tail index when in fact a full
wrap-around occurred, leading to undefined behavior.  A relatively
simple way to avoid this is to mask only on ring-array access, and to
let the head/tail/prod/cons indices run freely and wrap over the full
width of the integer rather than over the array size.
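
To make that last point concrete, here is a rough single-producer/
single-consumer sketch of the indexing idiom I have in mind.  It is not
the actual buf_ring code and all the names are invented; it also leaves
out the CAS loop and memory barriers a real multi-producer ring needs.
It only shows free-running indices with the mask applied at array
access time:

/*
 * Rough sketch only -- not the actual buf_ring code.  A simplified
 * single-producer/single-consumer ring: prod/cons are free-running
 * 32-bit counters that are allowed to wrap, and the mask is applied
 * only when indexing the array.
 */
#include <stdint.h>
#include <stddef.h>

#define	SKETCH_RING_SIZE	4096	/* must be a power of two */
#define	SKETCH_RING_MASK	(SKETCH_RING_SIZE - 1)

struct sketch_ring {
	volatile uint32_t	prod;	/* free-running producer index */
	volatile uint32_t	cons;	/* free-running consumer index */
	void			*slot[SKETCH_RING_SIZE];
};

/* Returns 0 on success, -1 if the ring is full. */
static int
sketch_enqueue(struct sketch_ring *r, void *buf)
{

	/*
	 * Full when the producer is a whole ring ahead of the consumer;
	 * unsigned subtraction stays correct across 32-bit wrap-around.
	 */
	if (r->prod - r->cons == SKETCH_RING_SIZE)
		return (-1);
	r->slot[r->prod & SKETCH_RING_MASK] = buf;
	/* A release barrier would be needed here before publishing. */
	r->prod++;
	return (0);
}

/* Returns NULL if the ring is empty. */
static void *
sketch_dequeue(struct sketch_ring *r)
{
	void *buf;

	if (r->prod == r->cons)
		return (NULL);
	/* An acquire barrier would be needed here in a real MP version. */
	buf = r->slot[r->cons & SKETCH_RING_MASK];
	r->cons++;
	return (buf);
}

The point is just that the counters wrap modulo 2^32, which is a
multiple of any power-of-two ring size, so the masked slot index and
the full/empty arithmetic stay consistent no matter how often the
indices wrap; the multi-producer CAS loop and the exact barrier
placement are a separate discussion.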

Cheers,

Marc de la Gueronniere - Verisign

(apologies for the following legal boilerplate)