From: "De La Gueronniere, Marc"
To: Oleg Bulyzhin, Ryan Stone
Cc: freebsd-net, "Charbon, Julien"
Subject: Re: buf_ring in HEAD is racy
Date: Wed, 22 Oct 2014 17:44:22 +0000
List-Id: Networking and TCP/IP with FreeBSD

On 12/26/13, 6:54 PM, "Oleg Bulyzhin" wrote:

>On Sat, Dec 14, 2013 at 12:04:57AM -0500, Ryan Stone wrote:
>> I am seeing spurious output packet drops that appear to be due to
>> insufficient memory barriers in buf_ring.  I believe that this is the
>> scenario I am seeing:
>>
>> 1) The buf_ring is empty, br_prod_head = br_cons_head = 0.
>> 2) Thread 1 attempts to enqueue an mbuf on the buf_ring.  It fetches
>>    br_prod_head (0) into a local variable called prod_head.
>> 3) Thread 2 enqueues an mbuf on the buf_ring.  The sequence of events
>>    is essentially:
>>
>>    Thread 2 claims an index in the ring and atomically sets br_prod_head
>>    (say to 1)
>>    Thread 2 sets br_ring[1] = mbuf;
>>    Thread 2 does a full memory barrier
>>    Thread 2 updates br_prod_tail to 1
>>
>> 4) Thread 2 dequeues the packet from the buf_ring using the
>>    single-consumer interface.  The sequence of events is essentially:
>>
>>    Thread 2 checks whether the queue is empty (br_cons_head == br_prod_tail);
>>    this is false
>>    Thread 2 sets br_cons_head to 1
>>    Thread 2 grabs the mbuf from br_ring[1]
>>    Thread 2 sets br_cons_tail to 1
>>
>> 5) Thread 1, which is still attempting to enqueue an mbuf on the ring,
>>    fetches br_cons_tail (1) into a local variable called cons_tail.  It
>>    sees cons_tail == 1 but prod_head == 0 and concludes that the ring is
>>    full, and drops the packet (incrementing br_drops non-atomically, I
>>    might add).
>>
>> I can reproduce several drops per minute by configuring the ixgbe
>> driver to use only 1 queue and then sending traffic from 8 concurrent
>> iperf processes.  (You will need this hacky patch to even see the
>> drops with netstat, though:
>> http://people.freebsd.org/~rstone/patches/ixgbe_br_drops.diff)
>>
>> I am investigating fixing buf_ring by using acquire/release semantics
>> rather than load/store barriers.  However, I note that this will
>> apparently be the second attempt to fix buf_ring, and I'm seriously
>> questioning whether this is worth the effort compared to the
>> simplicity of using a mutex.  I'm not even convinced that a correct
>> lockless implementation will be a performance win, given the number
>> of memory barriers that will apparently be necessary.
>
>I was able to reproduce this bug.  I used a dual-port Intel 10G card with
>a twinax loopback and a 4-core AMD Phenom.  Here are some results:
>
>- netperf exposes the bug better for me than iperf; I was able to see a
>  steady packet drop (~20 packets per second) when running 4 netperf
>  clients simultaneously (UDP_STREAM test).
>
>- I think you are right about the cause of those drops: there is a path
>  in buf_ring_enqueue() where a packet can be dropped without proper
>  synchronization with another instance of buf_ring_enqueue().
>
>- I've looked through buf_ring.h, and in my opinion a race is also
>  possible between buf_ring_enqueue() and buf_ring_dequeue_mc() which
>  can lead to memory corruption.  (I didn't find any buf_ring_dequeue_mc()
>  users in the kernel, and this race is not possible on x86/amd64 due to
>  their memory model, but it should be possible on arm/powerpc.)
>
>I've attached a patch which should fix all these problems.  If there is
>no objection I can commit it.  Unfortunately I have no non-x86 hardware
>to play with, so I can't test the enqueue()/dequeue_mc() race myself.

Hi Oleg and Ryan,

We have run into the spurious drop issue too.  I could not make sense of
seeing a single drop at a time every few seconds: if a queue of 4k
elements fills, more than one packet is likely to be dropped once the
queue-full condition is reached.  So we investigated and came to the same
conclusion, that buf_ring_enqueue() is broken, at which point I found
this thread.

Are there any pending patches or investigations besides Oleg's patch?
We are definitely interested in helping with this; I just want to make
sure that we are not duplicating effort.

I also suspect there are further problems with buf_ring.  A full
wrap-around of the atomically swapped value is possible, i.e. the code
thinks it just atomically updated a head/tail index when in fact a full
wrap-around occurred, leading to undefined behavior.  A relatively
simple way to avoid this is to mask only on ring-array access, and to
let the head/tail/prod/cons indices run freely and wrap over the full
width of the integer rather than over the array size.
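
To make that last point concrete, here is a rough single-producer/
single-consumer sketch of the indexing idiom I have in mind.  It is not
the actual buf_ring code and all the names are invented; it also leaves
out the CAS loop and memory barriers a real multi-producer ring needs.
It only shows free-running indices with the mask applied at array
access time:

/*
 * Rough sketch only -- not the actual buf_ring code.  A simplified
 * single-producer/single-consumer ring: prod/cons are free-running
 * 32-bit counters that are allowed to wrap, and the mask is applied
 * only when indexing the array.
 */
#include <stdint.h>
#include <stddef.h>

#define	SKETCH_RING_SIZE	4096	/* must be a power of two */
#define	SKETCH_RING_MASK	(SKETCH_RING_SIZE - 1)

struct sketch_ring {
	volatile uint32_t	prod;	/* free-running producer index */
	volatile uint32_t	cons;	/* free-running consumer index */
	void			*slot[SKETCH_RING_SIZE];
};

/* Returns 0 on success, -1 if the ring is full. */
static int
sketch_enqueue(struct sketch_ring *r, void *buf)
{

	/*
	 * Full when the producer is a whole ring ahead of the consumer;
	 * unsigned subtraction stays correct across 32-bit wrap-around.
	 */
	if (r->prod - r->cons == SKETCH_RING_SIZE)
		return (-1);
	r->slot[r->prod & SKETCH_RING_MASK] = buf;
	/* A release barrier would be needed here before publishing. */
	r->prod++;
	return (0);
}

/* Returns NULL if the ring is empty. */
static void *
sketch_dequeue(struct sketch_ring *r)
{
	void *buf;

	if (r->prod == r->cons)
		return (NULL);
	/* An acquire barrier would be needed here in a real MP version. */
	buf = r->slot[r->cons & SKETCH_RING_MASK];
	r->cons++;
	return (buf);
}

The point is just that the counters wrap modulo 2^32, which is a
multiple of any power-of-two ring size, so the masked slot index and
the full/empty arithmetic stay consistent no matter how often the
indices wrap; the multi-producer CAS loop and the exact barrier
placement are a separate discussion.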

Cheers,

Marc de la Gueronniere - Verisign

(apologies for the following legal boilerplate)