From owner-freebsd-net@FreeBSD.ORG Sun Nov 18 17:26:40 2012
From: Barney Cordoba <barney_cordoba@yahoo.com>
Date: Sun, 18 Nov 2012 09:24:01 -0800 (PST)
To: John Baldwin <jhb@freebsd.org>
Cc: freebsd-net@freebsd.org
Subject: Re: Latency issues with buf_ring
Message-ID: <1353259441.19423.YahooMailClassic@web121605.mail.ne1.yahoo.com>
In-Reply-To: <201201191141.25998.jhb@freebsd.org>
List-Id: Networking and TCP/IP with FreeBSD
--- On Thu, 1/19/12, John Baldwin <jhb@freebsd.org> wrote:

> From: John Baldwin <jhb@freebsd.org>
> Subject: Latency issues with buf_ring
> To: net@freebsd.org
> Cc: "Ed Maste" <emaste@freebsd.org>, "Navdeep Parhar" <np@freebsd.org>
> Date: Thursday, January 19, 2012, 11:41 AM
>
> The current buf_ring usage in various NIC drivers has a race that can
> result in really high packet latencies in some cases.  Specifically,
> the common pattern in an if_transmit routine is to use a try-lock on
> the queue and, if that fails, enqueue the packet in the buf_ring and
> return.  The race, of course, is that the thread holding the lock
> might have just finished checking the buf_ring and found it empty and
> be in the process of releasing the lock when the original thread fails
> the try-lock.  If this happens, then the packet queued by the first
> thread will be stalled until another thread tries to transmit packets
> for that queue.  Some drivers attempt to handle this race (igb(4)
> schedules a task to kick the transmit queue if the try-lock fails) and
> others don't (cxgb(4) doesn't handle it at all).  At work this race
> was triggered very often after upgrading from 7 to 8 with bursty
> traffic and caused numerous problems, so it is not a rare occurrence
> and needs to be addressed.
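[For readers following along: the try-lock pattern being described
looks roughly like the sketch below.  This is schematic only --
"struct txq" and drain_queue() are stand-ins for a driver's per-queue
state and start routine, not any driver's actual code.]

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>
#include <sys/buf_ring.h>
#include <net/if.h>
#include <net/if_var.h>         /* drbr_enqueue() and friends */

struct txq {
        struct mtx       lock;  /* per-queue transmit lock */
        struct buf_ring *br;    /* software ring of pending mbufs */
};

static void drain_queue(struct ifnet *, struct txq *); /* driver's start routine */

static int
racy_if_transmit(struct ifnet *ifp, struct txq *txq, struct mbuf *m)
{
        int error;

        if (mtx_trylock(&txq->lock)) {
                /* We own the queue: enqueue and drain it ourselves. */
                error = drbr_enqueue(ifp, txq->br, m);
                if (error == 0)
                        drain_queue(ifp, txq);
                mtx_unlock(&txq->lock);
                return (error);
        }

        /*
         * Try-lock failed, so another thread holds the queue lock.
         * But that thread may already have seen an empty ring and be
         * on its way out; if so, this mbuf sits in the ring until the
         * next transmit attempt on this queue.  That is the race.
         */
        return (drbr_enqueue(ifp, txq->br, m));
}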
> (Note, all patches mentioned are against 8)
>
> The first hack I tried to use was to simply always lock the queue
> after the drbr enqueue if the try-lock failed and then drain the
> queue if needed (www.freebsd.org/~jhb/patches/try_fail.patch).  While
> this fixed my latency problems, it would seem that this breaks other
> workloads that the drbr design is trying to optimize.
>
> After further hacking, what I came up with was a variant of
> drbr_enqueue() that would atomically set a 'pending' flag during the
> enqueue operation.  The first thread to fail the try-lock sets this
> flag (it is told that it set the flag by a new return value
> (EINPROGRESS) from the enqueue call).  The pending thread then
> explicitly clears the flag once it acquires the queue lock.  This
> should prevent multiple threads from stacking up on the queue lock,
> so that if multiple threads are dumping packets into the ring
> concurrently, all but two (the one currently draining the queue and
> the one waiting for the lock) can continue without blocking.  One
> downside of this approach, though, is that each driver has to be
> changed to make an explicit call to clear the pending flag after
> grabbing the queue lock if the try-lock fails.  This is what I am
> currently running in production
> (www.freebsd.org/~jhb/patches/try_fail3.patch).
>
> However, this still results in a lot of duplicated code in each
> driver that wants to support multiq.  Several folks have expressed a
> desire to move in a direction where the stack has explicit knowledge
> of transmit queues, allowing us to hoist some of this duplicated code
> out of the drivers and up into the calling layer.  After discussing
> this a bit with Navdeep (np@), the approach I am looking at is to
> alter the buf_ring code flow a bit to more closely model the older
> code flow with IFQ and if_start methods.  That is, have the
> if_transmit methods always enqueue each packet that arrives to the
> buf_ring and then call an if_start-like method that drains a specific
> transmit queue.  This approach simplifies a fair bit of driver code
> and means we can potentially move the enqueue, etc. bits up into the
> calling layer and instead have drivers provide the per-transmit-queue
> start routine as the direct function pointer to the upper layers, a
> la if_start.
>
> However, we would still need a way to close the latency race.  I've
> attempted to do that by inverting my previous 'thread pending' flag.
> Instead, I make the buf_ring store a 'busy' flag.  This flag is
> managed by the single-consumer buf_ring dequeue method (the one that
> drbr_dequeue() uses).  It is set to true when a packet is removed
> from the queue while there are more packets pending.  Conversely, if
> there are no other pending packets then it is set to false.  The
> assumption is that once a thread starts draining the queue, it will
> not stop until the queue is empty (or, if it has to stop for some
> other reason such as the transmit ring being full, the driver will
> restart draining of the queue until it is empty, e.g. after it
> receives a transmit completion interrupt).  Now when the if_transmit
> routine enqueues the packet, it will get either a real error, 0 if
> the packet was enqueued and the queue was idle, or EINPROGRESS if the
> packet was enqueued and the queue was busy.  For the EINPROGRESS case
> the if_transmit routine just returns success.  For the 0 case it does
> a blocking lock on the queue lock and calls the queue's start routine
> (note that this means that the busy flag is similar to the old
> OACTIVE interface flag).  This does mean that in some cases you may
> have one thread that is sending what was the last packet in the
> buf_ring holding the lock when another thread blocks, and that the
> first thread will see the new packet when it loops back around, so
> that the second thread is wasting its time spinning; but in the
> common case I believe it will give the same parallelism as the
> current code.  OTOH, there is nothing to prevent multiple threads
> from "stacking up" in the new approach.  At least the try_fail3 patch
> ensured that only one thread at a time would ever potentially block
> on the queue lock.
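[If I'm reading the 'busy' proposal right, the driver side collapses
to something like the sketch below, reusing the txq/drain_queue
stand-ins from above.  The return-value semantics in the comment are
my reading of the description; the real code is in drbr.patch.]

static int
busy_if_transmit(struct ifnet *ifp, struct txq *txq, struct mbuf *m)
{
        int error;

        /*
         * Assumed semantics of the modified enqueue: 0 means the
         * packet was enqueued and the queue was idle, EINPROGRESS
         * means it was enqueued and a consumer is already draining,
         * and anything else is a real error (e.g. ENOBUFS).
         */
        error = drbr_enqueue(ifp, txq->br, m);
        if (error == EINPROGRESS)
                return (0);     /* the draining thread will send it */
        if (error != 0)
                return (error);

        /* Queue was idle: this thread must start it (blocking lock). */
        mtx_lock(&txq->lock);
        drain_queue(ifp, txq);  /* drains until the ring is empty */
        mtx_unlock(&txq->lock);
        return (0);
}

[Note the try-lock is gone entirely: the busy flag plays the role
OACTIVE used to, and the blocking lock is taken only by the thread
that found the queue idle.]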
> Another approach might be to replace the 'busy' flag with the 'thread
> pending' flag from try_fail3.patch, but to clear the 'thread pending'
> flag anytime the dequeue method is called rather than using an
> explicit 'clear pending' method.  (Hadn't thought of that until
> writing this e-mail.)  That would perhaps prevent multiple threads
> from waiting on the queue lock.
>
> Note that the 'busy' approach (or the modification I mentioned above)
> does rely on the assumption I stated above, i.e. once a driver starts
> draining a queue, it will drain it until empty unless it hits an
> "error" condition (link went down, transmit ring full, etc.).  If it
> hits an "error" condition, the driver is responsible for restarting
> transmit when the condition clears.  I believe our drivers already
> work this way now.
>
> The 'busy' patch is at http://www.freebsd.org/~jhb/patches/drbr.patch
>
> --
> John Baldwin

Q1: Has this been corrected?

Q2: Are there any case studies or benchmarks for buf_ring, or is it
just being used blindly because someone claimed it was better and
offered it for free? One of the points of locking is to avoid race
conditions, so the fact that you have races in a supposedly lock-less
scheme seems more than just ironic.

BC
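P.S. For anyone comparing the two schemes, my reading of the try_fail3
'pending' variant is roughly the sketch below, again reusing the
txq/drain_queue stand-ins from earlier in the thread.  The
drbr_enqueue_pending()/drbr_clear_pending() names are invented here
for illustration; see the actual patch for the real interface.

static int
pending_if_transmit(struct ifnet *ifp, struct txq *txq, struct mbuf *m)
{
        int error;

        if (mtx_trylock(&txq->lock)) {
                error = drbr_enqueue(ifp, txq->br, m);
                if (error == 0)
                        drain_queue(ifp, txq);
                mtx_unlock(&txq->lock);
                return (error);
        }

        /*
         * Hypothetical enqueue variant that atomically sets the
         * ring's 'pending' flag.  EINPROGRESS means *we* set it, so
         * we are the one thread that must block on the lock and
         * drain.  A later thread that fails the try-lock while the
         * flag is set just gets 0 back and returns immediately.
         */
        error = drbr_enqueue_pending(ifp, txq->br, m);
        if (error == EINPROGRESS) {
                mtx_lock(&txq->lock);
                drbr_clear_pending(txq->br);    /* the explicit clear step */
                drain_queue(ifp, txq);
                mtx_unlock(&txq->lock);
                error = 0;
        }
        return (error);
}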