Date: Sun, 18 Nov 2012 09:24:01 -0800 (PST)
From: Barney Cordoba <barney_cordoba@yahoo.com>
To: John Baldwin <jhb@freebsd.org>
Cc: freebsd-net@freebsd.org
Subject: Re: Latency issues with buf_ring
Message-ID: <1353259441.19423.YahooMailClassic@web121605.mail.ne1.yahoo.com>
In-Reply-To: <201201191141.25998.jhb@freebsd.org>
--- On Thu, 1/19/12, John Baldwin <jhb@freebsd.org> wrote:

> From: John Baldwin <jhb@freebsd.org>
> Subject: Latency issues with buf_ring
> To: net@freebsd.org
> Cc: "Ed Maste" <emaste@freebsd.org>, "Navdeep Parhar" <np@freebsd.org>
> Date: Thursday, January 19, 2012, 11:41 AM
>
> The current buf_ring usage in various NIC drivers has a race that can
> result in really high packet latencies in some cases.  Specifically,
> the common pattern in an if_transmit routine is to use a try-lock on
> the queue and, if that fails, enqueue the packet in the buf_ring and
> return.  The race, of course, is that the thread holding the lock
> might have just finished checking the buf_ring, found it empty, and
> be in the process of releasing the lock when the original thread fails
> the try-lock.  If this happens, the packet queued by the first thread
> is stalled until another thread tries to transmit packets for that
> queue.  Some drivers attempt to handle this race (igb(4) schedules a
> task to kick the transmit queue if the try-lock fails) and others
> don't (cxgb(4) doesn't handle it at all).  At work this race was
> triggered very often after upgrading from 7 to 8 with bursty traffic
> and caused numerous problems, so it is not a rare occurrence and needs
> to be addressed.
>
> (Note, all patches mentioned are against 8.)
>
> The first hack I tried was to simply always lock the queue after the
> drbr enqueue if the try-lock failed and then drain the queue if needed
> (www.freebsd.org/~jhb/patches/try_fail.patch).  While this fixed my
> latency problems, it would seem to break other workloads that the drbr
> design is trying to optimize.
>
> After further hacking, what I came up with was a variant of
> drbr_enqueue() that atomically sets a 'pending' flag during the
> enqueue operation.  The first thread to fail the try-lock sets this
> flag (it is told that it set the flag by a new return value,
> EINPROGRESS, from the enqueue call).  The pending thread then
> explicitly clears the flag once it acquires the queue lock.  This
> should prevent multiple threads from stacking up on the queue lock, so
> that if multiple threads are dumping packets into the ring
> concurrently, all but two (the one currently draining the queue and
> the one waiting for the lock) can continue on without blocking.  One
> downside of this approach, though, is that each driver has to be
> changed to make an explicit call to clear the pending flag after
> grabbing the queue lock when the try-lock fails.  This is what I am
> currently running in production
> (www.freebsd.org/~jhb/patches/try_fail3.patch).
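To make the race concrete, here is a minimal sketch of the kind of
if_transmit routine being described.  All of the names (xx_transmit,
xx_start_locked, struct xx_softc, struct xx_txq) are placeholders for
illustration, not code from any particular driver:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <sys/buf_ring.h>
#include <net/if.h>
#include <net/if_var.h>

struct xx_txq {
	struct mtx	 txq_mtx;	/* per-queue transmit lock */
	struct buf_ring	*txq_br;	/* software transmit ring */
};

struct xx_softc {
	struct xx_txq	 txq;		/* a single queue, for brevity */
};

static void	xx_start_locked(struct ifnet *ifp, struct xx_txq *txq);

static int
xx_transmit(struct ifnet *ifp, struct mbuf *m)
{
	struct xx_softc *sc = ifp->if_softc;
	struct xx_txq *txq = &sc->txq;
	int error;

	if (mtx_trylock(&txq->txq_mtx)) {
		/* We own the queue: enqueue and drain the ring ourselves. */
		error = drbr_enqueue(ifp, txq->txq_br, m);
		if (error == 0)
			xx_start_locked(ifp, txq);
		mtx_unlock(&txq->txq_mtx);
		return (error);
	}

	/*
	 * Try-lock failed: another thread holds the queue lock, so just
	 * enqueue and return.  The race is in this window -- the lock
	 * holder may already have found the ring empty and be about to
	 * release the lock, in which case this packet sits in the ring
	 * until some later transmit (or a driver task, as in igb(4))
	 * kicks the queue.
	 */
	return (drbr_enqueue(ifp, txq->txq_br, m));
}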
> However, this still results in a lot of duplicated code in each driver
> that wants to support multiq.  Several folks have expressed a desire
> to move in a direction where the stack has explicit knowledge of
> transmit queues, allowing us to hoist some of this duplicated code out
> of the drivers and up into the calling layer.  After discussing this a
> bit with Navdeep (np@), the approach I am looking at is to alter the
> buf_ring code flow a bit to more closely model the older code flow
> with IFQ and if_start methods.  That is, have the if_transmit methods
> always enqueue each packet that arrives to the buf_ring and then call
> an if_start-like method that drains a specific transmit queue.  This
> approach simplifies a fair bit of driver code and means we can
> potentially move the enqueue, etc. bits up into the calling layer and
> instead have drivers provide the per-transmit-queue start routine as
> the direct function pointer to the upper layers, a la if_start.
>
> However, we would still need a way to close the latency race.  I've
> attempted to do that by inverting my previous 'thread pending' flag.
> Instead, I make the buf_ring store a 'busy' flag.  This flag is
> managed by the single-consumer buf_ring dequeue method (the one
> drbr_dequeue() uses).  It is set to true when a packet is removed from
> the queue while there are more packets pending.  Conversely, if there
> are no other pending packets, then it is set to false.  The assumption
> is that once a thread starts draining the queue, it will not stop
> until the queue is empty (or, if it has to stop for some other reason
> such as the transmit ring being full, the driver will restart draining
> of the queue until it is empty, e.g. after it receives a transmit
> completion interrupt).  Now when the if_transmit routine enqueues the
> packet, it will get either a real error, 0 if the packet was enqueued
> and the queue was not busy, or EINPROGRESS if the packet was enqueued
> and the queue was busy.  For the EINPROGRESS case the if_transmit
> routine just returns success.  For the 0 case it does a blocking lock
> on the queue lock and calls the queue's start routine (note that this
> means the busy flag is similar to the old OACTIVE interface flag).
> This does mean that in some cases you may have one thread that is
> sending what was the last packet in the buf_ring holding the lock when
> another thread blocks, and that the first thread will see the new
> packet when it loops back around, so the second thread is wasting its
> time spinning, but in the common case I believe it will give the same
> parallelism as the current code.  OTOH, there is nothing to prevent
> multiple threads from "stacking up" in the new approach.  At least the
> try_fail3 patch ensured only one thread at a time would ever
> potentially block on the queue lock.
>
> Another approach might be to replace the 'busy' flag with the 'thread
> pending' flag from try_fail3.patch, but to clear the 'thread pending'
> flag any time the dequeue method is called rather than using an
> explicit 'clear pending' method.  (Hadn't thought of that until
> writing this e-mail.)  That would perhaps prevent multiple threads
> from waiting on the queue lock.
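A rough sketch of what an if_transmit routine might look like under the
proposed 'busy' scheme, reusing the placeholder types from the sketch
above.  The enqueue variant is written here as xx_drbr_enqueue_busy()
to make clear it is a stand-in for whatever drbr.patch actually
provides, not the stock drbr API:

#include <sys/errno.h>

/* Hypothetical enqueue variant returning 0, EINPROGRESS, or a real error. */
static int	xx_drbr_enqueue_busy(struct ifnet *ifp, struct buf_ring *br,
		    struct mbuf *m);

static int
xx_transmit_busy(struct ifnet *ifp, struct mbuf *m)
{
	struct xx_softc *sc = ifp->if_softc;
	struct xx_txq *txq = &sc->txq;
	int error;

	error = xx_drbr_enqueue_busy(ifp, txq->txq_br, m);
	if (error == EINPROGRESS) {
		/*
		 * Enqueued and the queue is busy: another thread is
		 * draining and will see this packet, so just report
		 * success.
		 */
		return (0);
	}
	if (error != 0)
		return (error);		/* real error, e.g. ring full */

	/*
	 * Enqueued and the queue was not busy: take the queue lock with
	 * a blocking acquire (not a try-lock) and drain it, much like
	 * the old OACTIVE/if_start model.
	 */
	mtx_lock(&txq->txq_mtx);
	xx_start_locked(ifp, txq);
	mtx_unlock(&txq->txq_mtx);
	return (0);
}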
> Note that the 'busy' approach (or the modification I mentioned above)
> does rely on the assumption I stated above, i.e. once a driver starts
> draining a queue, it will drain it until empty unless it hits an
> "error" condition (link went down, transmit ring full, etc.).  If it
> hits an "error" condition, the driver is responsible for restarting
> transmit when the condition clears.  I believe our drivers already
> work this way now.
>
> The 'busy' patch is at http://www.freebsd.org/~jhb/patches/drbr.patch
>
> -- 
> John Baldwin

Q1: Has this been corrected?

Q2: Are there any case studies or benchmarks for buf_ring, or is it just
blindly being used because someone claimed it was better and offered it
for free?  One of the points of locking is to avoid race conditions, so
the fact that you have races in a supposed lock-less scheme seems more
than just ironic.


BC