Date:      Sun, 18 Nov 2012 09:24:01 -0800 (PST)
From:      Barney Cordoba <barney_cordoba@yahoo.com>
To:        John Baldwin <jhb@freebsd.org>
Cc:        freebsd-net@freebsd.org
Subject:   Re: Latency issues with buf_ring
Message-ID:  <1353259441.19423.YahooMailClassic@web121605.mail.ne1.yahoo.com>
In-Reply-To: <201201191141.25998.jhb@freebsd.org>

--- On Thu, 1/19/12, John Baldwin <jhb@freebsd.org> wrote:

> From: John Baldwin <jhb@freebsd.org>
> Subject: Latency issues with buf_ring
> To: net@freebsd.org
> Cc: "Ed Maste" <emaste@freebsd.org>, "Navdeep Parhar" <np@freebsd.org>
> Date: Thursday, January 19, 2012, 11:41 AM
>
> The current buf_ring usage in various NIC drivers has a race that can
> result in really high packet latencies in some cases.  Specifically,
> the common pattern in an if_transmit routine is to use a try-lock on
> the queue and if that fails enqueue the packet in the buf_ring and
> return.  The race, of course, is that the thread holding the lock
> might have just finished checking the buf_ring and found it empty and
> be in the process of releasing the lock when the original thread
> fails the try lock.  If this happens, then the packet queued by the
> first thread will be stalled until another thread tries to transmit
> packets for that queue.  Some drivers attempt to handle this race
> (igb(4) schedules a task to kick the transmit queue if the try lock
> fails) and others don't (cxgb(4) doesn't handle it at all).  At work
> this race was triggered very often after upgrading from 7 to 8 with
> bursty traffic and caused numerous problems, so it is not a rare
> occurrence and needs to be addressed.
>
> (Note, all patches mentioned are against 8)
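
The pattern being described is roughly the following sketch, using a
made-up "foo" driver with made-up queue and helper names (foo_txq,
foo_txq_drain()); only drbr_enqueue() and the mtx(9) calls are real
interfaces, and includes/struct definitions are omitted:

static int
foo_if_transmit(struct ifnet *ifp, struct mbuf *m)
{
	struct foo_softc *sc = ifp->if_softc;
	struct foo_txq *txq = &sc->txq[0];	/* real code picks by flowid */
	int error;

	if (mtx_trylock(&txq->mtx)) {
		/* We own the queue: enqueue and drain it ourselves. */
		error = drbr_enqueue(ifp, txq->br, m);
		if (error == 0)
			foo_txq_drain(txq);	/* push the buf_ring to hardware */
		mtx_unlock(&txq->mtx);
		return (error);
	}

	/*
	 * Try lock failed: just enqueue and return.  Here is the race:
	 * the lock holder may already have seen an empty ring and be
	 * about to unlock, so this packet can sit unsent until the next
	 * transmit attempt (or a driver-specific kick, as in igb(4)).
	 */
	return (drbr_enqueue(ifp, txq->br, m));
}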

> The first hack I tried to use was to simply always lock the queue
> after the drbr enqueue if the try lock failed and then drain the
> queue if needed (www.freebsd.org/~jhb/patches/try_fail.patch).  While
> this fixed my latency problems, it would seem that this breaks other
> workloads that the drbr design is trying to optimize.
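
Continuing the made-up foo driver above, the failure path after that
first hack would read roughly as follows (the authoritative change is
the try_fail.patch diff itself):

	/* Try lock failed: enqueue, then take the lock anyway and drain. */
	error = drbr_enqueue(ifp, txq->br, m);
	if (error != 0)
		return (error);
	mtx_lock(&txq->mtx);	/* blocking lock closes the race window... */
	foo_txq_drain(txq);	/* ...but every sender may now block here */
	mtx_unlock(&txq->mtx);
	return (0);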

> After further hacking, what I came up with was a variant of
> drbr_enqueue() that atomically sets a 'pending' flag during the
> enqueue operation.  The first thread to fail the try lock sets this
> flag (it learns that it set the flag via a new return value,
> EINPROGRESS, from the enqueue call).  The pending thread then
> explicitly clears the flag once it acquires the queue lock.  This
> should prevent multiple threads from stacking up on the queue lock,
> so that if multiple threads are dumping packets into the ring
> concurrently, all but two (the one currently draining the queue and
> the one waiting for the lock) can continue on without blocking.  One
> downside of this approach, though, is that each driver has to be
> changed to make an explicit call to clear the pending flag after
> grabbing the queue lock if the try lock fails.  This is what I am
> currently running in production
> (www.freebsd.org/~jhb/patches/try_fail3.patch).
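
From the driver's side, that flow looks roughly like the sketch below.
The helper names drbr_enqueue_flagged() and drbr_clear_pending() are
placeholders, not necessarily what try_fail3.patch actually calls them:

	if (mtx_trylock(&txq->mtx)) {
		error = drbr_enqueue(ifp, txq->br, m);
		if (error == 0)
			foo_txq_drain(txq);
		mtx_unlock(&txq->mtx);
		return (error);
	}

	/* Try lock failed: enqueue via the variant that may set 'pending'. */
	error = drbr_enqueue_flagged(ifp, txq->br, m);
	if (error == EINPROGRESS) {
		/* We were the first to fail and we set the flag, so we
		 * must take the lock, clear the flag, and drain. */
		mtx_lock(&txq->mtx);
		drbr_clear_pending(txq->br);
		foo_txq_drain(txq);
		mtx_unlock(&txq->mtx);
		error = 0;
	}
	/* 0 here means another thread is already pending; just return. */
	return (error);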

> However, this still results in a lot of duplicated code in each
> driver that wants to support multiq.  Several folks have expressed a
> desire to move in a direction where the stack has explicit knowledge
> of transmit queues, allowing us to hoist some of this duplicated code
> out of the drivers and up into the calling layer.  After discussing
> this a bit with Navdeep (np@), the approach I am looking at is to
> alter the buf_ring code flow a bit to more closely model the older
> code flow with IFQ and if_start methods.  That is, have the
> if_transmit methods always enqueue each packet that arrives to the
> buf_ring and then call an if_start-like method that drains a specific
> transmit queue.  This approach simplifies a fair bit of driver code
> and means we can potentially move the enqueue, etc. bits up into the
> calling layer and instead have drivers provide the per-transmit-queue
> start routine as a direct function pointer to the upper layers, a la
> if_start.
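
If that hoisting happens, the shared part could end up as a small
wrapper along these lines; this is purely illustrative (no such
wrapper exists in the tree or in the patches mentioned here), with
txq_start standing in for the per-queue start routine a driver would
register:

static int
stack_txq_transmit(struct ifnet *ifp, struct stack_txq *txq, struct mbuf *m)
{
	int error;

	/* Always enqueue, then kick the driver's per-queue start routine,
	 * much like the old IFQ + if_start flow. */
	error = drbr_enqueue(ifp, txq->br, m);
	if (error != 0)
		return (error);
	txq->txq_start(txq);	/* driver drains its buf_ring to hardware */
	return (0);
}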

> However, we would still need a way to close the latency race.  I've
> attempted to do that by inverting my previous 'thread pending' flag.
> Instead, I make the buf_ring store a 'busy' flag.  This flag is
> managed by the single-consumer buf_ring dequeue method (the one
> drbr_dequeue() uses).  It is set to true when a packet is removed
> from the queue while there are more packets pending.  Conversely, if
> there are no other pending packets then it is set to false.  The
> assumption is that once a thread starts draining the queue, it will
> not stop until the queue is empty (or, if it has to stop for some
> other reason such as the transmit ring being full, the driver will
> restart draining of the queue until it is empty, e.g. after it
> receives a transmit completion interrupt).  Now when the if_transmit
> routine enqueues the packet, it will get either a real error, 0 if
> the packet was enqueued and the queue was idle, or EINPROGRESS if the
> packet was enqueued and the queue was busy.  For the EINPROGRESS case
> the if_transmit routine just returns success.  For the 0 case it does
> a blocking lock on the queue lock and calls the queue's start routine
> (note that this means the busy flag is similar to the old OACTIVE
> interface flag).  This does mean that in some cases you may have one
> thread that is sending what was the last packet in the buf_ring
> holding the lock when another thread blocks, and that the first
> thread will see the new packet when it loops back around, so that the
> second thread is wasting its time spinning; but in the common case I
> believe it will give the same parallelism as the current code.  OTOH,
> there is nothing to prevent multiple threads from "stacking up" in
> the new approach.  At least the try_fail3 patch ensured only one
> thread at a time would ever potentially block on the queue lock.
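
Put concretely, the 'busy' scheme as described might look like the
sketch below; the br_busy field name and the exact dequeue placement
are guesses, and drbr.patch is the authoritative version:

/* Consumer side: the single-consumer dequeue maintains the flag.
 * 'busy' means a packet was just removed and more are still queued. */
static struct mbuf *
txq_dequeue(struct buf_ring *br)
{
	struct mbuf *m;

	m = buf_ring_dequeue_sc(br);
	br->br_busy = (m != NULL && !buf_ring_empty(br));	/* guessed field */
	return (m);
}

/* Producer side: an if_transmit handling the three enqueue outcomes. */
static int
txq_transmit(struct ifnet *ifp, struct foo_txq *txq, struct mbuf *m)
{
	int error;

	error = drbr_enqueue(ifp, txq->br, m);	/* 0, EINPROGRESS, or error */
	if (error == EINPROGRESS)
		return (0);		/* busy: the current drainer will pick it up */
	if (error != 0)
		return (error);		/* a real error, e.g. the ring is full */
	mtx_lock(&txq->mtx);		/* idle: we must kick the queue (cf. OACTIVE) */
	txq->txq_start(txq);
	mtx_unlock(&txq->mtx);
	return (0);
}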

> Another approach might be to replace the 'busy' flag with the 'thread
> pending' flag from try_fail3.patch, but to clear the 'thread pending'
> flag any time the dequeue method is called rather than using an
> explicit 'clear pending' method.  (I hadn't thought of that until
> writing this e-mail.)  That would perhaps prevent multiple threads
> from waiting on the queue lock.
>
> Note that the 'busy' approach (or the modification I mentioned above)
> does rely on the assumption I stated above, i.e. once a driver starts
> draining a queue, it will drain it until empty unless it hits an
> "error" condition (link went down, transmit ring full, etc.).  If it
> hits an "error" condition, the driver is responsible for restarting
> transmit when the condition clears.  I believe our drivers already
> work this way now.
>
> The 'busy' patch is at http://www.freebsd.org/~jhb/patches/drbr.patch
>
> --
> John Baldwin

Q1: Has this been corrected?

Q2: Are there any case studies or benchmarks for buf_ring, or is it
just blindly being used because someone claimed it was better and
offered it for free?  One of the points of locking is to avoid race
conditions, so the fact that you have races in a supposedly lock-less
scheme seems more than just ironic.

BC


