Date:      Wed, 5 Dec 2012 04:32:57 -0800 (PST)
From:      Barney Cordoba <barney_cordoba@yahoo.com>
To:        Andre Oppermann <oppermann@networx.ch>, Bruce Evans <brde@optusnet.com.au>
Cc:        freebsd-net@FreeBSD.org, Adrian Chadd <adrian@FreeBSD.org>, John Baldwin <jhb@FreeBSD.org>
Subject:   Re: Latency issues with buf_ring
Message-ID:  <1354710777.97879.YahooMailClassic@web121602.mail.ne1.yahoo.com>
In-Reply-To: <20121205112511.Q932@besplex.bde.org>

--- On Tue, 12/4/12, Bruce Evans <brde@optusnet.com.au> wrote:

> From: Bruce Evans <brde@optusnet.com.au>
> Subject: Re: Latency issues with buf_ring
> To: "Andre Oppermann" <oppermann@networx.ch>
> Cc: "Adrian Chadd" <adrian@FreeBSD.org>, "Barney Cordoba" <barney_cordoba@yahoo.com>, "John Baldwin" <jhb@FreeBSD.org>, freebsd-net@FreeBSD.org
> Date: Tuesday, December 4, 2012, 10:31 PM
> On Tue, 4 Dec 2012, Andre Oppermann wrote:
>
> > For most if not all ethernet drivers from 100Mbit/s the TX DMA rings
> > are so large that buffering at the IFQ level doesn't make sense anymore
> > and only adds latency.
>
> I found sort of the opposite for bge at 1Gbps.  Most or all bge NICs
> have a tx ring size of 512.  The ifq length is the tx ring size minus
> 1 (511).  I needed to expand this to imax(2 * tick / 4, 10000) to
> maximize pps.  This does bad things to latency and worse things to
> caching (512 buffers might fit in the L2 cache, but 10000 buffers
> bust any reasonable cache as they are cycled through), but I only
> tried to optimize tx pps.
>
> > So it could simply directly put everything into
> > the TX DMA and not even try to soft-queue.  If the TX DMA ring is full
> > ENOBUFS is returned instead of filling yet another queue.
>
> That could work, but upper layers currently don't understand ENOBUFS
> at all, so it would work poorly now.  Also, 512 entries is not many,
> so even if upper layers understood ENOBUFS it is not easy for them to
> _always_ respond fast enough to keep the tx active, unless there are
> upstream buffers with many more than 512 entries.  There needs to be
> enough buffering somewhere so that the tx ring can be replenished
> almost instantly from the buffer, to handle the worst-case latency
> for the threads generating new (unbuffered) packets.  At the line rate
> of ~1.5 Mpps for 1 Gbps, the maximum latency that can be covered by
> 512 entries is only 340 usec.
>
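For reference, the ~340 usec figure above falls straight out of the minimum
frame time at 1 Gbps; a rough sketch of the arithmetic, assuming minimal
64-byte frames plus 8 bytes of preamble and a 12-byte inter-frame gap
(84 byte times per packet):

    #include <stdio.h>

    int
    main(void)
    {
        /* Minimal frame on the wire: 64 bytes of frame plus 8 bytes of
         * preamble and a 12-byte inter-frame gap. */
        const double wire_bytes = 64 + 8 + 12;          /* 84 */
        const double pkt_time = wire_bytes * 8 / 1e9;   /* ~672 nsec at 1 Gbps */
        const int tx_ring_size = 512;                   /* typical bge tx ring */

        printf("line rate: %.2f Mpps\n", 1.0 / pkt_time / 1e6);
        printf("latency covered by %d entries: %.0f usec\n",
            tx_ring_size, tx_ring_size * pkt_time * 1e6);   /* ~344, i.e. ~340 */
        return (0);
    }
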
> > However there are ALTQ interactions and other mechanisms which have
> > to be considered too, making it a bit more involved.
>
> I didn't try to handle ALTQ or even optimize for TCP.
>
> More details: to maximize pps, the main detail is to ensure that the tx
> ring never becomes empty.  The tx then transmits as fast as possible.
> This requires some watermark processing, but FreeBSD has almost none
> for tx rings.  The following normally happens for packet generators
> like ttcp and netsend:
>
> - loop calling send() or sendto() until the tx ring (and also any
>   upstream buffers) fill up.  Then ENOBUFS is returned.
>
> - watermark processing is broken in the user API at this point.  There
>   is no way for the application to wait for the ENOBUFS condition to
>   go away (select() and poll() don't work).  Applications use poor
>   workarounds:
>
> - old (~1989) ttcp sleeps for 18 msec when send() returns ENOBUFS.  This
>   was barely good enough for 1 Mbps ethernet (line rate ~1500 pps is 27
>   per 18 msec, so IFQ_MAXLEN = 50 combined with just a 1-entry tx ring
>   provides a safety factor of about 2).  Expansion of the tx ring size to
>   512 makes this work with 10 Mbps ethernet too.  Expansion of the ifq
>   to 511 gives another factor of 2.  After losing the safety factor of 2,
>   we can now handle 40 Mbps ethernet, and are only a factor of 25 short
>   for 1 Gbps.  My hardware can't do line rate for small packets -- it
>   can only do 640 kpps.  Thus ttcp is only a factor of 11 short of
>   supporting the hardware at 1 Gbps.
>
>   This assumes that sleeps of 18 msec are actually possible, which
>   they aren't with HZ = 100 giving a granularity of 10 msec so that
>   sleep(18 msec) actually sleeps for an average of 23 msec.  -current
>   uses the bad default of HZ = 1000.  With that, sleep(18 msec) would
>   average 18.5 msec.  Of course, ttcp should sleep for more like 1
>   msec if that is possible.  Then the average sleep is 1.5 msec.  ttcp
>   can keep up with the hardware with that, and is only slightly behind
>   the hardware with the worst-case sleep of 2 msec (512+511 packets
>   generated every 2 msec is 511.5 kpps).
>
>   I normally use old ttcp, except I modify it to sleep for 1 msec instead
>   of 18 in one version, and in another version I remove the sleep so that
>   it busy-waits in a loop that calls send() which almost always returns
>   ENOBUFS.  The latter wastes a lot of CPU, but is almost good enough
>   for throughput testing.
>
> - newer ttcp tries to program the sleep time in microseconds.  This doesn't
>   really work, since the sleep granularity is normally at least a millisecond,
>   and even if it could be the 340 microseconds needed by bge with no ifq
>   (see above, and better divide the 340 by 2), then this is quite short
>   and would take almost as much CPU as busy-waiting.  I consider HZ = 1000
>   to be another form of polling/busy-waiting and don't use it except for
>   testing.
>
> - netrate/netsend also uses a programmed sleep time.  This doesn't really
>   work, as above.  netsend also tries to limit its rate based on sleeping.
>   This is further from working, since even finer-grained sleeps are needed
>   to limit the rate accurately than to keep up with the maximum rate.
>
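The workaround pattern described in the list above (sleep briefly on ENOBUFS
and retry) amounts to something like the following; a minimal sketch,
assuming an already-connected UDP socket and remembering that the effective
sleep is rounded up to the tick granularity:

    #include <sys/types.h>
    #include <sys/socket.h>

    #include <errno.h>
    #include <unistd.h>

    /*
     * Blast fixed-size packets out a connected UDP socket, sleeping
     * briefly whenever the stack reports ENOBUFS (tx ring and ifq full).
     * The nominal 1 msec sleep averages ~1.5 msec with HZ = 1000 and
     * far more with HZ = 100.
     */
    void
    blast(int s, const char *buf, size_t len)
    {
        for (;;) {
            if (send(s, buf, len, 0) == -1) {
                if (errno == ENOBUFS) {
                    usleep(1000);       /* ~1 msec; 18000 in old ttcp */
                    continue;
                }
                break;                  /* a real error */
            }
        }
    }
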
> Watermark processing at the kernel level is not quite as broken.  It
> is mostly non-existent, but partly works, sort of accidentally.  The
> difference is now that there is a tx "eof" or "completion" interrupt
> which indicates the condition corresponding to the ENOBUFS condition
> going away, so that the kernel doesn't have to poll for this.  This
> is not really an "eof" interrupt (unless bge is programmed insanely,
> to interrupt only after the tx ring is completely empty).  It acts as
> primitive watermarking.  bge can be programmed to interrupt after
> having sent every N packets (strictly, after every N buffer descriptors,
> but for small packets these are the same).  When there are more than
> N packets to start, say M, this acts as a watermark at M-N packets.
> bge is normally misprogrammed with N = 10.  At the line rate of 1.5 Mpps,
> this asks for an interrupt rate of 150 kHz, which is far too high and
> is usually unreachable, so reaching the line rate is impossible due to
> the CPU load from the interrupts.  I use N = 384 or 256 so that the
> interrupt rate is not the dominant limit.  However, N = 10 is better
> for latency and works under light loads.  It also reduces the amount
> of buffering needed.
>
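Written out, the tradeoff is just interrupt rate against watermark position;
a small illustration using the ring size and the N values quoted above:

    #include <stdio.h>

    int
    main(void)
    {
        const double line_rate_pps = 1.5e6;     /* minimal packets at 1 Gbps */
        const int ring_size = 512;              /* M, when the ring starts full */
        const int coal[] = { 10, 256, 384 };    /* interrupt every N tx packets */

        for (size_t i = 0; i < sizeof(coal) / sizeof(coal[0]); i++) {
            int n = coal[i];
            printf("N = %3d: interrupt rate %5.1f kHz, watermark at %d of %d\n",
                n, line_rate_pps / n / 1e3, ring_size - n, ring_size);
        }
        return (0);
    }

With N = 10 this gives the 150 kHz interrupt rate and the watermark 10 below
the top of the ring mentioned above.
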
> The ifq works more as part of accidental watermarking than as a buffer.
> It is the same size as the tx ring (actually 1 smaller for bogus reasons),
> so it is not really useful as a buffer.  However, with no explicit
> watermarking, any separate buffer like the ifq provides a sort of
> watermark at the boundary between the buffers.  The usefulness of this
> would be most obvious if the tx "eof" interrupt were actually for eof
> (perhaps that is what it was originally).  Then on the eof interrupt,
> there is no time at all to generate new packets, and the time when the
> tx is idle can be minimized by keeping pre-generated packets handy where
> they can be copied to the tx ring at tx "eof" interrupt time.  A buffer
> of about the same size as the tx ring (or maybe 1/4 the size) is enough
> for this.
>
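In driver terms, the role the ifq (or any soft queue) plays here is a staging
area that the tx completion handler can drain very quickly.  A hypothetical
sketch of such a handler follows; every name in it (struct txq, txq_reclaim(),
ring_space(), softq_dequeue(), ring_enqueue(), wakeup_senders()) is made up
for illustration and is not any real driver's API:

    #include <stddef.h>

    /* Opaque, invented types and helpers; declarations only. */
    struct pkt;
    struct txq;

    void        txq_reclaim(struct txq *);         /* free sent descriptors */
    int         ring_space(struct txq *);          /* free tx ring slots */
    struct pkt *softq_dequeue(struct txq *);       /* staged packet or NULL */
    void        ring_enqueue(struct txq *, struct pkt *);
    void        wakeup_senders(struct txq *);      /* ENOBUFS condition gone */

    /*
     * Tx completion ("eof") handler: reclaim sent descriptors, then
     * immediately top the ring back up from the pre-generated soft
     * queue so the transmitter never idles while fresh packets are
     * still being generated upstream.
     */
    void
    txq_completion(struct txq *q)
    {
        struct pkt *p;

        txq_reclaim(q);
        while (ring_space(q) > 0 && (p = softq_dequeue(q)) != NULL)
            ring_enqueue(q, p);
        if (ring_space(q) > 0)
            wakeup_senders(q);
    }
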
> OTOH, with bge misprogrammed to interrupt after every 10 tx packets, the
> ifq is useless for its watermark purposes.  The watermark is effectively
> in the tx ring, and very strangely placed there at 10 below the top
> (ring full).  Normally tx watermarks are placed near the bottom (ring
> empty).  They must not be placed too near the bottom, else there would
> not be enough time to replenish the ring between the time when the "eof"
> (really, the "watermark") interrupt is received and when the tx runs
> dry.  They should not be placed too near the top like they are in
> -current's bge, else the point of having a large tx ring is defeated
> and there are too many interrupts.  However, when they are placed
> near the top, latency requirements are reduced.
>
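The placement constraint can be written down directly: the entries still
queued when the watermark interrupt fires must cover the worst-case time
needed to replenish the ring.  A small illustration; the 100 usec worst-case
replenish latency is an assumed figure, not one taken from the text above:

    #include <stdio.h>

    int
    main(void)
    {
        const double pkt_time_us = 0.67;        /* minimal packet at 1 Gbps */
        const double worst_latency_us = 100.0;  /* assumed replenish latency */
        const int ring_size = 512;

        /* Entries that must still be queued when the interrupt fires. */
        int min_wmark = (int)(worst_latency_us / pkt_time_us) + 1;

        printf("low watermark >= %d of %d entries (%.0f%% up from empty)\n",
            min_wmark, ring_size, 100.0 * min_wmark / ring_size);
        return (0);
    }
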
> I recently worked on buffering for sio and noticed similar related
> problems for tx watermarks.  Don't laugh -- serial i/o 1 character at
> a time at 3.686400 Mbps has much the same timing requirements as
> ethernet i/o 1 packet at a time at 1 Gbps.  Each serial character
> takes ~2.7 usec and each minimal ethernet packet takes ~0.67 usec.
> With tx "ring" sizes of 128 and 512 respectively, the ring times for
> full to empty are 347 usec for serial i/o and 341 usec for ethernet i/o.
> Strangely, tx is harder than rx because:
> - perfection is possible and easier to measure for tx.  It consists of
>   just keeping at least 1 entry in the tx ring at all times.  Latency
>   must be kept below ~340 usec to have any chance of this.  This is not
>   so easy to achieve under _all_ loads.
> - for rx, you have an external source generating the packets, so you
>   don't have to worry about latency affecting the generators.
> - the need for watermark processing is better known for rx, since it
>   obviously doesn't work to generate the rx "eof" interrupt near the
>   top.
> The serial timing was actually harder to satisfy, because I worked on
> it on a 366 MHz CPU while I worked on bge on a 2 GHz CPU, and even the
> 2 GHz CPU couldn't keep up with line rate (so from full to empty takes
> 800 usec).
>
> It turned out that the best position for the tx low watermark is about
> 1/4 or 1/2 from the bottom for both sio and bge.  It must be fairly
> high, else the latency requirements are not met.  In the middle is a
> good general position.  Although it apparently "wastes" half of the ring
> to make the latency requirements easier to meet (without very
> system-dependent tuning), the efficiency lost from this is reasonably
> small.
>
> Bruce

I'm sure that Bill Paul is a nice man, but referencing drivers that were
written from a template and never properly load-tested doesn't really
illustrate anything.  All of his drivers are functional but optimized for
nothing.

BC


