Date:      Wed, 5 Dec 2012 14:31:17 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Andre Oppermann <oppermann@networx.ch>
Cc:        Barney Cordoba <barney_cordoba@yahoo.com>, Adrian Chadd <adrian@FreeBSD.org>, John Baldwin <jhb@FreeBSD.org>, freebsd-net@FreeBSD.org
Subject:   Re: Latency issues with buf_ring
Message-ID:  <20121205112511.Q932@besplex.bde.org>
In-Reply-To: <50BE56C8.1030804@networx.ch>
References:  <1353259441.19423.YahooMailClassic@web121605.mail.ne1.yahoo.com> <201212041108.17645.jhb@freebsd.org> <CAJ-Vmo=tFFkeK2uADMPuBrgX6wN_9TSjAgs0WKPCrEfyhkG6Pw@mail.gmail.com> <50BE56C8.1030804@networx.ch>

On Tue, 4 Dec 2012, Andre Oppermann wrote:

> For most if not all ethernet drivers from 100Mbit/s the TX DMA rings
> are so large that buffering at the IFQ level doesn't make sense anymore
> and only adds latency.

I found sort of the opposite for bge at 1Gbps.  Most or all bge NICs
have a tx ring size of 512.  The ifq length is the tx ring size minus
1 (511).  I needed to expand this to imax(2 * tick / 4, 10000) to
maximize pps.  This does bad things to latency and worse things to
caching (512 buffers might fit in the L2 cache, but 10000 buffers
bust any reasonably sized cache as they are cycled through), but I only
tried to optimize tx pps.
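
For concreteness, the expansion looks roughly like this in the driver's
attach routine (a sketch only: IFQ_SET_MAXLEN(), BGE_TX_RING_CNT, imax()
and the kernel global `tick' are the usual FreeBSD names, but exactly
where this lands in bge_attach() is glossed over here):

	/* Default: the ifq is 1 smaller than the tx ring. */
	IFQ_SET_MAXLEN(&ifp->if_snd, BGE_TX_RING_CNT - 1);
	/*
	 * Expansion to maximize pps: thousands of entries, so that the
	 * tx ring can be replenished from the ifq despite scheduling
	 * latencies.  tick is the kernel's usec per timer tick.
	 */
	IFQ_SET_MAXLEN(&ifp->if_snd, imax(2 * tick / 4, 10000));
	IFQ_SET_READY(&ifp->if_snd);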

> So it could simply directly put everything into
> the TX DMA and not even try to soft-queue.  If the TX DMA ring is full
> ENOBUFS is returned instead of filling yet another queue.

That could work, but upper layers currently don't understand ENOBUFS
at all, so it would work poorly now.  Also, 512 entries is not many,
so even if upper layers understood ENOBUFS it is not easy for them to
_always_ respond fast enough to keep the tx active, unless there are
upstream buffers with many more than 512 entries.  There needs to be
enough buffering somewhere so that the tx ring can be replenished
almost instantly from the buffer, to handle the worst-case latency
for the threads generating new (unbuffered) packets.  At the line rate
of ~1.5 Mpps for 1 Gbps, the maximum latency that can be covered by
512 entries is only 340 usec.
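
As a sanity check on that figure, a self-contained computation (1.5 Mpps
is slightly rounded up; the minimal-frame line rate is nearer 1.488 Mpps):

#include <stdio.h>

int
main(void)
{
	double rate = 1500000.0;	/* ~line rate in pps at 1 Gbps */
	int ring = 512;			/* bge tx ring entries */

	/* Worst-case latency that a full tx ring can cover. */
	printf("%.0f usec\n", ring / rate * 1e6);	/* ~341 usec */
	return (0);
}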

> However there
> are ALTQ interactions and other mechanisms which have to be considered
> too making it a bit more involved.

I didn't try to handle ALTQ or even optimize for TCP.

More details: to maximize pps, the main detail is to ensure that the tx
ring never becomes empty.  The tx then transmits as fast as possible.
This requires some watermark processing, but FreeBSD has almost none
for tx rings.  The following normally happens for packet generators
like ttcp and netsend:

- loop calling send() or sendto() until the tx ring (and also any
   upstream buffers) fills up.  Then ENOBUFS is returned.

- watermark processing is broken in the user API at this point.  There
   is no way for the application to wait for the ENOBUFS condition to
   go away (select() and poll() don't work).  Applications use poor
   workarounds:

- old (~1989) ttcp sleeps for 18 msec when send() returns ENOBUFS.  This
   was barely good enough for 1 Mbps ethernet (line rate ~1500 pps is 27
   per 18 msec, so IFQ_MAXLEN = 50 combined with just a 1-entry tx ring
   provides a safety factor of about 2).  Expansion of the tx ring size to
   512 makes this work with 10 Mbps ethernet too.  Expansion of the ifq
   to 511 gives another factor of 2.  After losing the safety factor of 2,
   we can now handle 40 Mbps ethernet, and are only a factor of 25 short
   for 1 Gbps.  My hardware can't do line rate for small packets -- it
   can only do 640 kpps.  Thus ttcp is only a factor of 11 short of
   supporting the hardware at 1 Gbps.

   This assumes that sleeps of 18 msec are actually possible, which
   they aren't with HZ = 100: the granularity is then 10 msec, so
   sleep(18 msec) actually sleeps for an average of 23 msec.  -current
   uses the bad default of HZ = 1000.  With that, sleep(18 msec) would
   average 18.5 msec.  Of course, ttcp should sleep for more like 1
   msec if that is possible.  Then the average sleep is 1.5 msec.  ttcp
   can keep up with the hardware with that, and is only slightly behind
   the hardware with the worst-case sleep of 2 msec (512+511 packets
   generated every 2 msec is 511.5 kpps).

   I normally use old ttcp, except I modify it to sleep for 1 msec instead
   of 18 in one version, and in another version I remove the sleep so that
   it busy-waits in a loop that calls send() which almost always returns
   ENOBUFS (a sketch of this loop follows this list).  The latter wastes a
   lot of CPU, but is almost good enough for throughput testing.

- newer ttcp tries to program the sleep time in microseconds.  This doesn't
   really work, since the sleep granularity is normally at least a millisecond,
   and even if it could be as fine as the 340 microseconds needed by bge with
   no ifq (see above; better to divide the 340 by 2), such a short sleep would
   take almost as much CPU as busy-waiting.  I consider HZ = 1000 to be
   another form of polling/busy-waiting and don't use it except for testing.

- netrate/netsend also uses a programmed sleep time.  This doesn't really
   work, as above.  netsend also tries to limit its rate based on sleeping.
   This is further from working, since even finer-grained sleeps are needed
   to limit the rate accurately than to keep up with the maximum rate.
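
To make the workarounds concrete, here is a sketch of the generator
inner loop as I use it (the address, port and packet size are
placeholders; the 1-msec-sleep version and the busy-wait version differ
only in whether the usleep() is there):

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	struct sockaddr_in sin;
	char buf[18];
	int s;

	s = socket(AF_INET, SOCK_DGRAM, 0);
	if (s == -1)
		return (1);
	memset(buf, 0, sizeof(buf));
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(5001);
	sin.sin_addr.s_addr = inet_addr("10.0.0.2");
	for (;;) {
		if (sendto(s, buf, sizeof(buf), 0,
		    (struct sockaddr *)&sin, sizeof(sin)) == -1) {
			if (errno != ENOBUFS)
				break;
			/*
			 * The sleep's real granularity is 1/HZ, so
			 * usleep(1000) averages ~1.5 msec with HZ = 1000.
			 * Delete it to get the busy-wait version.
			 */
			usleep(1000);
		}
	}
	return (1);
}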

Watermark processing at the kernel level is not quite as broken.  It
is mostly non-existent, but partly works, sort of accidentally.  The
difference is that there is now a tx "eof" or "completion" interrupt
which indicates that the condition corresponding to ENOBUFS has gone
away, so the kernel doesn't have to poll for this.  This
is not really an "eof" interrupt (unless bge is programmed insanely,
to interrupt only after the tx ring is completely empty).  It acts as
primitive watermarking.  bge can be programmed to interrupt after
having sent every N packets (strictly, after every N buffer descriptors,
but for small packets these are the same).  When there are more than
N packets to start, say M, this acts as a watermark at M-N packets.
bge is normally misprogrammed with N = 10.  At the line rate of 1.5 Mpps,
this asks for an interrupt rate of 150 kHz, which is far too high and
is usually unreachable, so reaching the line rate is impossible due to
the CPU load from the interrupts.  I use N = 384 or 256 so that the
interrupt rate is not the dominant limit.  However, N = 10 is better
for latency and works under light loads.  It also reduces the amount
of buffering needed.
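
In driver terms, N is the tx coalescing threshold in the host
coalescing engine.  The reprogramming amounts to something like the
following (the register and macro names are from my reading of
if_bgereg.h; treat the exact names as assumptions):

	/*
	 * Interrupt after every N completed tx buffer descriptors.
	 * N = 10 asks for ~150 kHz of interrupts at the 1.5 Mpps line
	 * rate; N = 256 asks for ~5.9 kHz, which is sustainable.
	 */
	CSR_WRITE_4(sc, BGE_HCC_TX_MAX_COAL_BDS, 256);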

The ifq works more as part of accidental watermarking than as a buffer.
It is the same size as the tx ring (actually 1 smaller for bogus reasons),
so it is not really useful as a buffer.  However, with no explicit
watermarking, any separate buffer like the ifq provides a sort of
watermark at the boundary between the buffers.  The usefulness of this
would be most obvious if the tx "eof" interrupt were actually for eof
(perhaps that is what it was originally).  Then on the eof interrupt,
there is no time at all to generate new packets, and the time when the
tx is idle can be minimized by keeping pre-generated packets handy where
they can be copied to the tx ring at tx "eof" interrupt time.  A buffer
of about the same size as the tx ring (or maybe 1/4 the size) is enough
for this.
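
A sketch of the refill at tx "eof" interrupt time in a generic driver
(xx_txeof(), xx_encap() and the softc fields are hypothetical names,
not bge's):

static void
xx_txeof(struct xx_softc *sc)
{
	struct mbuf *m;

	/* First reclaim completed descriptors (not shown). */

	/*
	 * Then replenish the ring straight from the ifq.  The packets
	 * there are pre-generated, so this takes almost no time, and
	 * the ring stays non-empty even though no thread could generate
	 * new packets this quickly.
	 */
	while (sc->tx_free > 0) {
		IFQ_DEQUEUE(&sc->ifp->if_snd, m);
		if (m == NULL)
			break;
		xx_encap(sc, m);	/* copy m to the tx ring */
		sc->tx_free--;
	}
}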

OTOH, with bge misprogrammed to interrupt after every 10 tx packets, the
ifq is useless for its watermark purposes.  The watermark is effectively
in the tx ring, and very strangely placed there at 10 below the top
(ring full).  Normally tx watermarks are placed near the bottom (ring
empty).  They must not be placed too near the bottom, else there would
not be enough time to replenish the ring between the time when the "eof"
(really, the "watermark") interrupt is received and when the tx runs
dry.  They should not be placed too near the top like they are in
-current's bge, else the point of having a large tx ring is defeated and
there are too many interrupts.  However, when they are placed near the
top, latency requirements are reduced.

I recently worked on buffering for sio and noticed similar related
problems for tx watermarks.  Don't laugh -- serial i/o 1 character at
a time at 3.686400 Mbps has much the same timing requirements as
ethernet i/o 1 packet at a time at 1 Gbps.  Each serial character
takes ~2.7 usec and each minimal ethernet packet takes ~0.67 usec.
With tx "ring" sizes of 128 and 512 respectively, the ring times for
full to empty are 347 usec for serial i/o and 341 usec for ethernet i/o.
Strangely, tx is harder than rx because:
- perfection is possible and easier to measure for tx.  It consists of
   just keeping at least 1 entry in the tx ring at all times.  Latency
   must be kept below ~340 usec to have any chance of this.  This is not
   so easy to achieve under _all_ loads.
- for rx, you have an external source generating the packets, so you
   don't have to worry about latency affecting the generators.
- the need for watermark processing is better known for rx, since it
   obviously doesn't work to generate the rx "eof" interrupt near the
   top.
The serial timing was actually harder to satisfy, because I worked on
it on a 366 MHz CPU while I worked on bge on a 2 GHz CPU, and even the
2 GHz CPU couldn't keep up with the line rate (at my hardware's 640 kpps,
the 512-entry ring takes 800 usec to go from full to empty).
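
The numbers again as a self-contained computation:

#include <stdio.h>

int
main(void)
{
	double sio_char = 10.0 / 3686400.0 * 1e6;	/* 10 bits/char */
	double ether_pkt = 1e6 / 1500000.0;		/* minimal frames */

	printf("sio: %.2f usec/char, 128-entry ring: %.0f usec\n",
	    sio_char, 128 * sio_char);
	printf("ether: %.2f usec/pkt, 512-entry ring: %.0f usec\n",
	    ether_pkt, 512 * ether_pkt);
	return (0);
}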

It turned out that the best position for the tx low watermark is about
1/4 or 1/2 from the bottom for both sio and bge.  It must be fairly
high, else the latency requirements are not met.  In the middle is a
good general position.  Although it apparently "wastes" half of the ring
to make the latency requirements easier to meet (without very
system-dependent tuning), the efficiency lost from this is reasonably
small.
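
The tradeoff is easy to quantify: the low watermark is the number of
packets still queued when the "eof" interrupt fires, so watermark/rate
is the time available for refilling before the tx runs dry:

#include <stdio.h>

int
main(void)
{
	double rates[] = { 1500000.0, 640000.0 };	/* line rate; my hw */
	int marks[] = { 128, 256 };		/* 1/4 and 1/2 of 512 */
	int i, j;

	for (i = 0; i < 2; i++)
		for (j = 0; j < 2; j++)
			printf("watermark %d at %.0f kpps: %.0f usec\n",
			    marks[j], rates[i] / 1e3,
			    marks[j] / rates[i] * 1e6);
	return (0);
}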

Bruce


