Date:      Wed, 5 Dec 2012 04:32:57 -0800 (PST)
From:      Barney Cordoba <barney_cordoba@yahoo.com>
To:        Andre Oppermann <oppermann@networx.ch>, Bruce Evans <brde@optusnet.com.au>
Cc:        freebsd-net@FreeBSD.org, Adrian Chadd <adrian@FreeBSD.org>, John Baldwin <jhb@FreeBSD.org>
Subject:   Re: Latency issues with buf_ring
Message-ID:  <1354710777.97879.YahooMailClassic@web121602.mail.ne1.yahoo.com>
In-Reply-To: <20121205112511.Q932@besplex.bde.org>

--- On Tue, 12/4/12, Bruce Evans <brde@optusnet.com.au> wrote:

> From: Bruce Evans <brde@optusnet.com.au>
> Subject: Re: Latency issues with buf_ring
> To: "Andre Oppermann" <oppermann@networx.ch>
> Cc: "Adrian Chadd" <adrian@FreeBSD.org>, "Barney Cordoba" <barney_cordoba@yahoo.com>, "John Baldwin" <jhb@FreeBSD.org>, freebsd-net@FreeBSD.org
> Date: Tuesday, December 4, 2012, 10:31 PM
> On Tue, 4 Dec 2012, Andre Oppermann
> wrote:
> 
> > For most if not all ethernet drivers from 100Mbit/s the
> TX DMA rings
> > are so large that buffering at the IFQ level doesn't
> make sense anymore
> > and only adds latency.
> 
> I found sort of the opposite for bge at 1Gbps.  Most or
> all bge NICs
> have a tx ring size of 512.  The ifq length is the tx
> ring size minus
> 1 (511).  I needed to expand this to imax(2 * tick / 4,
> 10000) to
> maximize pps.  This does bad things to latency and
> worse things to
> caching (512 buffers might fit in the L2 cache, but 10000
> buffers
> bust any reasonable cache as they are cycled through), but I only
> tried to optimize tx pps.
> 
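
For reference, that 511-entry ifq comes straight from the usual attach-time
sizing, where the software send queue is set to one less than the hardware
TX ring.  A rough sketch of the pattern (bge-like names, the legacy
ifq/if_start API; illustrative, not copied from the real driver):

	/*
	 * Typical attach-time ifq sizing (sketch): the software queue is
	 * BGE_TX_RING_CNT - 1 = 511 entries, so it adds almost nothing on
	 * top of the 512-entry hardware ring.  Bruce's experiment amounts
	 * to raising ifq_drv_maxlen to something like
	 * imax(2 * tick / 4, 10000) instead.
	 */
	IFQ_SET_MAXLEN(&ifp->if_snd, BGE_TX_RING_CNT - 1);
	ifp->if_snd.ifq_drv_maxlen = BGE_TX_RING_CNT - 1;
	IFQ_SET_READY(&ifp->if_snd);
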
> > So it could simply directly put everything into
> > the TX DMA and not even try to soft-queue.  If the
> TX DMA ring is full
> > ENOBUFS is returned instead of filling yet another
> queue.
> 
> That could work, but upper layers currently don't understand
> ENOBUFS
> at all, so it would work poorly now.  Also, 512 entries
> is not many,
> so even if upper layers understood ENOBUFS it is not easy
> for them to
> _always_ respond fast enough to keep the tx active, unless
> there are
> upstream buffers with many more than 512 entries. 
> There needs to be
> enough buffering somewhere so that the tx ring can be
> replenished
> almost instantly from the buffer, to handle the worst-case
> latency
> for the threads generating new (unbuffered) packets.  At
> the line rate
> of ~1.5 Mpps for 1 Gbps, the maximum latency that can be
> covered by
> 512 entries is only 340 usec.
> 
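
The ~340 usec figure is easy to reproduce: a minimal frame occupies 84 byte
times on the wire (64-byte frame plus preamble and inter-frame gap), which
is ~1.488 Mpps at 1 Gbps, so 512 descriptors drain in about 344 usec.  A
throwaway check, assuming those standard per-frame overheads:

	#include <stdio.h>

	int
	main(void)
	{
		double bps = 1e9;          /* 1 Gbps line rate */
		double bits = 84 * 8;      /* min frame + preamble + IFG */
		double pps = bps / bits;   /* ~1.488e6 packets/sec */

		printf("line rate: %.3f Mpps\n", pps / 1e6);
		printf("512-entry ring drains in %.0f usec\n",
		    512 / pps * 1e6);
		return (0);
	}
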
> > However there
> > are ALTQ interactions and other mechanisms which have
> to be considered
> > too making it a bit more involved.
> 
> I didn't try to handle ALTQ or even optimize for TCP.
> 
> More details: to maximize pps, the main detail is to ensure
> that the tx
> ring never becomes empty.  The tx then transmits as
> fast as possible.
> This requires some watermark processing, but FreeBSD has
> almost none
> for tx rings.  The following normally happens for
> packet generators
> like ttcp and netsend:
> 
> - loop calling send() or sendto() until the tx ring (and
> also any
>   upstream buffers) fill up.  Then ENOBUFS is
> returned.
> 
> - watermark processing is broken in the user API at this
> point.  There
>   is no way for the application to wait for the ENOBUFS
> condition to
>   go away (select() and poll() don't work). 
> Applications use poor
>   workarounds:
> 
> - old (~1989) ttcp sleeps for 18 msec when send() returns
> ENOBUFS.  This
>   was barely good enough for 1 Mbps ethernet (line rate
> ~1500 pps is 27
>   per 18 msec, so IFQ_MAXLEN = 50 combined with just a
> 1-entry tx ring
>   provides a safety factor of about 2).  Expansion
> of the tx ring size to
>   512 makes this work with 10 Mbps ethernet too. 
> Expansion of the ifq
>   to 511 gives another factor of 2.  After losing
> the safety factor of 2,
>   we can now handle 40 Mbps ethernet, and are only a
> factor of 25 short
>   for 1 Gbps.  My hardware can't do line rate for
> small packets -- it
>   can only do 640 kpps.  Thus ttcp is only a
> factor of 11 short of
>   supporting the hardware at 1 Gbps.
> 
>   This assumes that sleeps of 18 msec are actually
> possible, which
>   they aren't with HZ = 100 giving a granularity of 10
> msec so that
>   sleep(18 msec) actually sleeps for an average of 23
> msec.  -current
>   uses the bad default of HZ = 1000.  With that
> sleep(18 msec) would
>   average 18.5 msec.  Of course, ttcp should sleep
> for more like 1
>   msec if that is possible.  Then the average
> sleep is 1.5 msec.  ttcp
>   can keep up with the hardware with that, and is only
> slightly behind
>   the hardware with the worst-case sleep of 2 msec
> (512+511 packets
>   generated every 2 msec is 511.5 kpps).
> 
>   I normally use old ttcp, except I modify it to sleep
> for 1 msec instead
>   of 18 in one version, and in another version I remove
> the sleep so that
>   it busy-waits in a loop that calls send() which
> almost always returns
>   ENOBUFS.  The latter wastes a lot of CPU, but is
> almost good enough
>   for throughput testing.
> 
> - newer ttcp tries to program the sleep time in
> microseconds.  This doesn't
>   really work, since the sleep granularity is normally
> at least a millisecond,
>   and even if it could be the 340 microseconds needed
> by bge with no ifq
>   (see above, and better divide the 340 by 2), then
> this is quite short
>   and would take almost as much CPU as
> busy-waiting.  I consider HZ = 1000
>   to be another form of polling/busy-waiting and don't
> use it except for
>   testing.
> 
> - netrate/netsend also uses a programmed sleep time. 
> This doesn't really
>   work, as above.  netsend also tries to limit its
> rate based on sleeping.
>   This is further from working, since even
> finer-grained sleeps are needed
>   to limit the rate accurately than to keep up with the
> maximum rate.
> 
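
Putting the workarounds from the list above into code, the modified-ttcp
behaviour is essentially the loop below: call sendto() until it returns
ENOBUFS, then either busy-retry or sleep about 1 msec.  A minimal sketch
(UDP, placeholder address and port, not the actual ttcp/netsend source):

	#include <sys/types.h>
	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <arpa/inet.h>
	#include <errno.h>
	#include <string.h>
	#include <unistd.h>

	int
	main(void)
	{
		struct sockaddr_in sin;
		char payload[18];	/* 18-byte UDP payload -> 64-byte frame */
		int s;

		s = socket(AF_INET, SOCK_DGRAM, 0);
		if (s < 0)
			return (1);
		memset(&sin, 0, sizeof(sin));
		sin.sin_family = AF_INET;
		sin.sin_port = htons(7777);			/* placeholder */
		sin.sin_addr.s_addr = inet_addr("192.168.0.2");	/* placeholder */
		memset(payload, 0, sizeof(payload));

		for (;;) {
			if (sendto(s, payload, sizeof(payload), 0,
			    (struct sockaddr *)&sin, sizeof(sin)) < 0) {
				if (errno == ENOBUFS) {
					usleep(1000);	/* ~1 msec, HZ-limited */
					continue;
				}
				break;			/* real error */
			}
		}
		close(s);
		return (0);
	}
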
> Watermark processing at the kernel level is not quite as
> broken.  It
> is mostly non-existent, but partly works, sort of
> accidentally.  The
> difference is now that there is a tx "eof" or "completion"
> interrupt
> which indicates the condition corresponding to the ENOBUFS
> condition
> going away, so that the kernel doesn't have to poll for
> this.  This
> is not really an "eof" interrupt (unless bge is programmed
> insanely,
> to interrupt only after the tx ring is completely
> empty).  It acts as
> primitive watermarking.  bge can be programmed to
> interrupt after
> having sent every N packets (strictly, after every N buffer
> descriptors,
> but for small packets these are the same).  When there
> are more than
> N packets to start, say M, this acts as a watermark at M-N
> packets.
> bge is normally misprogrammed with N = 10.  At the line
> rate of 1.5 Mpps,
> this asks for an interrupt rate of 150 kHz, which is far too
> high and
> is usually unreachable, so reaching the line rate is
> impossible due to
> the CPU load from the interrupts.  I use N = 384 or 256
> so that the
> interrupt rate is not the dominant limit.  However, N =
> 10 is better
> for latency and works under light loads.  It also
> reduces the amount
> of buffering needed.
> 
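
The interrupt-rate arithmetic in that paragraph is worth spelling out: with
an interrupt requested every N transmitted descriptors, the interrupt rate
is simply pps / N, and with M packets queued the effective watermark sits at
M - N.  A quick check at the ~1.488 Mpps small-packet line rate (N values
taken from the text):

	#include <stdio.h>

	int
	main(void)
	{
		double pps = 1.488e6;	/* ~line rate for minimal frames, 1 Gbps */
		int n[] = { 10, 256, 384 };
		int i;

		for (i = 0; i < 3; i++)
			printf("interrupt every %3d pkts -> %.1f kHz\n",
			    n[i], pps / n[i] / 1e3);
		return (0);
	}

N = 10 gives ~149 kHz, matching the "far too high" 150 kHz above; N = 256
or 384 brings it down to a few kHz.
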
> The ifq works more as part of accidentally watermarking than
> as a buffer.
> It is the same size as the tx ring (actually 1 smaller for
> bogus reasons),
> so it is not really useful as a buffer.  However, with
> no explicit
> watermarking, any separate buffer like the ifq provides a
> sort of
> watermark at the boundary between the buffers.  The
> usefulness of this
> would be most obvious if the tx "eof" interrupt were actually
> for eof
> (perhaps that is what it was originally).  Then on the
> eof interrupt,
> there is no time at all to generate new packets, and the
> time when the
> tx is idle can be minimized by keeping pre-generated packets
> handy where
> they can be copied to the tx ring at tx "eof" interrupt
> time.  A buffer
> of about the same size as the tx ring (or maybe 1/4 the size)
> is enough for this.
> 
> OTOH, with bge misprogrammed to interrupt after every 10 tx
> packets, the
> ifq is useless for its watermark purposes.  The
> watermark is effectively
> in the tx ring, and very strangely placed there at 10 below
> the top
> (ring full).  Normally tx watermarks are placed near
> the bottom (ring
> empty).  They must not be placed too near the bottom,
> else there would
> not be enough time to replenish the ring between the time
> when the "eof"
> (really, the "watermark") interrupt is received and when the
> tx runs
> dry.  They should not be placed too near the top like
> they are in -current's
> bge, else the point of having a large tx ring is defeated
> and there are
> too many interrupts.  However, when they are placed
> near the top, latency requirements are reduced.
> 
> I recently worked on buffering for sio and noticed similar
> related
> problems for tx watermarks.  Don't laugh -- serial i/o
> 1 character at
> a time at 3.686400 Mbps has much the same timing
> requirements as
> ethernet i/o 1 packet at a time at 1 Gbps.  Each serial
> character
> takes ~2.7 usec and each minimal ethernet packet takes ~0.67
> usec.
> With tx "ring" sizes of 128 and 512 respectively, the ring
> times for
> full to empty are 347 usec for serial i/o and 341 usec for
> ethernet i/o.
> Strangely, tx is harder than rx because:
> - perfection is possible and easier to measure for tx. 
> It consists of
>   just keeping at least 1 entry in the tx ring at all
> times.  Latency
>   must be kept below ~340 usec to have any chance of
> this.  This is not
>   so easy to achieve under _all_ loads.
> - for rx, you have an external source generating the
> packets, so you
>   don't have to worry about latency affecting the
> generators.
> - the need for watermark processing is better known for rx,
> since it
>   obviously doesn't work to generate the rx "eof"
> interrupt near the
>   top.
> The serial timing was actually harder to satisfy, because I
> worked on
> it on a 366 MHz CPU while I worked on bge on a 2 GHz CPU,
> and even the
> 2GHz CPU couldn't keep up with line rate (so from full to
> empty takes
> 800 usec).
> 
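
The timing comparison checks out: assuming 8N1 framing, one character at
3.6864 Mbps is 10 bit times and one minimal ethernet frame at 1 Gbps is
about 672 bit times, so both "rings" drain on the same ~340-350 usec scale.
A quick verification of the per-unit times:

	#include <stdio.h>

	int
	main(void)
	{
		double ch = 10.0 / 3686400 * 1e6;	/* usec per 8N1 char */
		double pkt = 672.0 / 1e9 * 1e6;		/* usec per min frame */

		printf("sio: %.2f usec/char, 128 chars = %.0f usec\n",
		    ch, 128 * ch);
		printf("eth: %.2f usec/pkt,  512 pkts  = %.0f usec\n",
		    pkt, 512 * pkt);
		return (0);
	}

This prints ~347 usec for sio and ~344 usec for ethernet, within rounding of
the 347/341 usec quoted above.
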
> It turned out that the best position for the tx low
> watermark is about
> 1/4 or 1/2 from the bottom for both sio and bge.  It
> must be fairly
> high, else the latency requirements are not met.  In
> the middle is a
> good general position.  Although it apparently "wastes"
> half of the ring
> to make the latency requirements easier to meet (without
> very
> system-dependent tuning), the efficiency lost from this is
> reasonably
> small.
> 
> Bruce
> 
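On the watermark placement: the corresponding driver-side pattern is to
reclaim descriptors in the tx completion handler and only restart the
software queue once the ring has drained to a low watermark near the middle,
rather than 10 below the top.  A non-compilable sketch with made-up "xx"
names (not actual bge code), using the old IFF_DRV_OACTIVE/ifq interface:

	#define XX_TX_RING_CNT		512
	#define XX_TX_LO_WATERMARK	(XX_TX_RING_CNT / 2)	/* middle of ring */

	static void
	xx_txeof(struct xx_softc *sc)
	{
		struct ifnet *ifp = sc->xx_ifp;

		/* ... reclaim completed descriptors, bumping sc->xx_tx_free ... */

		if (sc->xx_tx_free >= XX_TX_LO_WATERMARK) {
			ifp->if_drv_flags &= ~IFF_DRV_OACTIVE;
			if (!IFQ_DRV_IS_EMPTY(&ifp->if_snd))
				xx_start_locked(ifp);	/* refill from the ifq */
		}
	}
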

I'm sure that Bill Paul is a nice man, but referencing drivers that were
written from a template and never properly load tested doesn't really
illustrate anything. All of his drivers are functional but optimized for
nothing.

BC

