Date:      Wed, 5 Dec 2012 04:32:57 -0800 (PST)
From:      Barney Cordoba <barney_cordoba@yahoo.com>
To:        Andre Oppermann <oppermann@networx.ch>, Bruce Evans <brde@optusnet.com.au>
Cc:        freebsd-net@FreeBSD.org, Adrian Chadd <adrian@FreeBSD.org>, John Baldwin <jhb@FreeBSD.org>
Subject:   Re: Latency issues with buf_ring
Message-ID:  <1354710777.97879.YahooMailClassic@web121602.mail.ne1.yahoo.com>
In-Reply-To: <20121205112511.Q932@besplex.bde.org>

--- On Tue, 12/4/12, Bruce Evans <brde@optusnet.com.au> wrote:

> From: Bruce Evans <brde@optusnet.com.au>
> Subject: Re: Latency issues with buf_ring
> To: "Andre Oppermann" <oppermann@networx.ch>
> Cc: "Adrian Chadd" <adrian@FreeBSD.org>, "Barney Cordoba" <barney_cordoba@yahoo.com>, "John Baldwin" <jhb@FreeBSD.org>, freebsd-net@FreeBSD.org
> Date: Tuesday, December 4, 2012, 10:31 PM
> On Tue, 4 Dec 2012, Andre Oppermann
> wrote:
> 
> > For most if not all ethernet drivers from 100Mbit/s the
> TX DMA rings
> > are so large that buffering at the IFQ level doesn't
> make sense anymore
> > and only adds latency.
> 
> I found sort of the opposite for bge at 1Gbps.  Most or
> all bge NICs
> have a tx ring size of 512.  The ifq length is the tx
> ring size minus
> 1 (511).  I needed to expand this to imax(2 * tick / 4,
> 10000) to
> maximize pps.  This does bad things to latency and
> worse things to
> caching (512 buffers might fit in the L2 cache, but 10000
> buffers
> bust any reasonable cache as they are cycled through), but I only
> tried to optimize tx pps.
> 
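
For reference, that 511-entry ifq comes straight from the usual attach-time
sizing, where the software send queue is set to one less than the hardware
TX ring.  A rough sketch of the pattern (bge-like names, the legacy
ifq/if_start API; illustrative, not copied from the real driver):

	/*
	 * Typical attach-time ifq sizing (sketch): the software queue is
	 * BGE_TX_RING_CNT - 1 = 511 entries, so it adds almost nothing on
	 * top of the 512-entry hardware ring.  Bruce's experiment amounts
	 * to raising ifq_drv_maxlen to something like
	 * imax(2 * tick / 4, 10000) instead.
	 */
	IFQ_SET_MAXLEN(&ifp->if_snd, BGE_TX_RING_CNT - 1);
	ifp->if_snd.ifq_drv_maxlen = BGE_TX_RING_CNT - 1;
	IFQ_SET_READY(&ifp->if_snd);
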
> > So it could simply directly put everything into
> > the TX DMA and not even try to soft-queue.  If the
> TX DMA ring is full
> > ENOBUFS is returned instead of filling yet another
> queue.
> 
> That could work, but upper layers currently don't understand
> ENOBUFS
> at all, so it would work poorly now.  Also, 512 entries
> is not many,
> so even if upper layers understood ENOBUFS it is not easy
> for them to
> _always_ respond fast enough to keep the tx active, unless
> there are
> upstream buffers with many more than 512 entries. 
> There needs to be
> enough buffering somewhere so that the tx ring can be
> replenished
> almost instantly from the buffer, to handle the worst-case
> latency
> for the threads generating new (unbuffered) packets.  At
> the line rate
> of ~1.5 Mpps for 1 Gbps, the maximum latency that can be
> covered by
> 512 entries is only 340 usec.
> 
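
The ~340 usec figure is easy to reproduce: a minimal frame occupies 84 byte
times on the wire (64-byte frame plus preamble and inter-frame gap), which
is ~1.488 Mpps at 1 Gbps, so 512 descriptors drain in about 344 usec.  A
throwaway check, assuming those standard per-frame overheads:

	#include <stdio.h>

	int
	main(void)
	{
		double bps = 1e9;          /* 1 Gbps line rate */
		double bits = 84 * 8;      /* min frame + preamble + IFG */
		double pps = bps / bits;   /* ~1.488e6 packets/sec */

		printf("line rate: %.3f Mpps\n", pps / 1e6);
		printf("512-entry ring drains in %.0f usec\n",
		    512 / pps * 1e6);
		return (0);
	}
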
> > However there
> > are ALTQ interactions and other mechanisms which have
> to be considered
> > too making it a bit more involved.
> 
> I didn't try to handle ALTQ or even optimize for TCP.
> 
> More details: to maximize pps, the main detail is to ensure
> that the tx
> ring never becomes empty.  The tx then transmits as
> fast as possible.
> This requires some watermark processing, but FreeBSD has
> almost none
> for tx rings.  The following normally happens for
> packet generators
> like ttcp and netsend:
> 
> - loop calling send() or sendto() until the tx ring (and
> also any
>   upstream buffers) fill up.  Then ENOBUFS is
> returned.
> 
> - watermark processing is broken in the user API at this
> point.  There
>   is no way for the application to wait for the ENOBUFS
> condition to
>   go away (select() and poll() don't work). 
> Applications use poor
>   workarounds:
> 
> - old (~1989) ttcp sleeps for 18 msec when send() returns
> ENOBUFS.  This
>   was barely good enough for 1 Mbps ethernet (line rate
> ~1500 pps is 27
>   per 18 msec, so IFQ_MAXLEN = 50 combined with just a
> 1-entry tx ring
>   provides a safety factor of about 2).  Expansion
> of the tx ring size to
>   512 makes this work with 10 Mbps ethernet too. 
> Expansion of the ifq
>   to 511 gives another factor of 2.  After losing
> the safety factor of 2,
>   we can now handle 40 Mbps ethernet, and are only a
> factor of 25 short
>   for 1 Gbps.  My hardware can't do line rate for
> small packets -- it
>   can only do 640 kpps.  Thus ttcp is only a
> factor of 11 short of
>   supporting the hardware at 1 Gbps.
> 
>   This assumes that sleeps of 18 msec are actually
> possible, which
>   they aren't with HZ = 100 giving a granularity of 10
> msec so that
>   sleep(18 msec) actually sleeps for an average of 23
> msec.  -current
>   uses the bad default of HZ = 1000.  With that
> sleep(18 msec) would
>   average 18.5 msec.  Of course, ttcp should sleep
> for more like 1
>   msec if that is possible.  Then the average
> sleep is 1.5 msec.  ttcp
>   can keep up with the hardware with that, and is only
> slightly behind
>   the hardware with the worst-case sleep of 2 msec
> (512+511 packets
>   generated every 2 msec is 511.5 kpps).
> 
>   I normally use old ttcp, except I modify it to sleep
> for 1 msec instead
>   of 18 in one version, and in another version I remove
> the sleep so that
>   it busy-waits in a loop that calls send() which
> almost always returns
>   ENOBUFS.  The latter wastes a lot of CPU, but is
> almost good enough
>   for throughput testing.
> 
> - newer ttcp tries to program the sleep time in
> microseconds.  This doesn't
>   really work, since the sleep granularity is normally
> at least a millisecond,
>   and even if it could be the 340 microseconds needed
> by bge with no ifq
>   (see above, and better divide the 340 by 2), then
> this is quite short
>   and would take almost as much CPU as
> busy-waiting.  I consider HZ = 1000
>   to be another form of polling/busy-waiting and don't
> use it except for
>   testing.
> 
> - netrate/netsend also uses a programmed sleep time. 
> This doesn't really
>   work, as above.  netsend also tries to limit its
> rate based on sleeping.
>   This is further from working, since even
> finer-grained sleeps are needed
>   to limit the rate accurately than to keep up with the
> maximum rate.
> 
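
Putting the workarounds from the list above into code, the modified-ttcp
behaviour is essentially the loop below: call sendto() until it returns
ENOBUFS, then either busy-retry or sleep about 1 msec.  A minimal sketch
(UDP, placeholder address and port, not the actual ttcp/netsend source):

	#include <sys/types.h>
	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <arpa/inet.h>
	#include <errno.h>
	#include <string.h>
	#include <unistd.h>

	int
	main(void)
	{
		struct sockaddr_in sin;
		char payload[18];	/* 18-byte UDP payload -> 64-byte frame */
		int s;

		s = socket(AF_INET, SOCK_DGRAM, 0);
		if (s < 0)
			return (1);
		memset(&sin, 0, sizeof(sin));
		sin.sin_family = AF_INET;
		sin.sin_port = htons(7777);			/* placeholder */
		sin.sin_addr.s_addr = inet_addr("192.168.0.2");	/* placeholder */
		memset(payload, 0, sizeof(payload));

		for (;;) {
			if (sendto(s, payload, sizeof(payload), 0,
			    (struct sockaddr *)&sin, sizeof(sin)) < 0) {
				if (errno == ENOBUFS) {
					usleep(1000);	/* ~1 msec, HZ-limited */
					continue;
				}
				break;			/* real error */
			}
		}
		close(s);
		return (0);
	}
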
> Watermark processing at the kernel level is not quite as
> broken.  It
> is mostly non-existent, but partly works, sort of
> accidentally.  The
> difference is now that there is a tx "eof" or "completion"
> interrupt
> which indicates the condition corresponding to the ENOBUFS
> condition
> going away, so that the kernel doesn't have to poll for
> this.  This
> is not really an "eof" interrupt (unless bge is programmed
> insanely,
> to interrupt only after the tx ring is completely
> empty).  It acts as
> primitive watermarking.  bge can be programmed to
> interrupt after
> having sent every N packets (strictly, after every N buffer
> descriptors,
> but for small packets these are the same).  When there
> are more than
> N packets to start, say M, this acts as a watermark at M-N
> packets.
> bge is normally misprogrammed with N = 10.  At the line
> rate of 1.5 Mpps,
> this asks for an interrupt rate of 150 kHz, which is far too
> high and
> is usually unreachable, so reaching the line rate is
> impossible due to
> the CPU load from the interrupts.  I use N = 384 or 256
> so that the
> interrupt rate is not the dominant limit.  However, N =
> 10 is better
> for latency and works under light loads.  It also
> reduces the amount
> of buffering needed.
> 
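
The interrupt-rate arithmetic in that paragraph is worth spelling out: with
an interrupt requested every N transmitted descriptors, the interrupt rate
is simply pps / N, and with M packets queued the effective watermark sits at
M - N.  A quick check at the ~1.488 Mpps small-packet line rate (N values
taken from the text):

	#include <stdio.h>

	int
	main(void)
	{
		double pps = 1.488e6;	/* ~line rate for minimal frames, 1 Gbps */
		int n[] = { 10, 256, 384 };
		int i;

		for (i = 0; i < 3; i++)
			printf("interrupt every %3d pkts -> %.1f kHz\n",
			    n[i], pps / n[i] / 1e3);
		return (0);
	}

N = 10 gives ~149 kHz, matching the "far too high" 150 kHz above; N = 256
or 384 brings it down to a few kHz.
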
> The ifq works more as part of accidentally watermarking than
> as a buffer.
> It is the same size as the tx ring (actually 1 smaller for
> bogus reasons),
> so it is not really useful as a buffer.  However, with
> no explicit
> watermarking, any separate buffer like the ifq provides a
> sort of
> watermark at the boundary between the buffers.  The
> usefulness of this
> would be most obvious if the tx "eof" interrupt were actually
> for eof
> (perhaps that is what it was originally).  Then on the
> eof interrupt,
> there is no time at all to generate new packets, and the
> time when the
> tx is idle can be minimized by keeping pre-generated packets
> handy where
> they can be copied to the tx ring at tx "eof" interrupt
> time.  A buffer
> of about the same size as the tx ring (or maybe 1/4 the size)
> is enough for this.
> 
> OTOH, with bge misprogrammed to interrupt after every 10 tx
> packets, the
> ifq is useless for its watermark purposes.  The
> watermark is effectively
> in the tx ring, and very strangely placed there at 10 below
> the top
> (ring full).  Normally tx watermarks are placed near
> the bottom (ring
> empty).  They must not be placed too near the bottom,
> else there would
> not be enough time to replenish the ring between the time
> when the "eof"
> (really, the "watermark") interrupt is received and when the
> tx runs
> dry.  They should not be placed too near the top like
> they are in -current's
> bge, else the point of having a large tx ring is defeated
> and there are
> too many interrupts.  However, when they are placed
> near the top, latency requirements are reduced.
> 
> I recently worked on buffering for sio and noticed similar
> related
> problems for tx watermarks.  Don't laugh -- serial i/o
> 1 character at
> a time at 3.686400 Mbps has much the same timing
> requirements as
> ethernet i/o 1 packet at a time at 1 Gbps.  Each serial
> character
> takes ~2.7 usec and each minimal ethernet packet takes ~0.67
> usec.
> With tx "ring" sizes of 128 and 512 respectively, the ring
> times for
> full to empty are 347 usec for serial i/o and 341 usec for
> ethernet i/o.
> Strangely, tx is harder than rx because:
> - perfection is possible and easier to measure for tx. 
> It consists of
>   just keeping at least 1 entry in the tx ring at all
> times.  Latency
>   must be kept below ~340 usec to have any chance of
> this.  This is not
>   so easy to achieve under _all_ loads.
> - for rx, you have an external source generating the
> packets, so you
>   don't have to worry about latency affecting the
> generators.
> - the need for watermark processing is better known for rx,
> since it
>   obviously doesn't work to generate the rx "eof"
> interrupt near the
>   top.
> The serial timing was actually harder to satisfy, because I
> worked on
> it on a 366 MHz CPU while I worked on bge on a 2 GHz CPU,
> and even the
> 2GHz CPU couldn't keep up with line rate (so from full to
> empty takes
> 800 usec).
> 
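
The timing comparison checks out: assuming 8N1 framing, one character at
3.6864 Mbps is 10 bit times and one minimal ethernet frame at 1 Gbps is
about 672 bit times, so both "rings" drain on the same ~340-350 usec scale.
A quick verification of the per-unit times:

	#include <stdio.h>

	int
	main(void)
	{
		double ch = 10.0 / 3686400 * 1e6;	/* usec per 8N1 char */
		double pkt = 672.0 / 1e9 * 1e6;		/* usec per min frame */

		printf("sio: %.2f usec/char, 128 chars = %.0f usec\n",
		    ch, 128 * ch);
		printf("eth: %.2f usec/pkt,  512 pkts  = %.0f usec\n",
		    pkt, 512 * pkt);
		return (0);
	}

This prints ~347 usec for sio and ~344 usec for ethernet, within rounding of
the 347/341 usec quoted above.
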
> It turned out that the best position for the tx low
> watermark is about
> 1/4 or 1/2 from the bottom for both sio and bge.  It
> must be fairly
> high, else the latency requirements are not met.  In
> the middle is a
> good general position.  Although it apparently "wastes"
> half of the ring
> to make the latency requirements easier to meet (without
> very
> system-dependent tuning), the efficiency lost from this is
> reasonably
> small.
> 
> Bruce
> 
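On the watermark placement: the corresponding driver-side pattern is to
reclaim descriptors in the tx completion handler and only restart the
software queue once the ring has drained to a low watermark near the middle,
rather than 10 below the top.  A non-compilable sketch with made-up "xx"
names (not actual bge code), using the old IFF_DRV_OACTIVE/ifq interface:

	#define XX_TX_RING_CNT		512
	#define XX_TX_LO_WATERMARK	(XX_TX_RING_CNT / 2)	/* middle of ring */

	static void
	xx_txeof(struct xx_softc *sc)
	{
		struct ifnet *ifp = sc->xx_ifp;

		/* ... reclaim completed descriptors, bumping sc->xx_tx_free ... */

		if (sc->xx_tx_free >= XX_TX_LO_WATERMARK) {
			ifp->if_drv_flags &= ~IFF_DRV_OACTIVE;
			if (!IFQ_DRV_IS_EMPTY(&ifp->if_snd))
				xx_start_locked(ifp);	/* refill from the ifq */
		}
	}
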

I'm sure that Bill Paul is a nice man, but referencing drivers that were
written from a template and never properly load tested doesn't really
illustrate anything. All of his drivers are functional but optimized for
nothing.

BC

