Date:      Tue, 21 Jul 2015 00:17:58 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        net@freebsd.org
Subject:   minimizing network latency
Message-ID:  <20150720214422.W836@besplex.bde.org>

Minimizing network latency is important for minimizing build times on nfs.
My main benchmark is makeworld of a version of FreeBSD-5.  This currently
takes about 130 seconds on an i5-4690K system, depending mainly on
overclocking and network latency.  Makeworld does about 800000 RPCs, so
every microsecond of network latency costs about 0.8 seconds with -j1.
-j16 reduces this to a fraction of the 0.8 seconds, but the fraction is
much larger than 1/16: more like 1/4 or 1/2.

Untuned systems usually have very bad network latency even locally, due
to interrupt moderation.  On my systems, defaults give 291 microseconds
as measured by ping -fq.  291 would give a makeworld -j16 time of at least
1/4 * 0.8 * 291 = 58 seconds just for the part that waits for the network.
I didn't test with this misconfiguration.

My network hardware is:

Client, usual configuration:
em0: <Intel(R) PRO/1000 Network Connection 7.4.2> port 0xf080-0xf09f mem 0xdfd00000-0xdfd1ffff,0xdfd3c000-0xdfd3cfff irq 20 at device 25.0 on pci0

dmesg doesn't give enough details.  This is an I218V-mumble on pcie.

Client, another configuration:
em1: <Intel(R) PRO/1000 Legacy Network Connection 1.0.6> port 0xe000-0xe03f mem 0xdfc40000-0xdfc5ffff,0xdfc20000-0xdfc3ffff irq 16 at device 1.0 on pci4

This is an ~8-year-old card on pci33.  It has lower latency than the newer
card.

Server:
bge0: <Broadcom BCM5701 Gigabit Ethernet, ASIC rev. 0x105> mem 0xe3000000-0xe300ffff irq 5 at device 10.0 on pci0

This is a ~13-year-old pci-x card on pci33.  It is higher-end than the Intel
cards and has lower latency.

I use large modifications to the bge driver, mainly to tune its latency
and throughput for small packets.  It is limited by the pci33 hardware
and the old 2GHz CPU to 640 kpps in my version.  In versions unmodified
except for interrupt moderation, it is limited to 300-500 kpps (faster
in old versions of FreeBSD).

bge's default interrupt moderation is:

     dev.bge.0.rx_coal_ticks: 150
     dev.bge.0.tx_coal_ticks: 150
     dev.bge.0.rx_max_coal_bds: 10
     dev.bge.0.tx_max_coal_bds: 10

where these values are in sysctl form but are hard-coded in unmodified
versions.  This is very bad.  The "ticks" values give a maximum latency
of 150 microseconds.  This is too large for short bursts of packets.
The "bds" limits (buffer descriptors; 1 or 2 of these per packet) fire
an interrupt after at most 10 bds.  This is too small for efficiency (by
a factor of about 20).  It gives minimal latency for long bursts of
packets.  But it is useless for short bursts of packets, as generated
by ping -fq and probably by nfs RPCs.
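
(On my modified driver these show up as ordinary sysctls, so they can be
listed with something like the following; stock drivers hard-code the
equivalent values, as noted above.)

     # list the coalescing knobs shown above (modified driver only)
     sysctl dev.bge.0 | grep coal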

ping -fq doesn't actually flood, except by accidental synchronization
with buffering.  It tries to send bursts of length 1 and wait for the
reply, except it sends an extra packet without waiting if the reply
doesn't come back in 10 milliseconds.  Sometimes buffering and/or
interrupt moderation delays replies so that they arrive in bursts.
Sometimes the burst length builds up to the length of the output
buffers.  Then throughput increases, but so does latency.
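
The latencies quoted here are the average round-trip times reported by
ping -fq; a typical measurement (the count and hostname are just
examples) looks like:

     # average RTT over 100000 echo requests; -f sends the next request
     # as soon as a reply arrives, or after 10 ms if it doesn't
     ping -fq -c 100000 server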

800000 RPCs for makeworld is a lot by some measures but not by others.
Over 130 seconds it is just 6 kpps in each direction.  Its average
inter-packet time is 162 microseconds.  This is > 150, so its
average latency is about 150 microseconds.  The "bds" limits are
useless since 10 bds take an average of 1620 microseconds.  The final
ping -fq average latency of 291 is about 150 from bge, 125 from the
corresponding limit in em, and a few extra for doing the non-waiting
parts.
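
Both figures follow directly from the totals; as shell arithmetic:

     echo $((800000 / 130))              # ~6 k RPCs/second each way
     echo $((130 * 1000000 / 800000))    # ~162 us average inter-packet time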

My normal configuration is:

     dev.bge.0.tx_coal_ticks: 1000000
     dev.bge.0.tx_max_coal_bds: 384
     dev.bge.0.dyncoal_max_intr_freq: 10000

Here the tx limits are essentially infinity.  This reduces tx interrupt
load at no cost to latency.  The rx limits are dynamic.
dyncoal_max_intr_freq=10000 is in software.  It works much like em's
itr limit, but slightly better.  Under light load, there is no interrupt
moderation for rx.  Under heavy load, it is rate-limited to the specified
frequency.
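
In sysctl-command form (these knobs only exist with my bge modifications)
this configuration is just:

     # effectively-infinite tx moderation; dynamic rx moderation capped
     # at 10000 interrupts/second
     sysctl dev.bge.0.tx_coal_ticks=1000000
     sysctl dev.bge.0.tx_max_coal_bds=384
     sysctl dev.bge.0.dyncoal_max_intr_freq=10000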

ping -fq should give heavy load, but actually gives light load, because it
does not actually flood and because of interrupt moderation on the sender.
The above
configuration combined with the default em configuration gives an average
ping latency of 122 microseconds.  122 is from em's itr being 125
microseconds (frequency 8000).

em's default interrupt moderation is:

     dev.em.1.itr: 488
     dev.em.1.itr: interrupt delay limit in usecs/4
     dev.em.1.tx_abs_int_delay: 66
     dev.em.1.rx_abs_int_delay: 66
     dev.em.1.tx_int_delay: 66
     dev.em.1.rx_int_delay: 0

     dev.em.0.itr: 125
     dev.em.0.itr: interrupt delay limit in usecs
     dev.em.0.tx_abs_int_delay: 66
     dev.em.0.rx_abs_int_delay: 66
     dev.em.0.tx_int_delay: 66
     dev.em.0.rx_int_delay: 0
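
The two itr lines for each device are the value and the sysctl
description of the same OID, i.e. roughly the combined output of:

     sysctl dev.em.0.itr        # current value
     sysctl -d dev.em.0.itr     # description string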

Here the itrs differ because their sysctl is broken in -current and I
only fixed this in if_em.c (em.1 uses if_lem.c).  The limit on the
interrupt frequency is supposed to be 8000 Hz in both.  This is
represented by the hardware as a period of 125 microseconds times
a scale factor of approximately 4, so the raw hardware value is
approximately 500 and actually 488.  The raw hardware value is
exposed by the sysctl and misdescribed as being usecs/4.  It is
actually usecs *times* 4 scaled by 1000/1024 and rounded down = 488.
The read and write sysctls work right provided you ignore the initial
value and the documentation.  They do the reverse conversion, so only
work right with units of microseconds.  So if you write back the initial
value of 488, you change the setting from 125 microseconds to 488
microseconds.  You don't get the initial setting of 125 microseconds,
or the documented setting of 488*4 = 1952 microseconds or the mis-scaled
setting of 488/4 = 122 microseconds.
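
The conversion is easy to check; the register evidently counts in units
of 1.024/4 = 0.256 microseconds:

     # raw itr register value for a 125 microsecond (8000 Hz) limit
     echo $((125 * 4 * 1000 / 1024))     # prints 488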

Other bugs in em sysctls: despite (or because of) using large code to
set the values at the time of the sysctl, the settings are not made
for up/down or when ifconfig is used to change an unrelated setting.
The sysctls still report whatever they were set to, but the hardware
has been reprogrammed to default values.  (There are bogus tunables
for the defaults.  Supporting these and all combinations takes even
more code.)  My bge sysctls work better.  They are just SYSCTL_INT()s.
Then another sysctl or up/down is used to write the accumulated sysctl
settings to the hardware.  This allows changing all the settings at
once.  The change is a heavyweight operation, and even if individual
settings can be changed one at a time such changes may require delicate
ordering to avoid going through combinations that don't work.
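
So with bge, a reconfiguration is roughly: stage the values with the
plain sysctls, then apply them all at once, e.g. with an up/down:

     # stage new coalescing values (nothing is written to the chip yet)
     sysctl dev.bge.0.rx_coal_ticks=1 dev.bge.0.rx_max_coal_bds=1
     # then reprogram the hardware with everything in one operation
     ifconfig bge0 down && ifconfig bge0 up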

Note that rx_int_delay is already 0 for em.  This matches my dynamic
bge tuning (rx_coal_ticks and rx_max_coal_bds are actually 1 for that;
rx_coal_ticks is a maximum corresponding to rx_abs_int_delay and
rx_max_coal_bds is a maximum that only partially corresponds to
rx_int_delay since that is a minimum).  The code has a comment saying
that rx_int_delay is set to 0 to avoid bugs, but I think this setting
is needed more to allow the itr setting to work.

To minimize latency, I kill all rx interrupt moderation using the
sysctls:

     # tx moderation is left at nearly infinity for bge since that works
     # right for bge.
     sysctl dev.bge.0.rx_coal_ticks=1
     sysctl dev.bge.0.tx_coal_ticks=1000000
     sysctl dev.bge.0.rx_max_coal_bds=1
     sysctl dev.bge.0.tx_max_coal_bds=256

     # Setting itr to 0 is enough.  The other settings are for variations
     # when setting itr to a small value.
     sysctl dev.em.0.rx_int_delay=0
     sysctl dev.em.0.rx_abs_int_delay=66
     sysctl dev.em.0.tx_int_delay=66666
     sysctl dev.em.0.tx_abs_int_delay=66666
     sysctl dev.em.0.itr=0

     # I didn't try so many settings for em1.
     sysctl dev.em.1.rx_int_delay=0
     sysctl dev.em.1.rx_abs_int_delay=0
     sysctl dev.em.1.tx_int_delay=0
     sysctl dev.em.1.tx_abs_int_delay=0
     sysctl dev.em.1.itr=0

This tuning reduces the ping -fq latency from 122 to 50 microseconds for
em1, but only from 122 to 74 microseconds for em0.

These times are with a low-end switch.  In previous tests with different
low-end switches, the switch seemed to make little difference.  But
it makes a big difference with em0.  Using a direct connection reduces
the latency by 24 microseconds (to 50) for em0, but only by 6 (to 44)
for em1.

In previous experiments, I got a latency of 30 or 36 microseconds for
bge <-> em1 using DEVICE_POLLING (ick) when em1 was in a slower system.
IIRC, there was a switch in between.  I now get 25 microseconds for
the best combination of bge <-> em1 using DEVICE_POLLING with em1 in
a much faster system (but still on pci33) and no switch in between.

DEVICE_POLLING must use poll_in_idle to give low latency.  This almost
literally burns overclocked cores.  I fixed some bugs in DEVICE_POLLING
so that this doesn't count in the load average, and made it use cpu_spinwait().
cpu_spinwait() reduces the burning significantly and the idle polling
works very well if there is a core to spare.  It even works OK for
makeworld since although there shouldn't be a core to spare, there is
when the build stalls waiting for RPCs and then the best thing to do
is burn a core waiting for them as fast as possible.
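
For reference, enabling this (with the stock knob names) is roughly:

     # kernel config needs: options DEVICE_POLLING
     ifconfig em1 polling                # poll instead of taking interrupts
     sysctl kern.polling.idle_poll=1     # also poll from the idle loop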

em0 supports tso4.  Turning it off made no difference.
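
Turning it off is just:

     ifconfig em0 -tso4          # disable TCP segmentation offload (IPv4)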

Questions:

Why is the pcie hardware slower?

Why does the switch make more difference for the pcie hardware?

Why is the latency so large even in the best case?  I think it is
mostly in the hardware.  localhost ping latency is about 2 microseconds
(also very bad, but much smaller than 25).  Under load, bge achieves a
throughput of 640 kpps for minimal-sized udp packets.  That is 1.6
microseconds between packets.  So it must be able to handle a packet
in 1.6 microseconds, but it or the other side apparently takes at
least 12 microseconds each to make the change visible to the OS,
even when both sides are spinning polling the device for activity.

Bruce


