Date: Tue, 21 Jul 2015 00:17:58 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: net@freebsd.org
Subject: minimizing network latency
Message-ID: <20150720214422.W836@besplex.bde.org>
Minimizing network latency is important for minimizing build times on nfs. My main benchmark is makeworld of a version of FreeBSD-5. This currently takes about 130 seconds on an i4690K system, depending mainly on overclocking and network latency. Makeworld does about 800000 RPCs, so every microsecond of network latency costs about 0.8 seconds with -j1. -j16 reduces this to a fraction of 0.8 that is much larger than 1/16, more like 1/4 or 1/2.

Untuned systems usually have very bad network latency even locally, due to interrupt moderation. On my systems, defaults give 291 microseconds as measured by ping -fq. 291 would give a makeworld -j16 time of at least 1/4 * 0.8 * 291 = 58 seconds just for the part that waits for the network. I didn't test with this misconfiguration.

My network hardware is:

Client, usual configuration:

em0: <Intel(R) PRO/1000 Network Connection 7.4.2> port 0xf080-0xf09f mem 0xdfd00000-0xdfd1ffff,0xdfd3c000-0xdfd3cfff irq 20 at device 25.0 on pci0

dmesg doesn't give enough details. This is an I218V-mumble on pcie.

Client, another configuration:

em1: <Intel(R) PRO/1000 Legacy Network Connection 1.0.6> port 0xe000-0xe03f mem 0xdfc40000-0xdfc5ffff,0xdfc20000-0xdfc3ffff irq 16 at device 1.0 on pci4

This is a ~8 year old card on pci33. It has lower latency than the newer card.

Server:

bge0: <Broadcom BCM5701 Gigabit Ethernet, ASIC rev. 0x105> mem 0xe3000000-0xe300ffff irq 5 at device 10.0 on pci0

This is a ~13 year old pci-x card on pci33. It is higher-end than the Intel cards and has lower latency.

I use large modifications to the bge driver, mainly to tune its latency and throughput for small packets. It is limited by the pci33 hardware and the old 2GHz CPU to 640 kpps in my version. In versions unmodified except for interrupt moderation, it is limited to 300-500 kpps (faster in old versions of FreeBSD).

bge's default interrupt moderation is:

dev.bge.0.rx_coal_ticks: 150
dev.bge.0.tx_coal_ticks: 150
dev.bge.0.rx_max_coal_bds: 10
dev.bge.0.tx_max_coal_bds: 10

where these values are in sysctl form but are hard-coded in unmodified versions. This is very bad. The "ticks" values give a maximum latency of 150 microseconds. This is too large for short bursts of packets. The "bds" values (buffer descriptors; 1 or 2 of these per packet) give a maximum latency of 10 bds. This is too small for efficiency (by a factor of about 20). It gives minimal latency for long bursts of packets, but it is useless for short bursts of packets, as generated by ping -fq and probably by nfs RPCs.

ping -fq doesn't actually flood, except by accidental synchronization with buffering. It tries to send bursts of length 1 and wait for the reply, except it sends an extra packet without waiting if the reply doesn't come back in 10 milliseconds. Sometimes buffering and/or interrupt moderation delays replies so that they arrive in bursts. Sometimes the burst length builds up to the length of the output buffers. Then the throughput is increased, but so is the latency.

800000 RPCs for makeworld is a lot by some measures but not by others. Over 130 seconds it is just 6 kpps in each direction. Its average inter-packet time is 162 microseconds. This is > 150, so its average latency is about 150 microseconds. The "bds" limits are useless, since 10 bds take an average of 1620 microseconds. The final ping -fq average latency of 291 is about 150 from bge, 125 from the corresponding limit in em, and a few extra for doing the non-waiting parts.
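To make the arithmetic concrete, something like the following reproduces the measurement and the estimate (a sketch; "nfs-server" is a placeholder host name, the flood count is arbitrary, and the 1/4 factor is the -j16 estimate above):

#!/bin/sh
# Measure average round-trip latency with a bounded ping flood.
# -f needs root; -q prints only the summary line.
ping -f -q -c 100000 nfs-server

# Estimate the network-wait part of makeworld -j16 from the figures
# above: 800000 RPCs => ~0.8 s per microsecond of latency at -j1,
# scaled by ~1/4 for -j16.  For the 291 us default:
#   0.25 * 0.8 * 291 = ~58 seconds.
echo '0.25 * 0.8 * 291' | bc -l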
My normal configuration is:

dev.bge.0.tx_coal_ticks: 1000000
dev.bge.0.tx_max_coal_bds: 384
dev.bge.0.dyncoal_max_intr_freq: 10000

Here the tx limits are essentially infinity. This reduces tx interrupt load at no cost to latency. The rx limits are dynamic. dyncoal_max_intr_freq=10000 is implemented in software. It works much like em's itr limit, but slightly better. Under light load, there is no interrupt moderation for rx. Under heavy load, rx interrupts are rate-limited to the specified frequency. ping -fq should give heavy load, but actually gives light load, due to it not actually flooding and due to interrupt moderation on the sender.

The above configuration combined with the default em configuration gives an average ping latency of 122 microseconds. 122 is from em's itr being 125 microseconds (frequency 8000).

em's default interrupt moderation is:

dev.em.1.itr: 488
dev.em.1.itr: interrupt delay limit in usecs/4
dev.em.1.tx_abs_int_delay: 66
dev.em.1.rx_abs_int_delay: 66
dev.em.1.tx_int_delay: 66
dev.em.1.rx_int_delay: 0
dev.em.0.itr: 125
dev.em.0.itr: interrupt delay limit in usecs
dev.em.0.tx_abs_int_delay: 66
dev.em.0.rx_abs_int_delay: 66
dev.em.0.tx_int_delay: 66
dev.em.0.rx_int_delay: 0

Here the itrs differ because their sysctl is broken in -current and I only fixed this in if_em.c (em.1 uses if_lem.c). The limit on the interrupt frequency is supposed to be 8000 Hz in both. This is represented by the hardware as a period of 125 microseconds times a scale factor of approximately 4, so the raw hardware value is approximately 500 and actually 488. The raw hardware value is exposed by the sysctl and misdescribed as being usecs/4. It is actually usecs *times* 4, scaled by (1000/1024) and rounded down, = 488. The read and write sysctls work right provided you ignore the initial value and the documentation. They do the reverse conversion, so they only work right with units of microseconds. So if you write back the initial value of 488, you change the setting from 125 microseconds to 488 microseconds. You don't get the initial setting of 125 microseconds, or the documented setting of 488*4 = 1952 microseconds, or the mis-scaled setting of 488/4 = 122 microseconds.

Other bugs in em sysctls: despite (or because of) using large code to set the values at the time of the sysctl, the settings are not made on up/down or when ifconfig is used to change an unrelated setting. The sysctls still report whatever they were set to, but the hardware has been reprogrammed to default values. (There are bogus tunables for the defaults. Supporting these and all combinations takes even larger code.)

My bge sysctls work better. They are just SYSCTL_INT()s. Then another sysctl or up/down is used to write the accumulated sysctl settings to the hardware. This allows changing all the settings at once. The change is a heavyweight operation, and even if individual settings can be changed one at a time, such changes may require delicate ordering to avoid going through combinations that don't work.

Note that rx_int_delay is already 0 for em. This matches my dynamic bge tuning (rx_coal_ticks and rx_max_coal_bds are actually 1 for that; rx_coal_ticks is a maximum corresponding to rx_abs_int_delay, and rx_max_coal_bds is a maximum that only partially corresponds to rx_int_delay, since the latter is a minimum). The code has a comment saying that rx_int_delay is set to 0 to avoid bugs, but I think this setting is needed more to allow the itr setting to work.
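The conversion can be checked with shell arithmetic (a sketch based on the scaling described above, not code from the driver; the exact rounding in the hardware may differ):

#!/bin/sh
# itr period in microseconds -> raw hardware value:
#   raw = usecs * 4 * 1000 / 1024, rounded down
# 125 usecs gives 488, matching dev.em.1.itr above.
usecs=125
echo "raw for ${usecs} usecs: $((usecs * 4 * 1000 / 1024))"

# The sysctl handler applies the same conversion on writes, so
# writing back the displayed raw value of 488 reprograms the
# hardware as if 488 were in microseconds:
echo "raw for 488 'usecs': $((488 * 4 * 1000 / 1024))"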
To minimize latency, I kill all rx interrupt moderation using the sysctls:

# tx moderation is left at nearly infinity for bge since that works
# right for bge.
sysctl dev.bge.0.rx_coal_ticks=1
sysctl dev.bge.0.tx_coal_ticks=1000000
sysctl dev.bge.0.rx_max_coal_bds=1
sysctl dev.bge.0.tx_max_coal_bds=256
# Setting itr to 0 is enough. The other settings are for variations
# when setting itr to a small value.
sysctl dev.em.0.rx_int_delay=0
sysctl dev.em.0.rx_abs_int_delay=66
sysctl dev.em.0.tx_int_delay=66666
sysctl dev.em.0.tx_abs_int_delay=66666
sysctl dev.em.0.itr=0
# I didn't try so many settings for em1.
sysctl dev.em.1.rx_int_delay=0
sysctl dev.em.1.rx_abs_int_delay=0
sysctl dev.em.1.tx_int_delay=0
sysctl dev.em.1.tx_abs_int_delay=0
sysctl dev.em.1.itr=0

This tuning reduces the ping -fq latency from 122 to 50 microseconds for em1, but only from 122 to 74 microseconds for em0. These times are with a low-end switch. In previous tests with different low-end switches, the switch seemed to make only a little difference, but it makes a big difference with em0: using a direct connection reduces the latency by 24 microseconds (to 50) for em0, but only by 6 (to 44) for em1.

In previous experiments, I got a latency of 30 or 36 microseconds for bge <-> em1 using DEVICE_POLLING (ick) when em1 was in a slower system. IIRC, there was a switch in between. I now get 25 microseconds for the best combination of bge <-> em1 using DEVICE_POLLING, with em1 in a much faster system (but still on pci33) and no switch in between.

DEVICE_POLLING must use poll_in_idle to give low latency. This almost literally burns overclocked cores. I fixed some bugs in DEVICE_POLLING so that this doesn't count in the load average, and so that it uses cpu_spinwait(). cpu_spinwait() reduces the burning significantly, and the idle polling works very well if there is a core to spare. It even works OK for makeworld: although there shouldn't be a core to spare, there is one whenever the build stalls waiting for RPCs, and then the best thing to do is burn a core waiting for them as fast as possible.

em0 supports tso4. Turning it off made no difference.

Questions: Why is the pcie hardware slower? Why does the switch make more difference for the pcie hardware? Why is the latency so large even in the best case?

I think it is mostly in the hardware. localhost ping latency is about 2 microseconds (also very bad, but much smaller than 25). Under load, bge achieves a throughput of 640 kpps for minimal-sized udp packets. That is 1.6 microseconds between packets. So it must be able to handle a packet in 1.6 microseconds, but it or the other side apparently takes at least 12 microseconds each to make the change visible to the OS, even when both sides are spinning polling the device for activity.

Bruce
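PS: a quick way to compare the average round-trip time before and after the rx tuning (a rough sketch, not from my scripts; "nfs-server" is a placeholder host name, and the awk field assumes ping's usual summary line, "round-trip min/avg/max/stddev = .../.../.../... ms"):

#!/bin/sh
# Average round-trip time (in ms) from a bounded ping flood (needs root).
avg() {
	ping -f -q -c 100000 "$1" | awk -F/ '/round-trip/ { print $5 }'
}
before=$(avg nfs-server)
sysctl dev.em.0.rx_int_delay=0
sysctl dev.em.0.itr=0
after=$(avg nfs-server)
echo "avg round-trip: ${before} ms -> ${after} ms"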