Date: Tue, 18 Dec 2018 06:08:37 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Andrew Gallatin <gallatin@cs.duke.edu>
Cc: Slava Shwartsman <slavash@freebsd.org>, src-committers@freebsd.org,
    svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en
Message-ID: <20181218033137.Q2217@besplex.bde.org>
In-Reply-To: <9e09a2f8-d9fd-7fde-8e5a-b7c566cdb6a9@cs.duke.edu>
References: <201812051420.wB5EKwxr099242@repo.freebsd.org> <9e09a2f8-d9fd-7fde-8e5a-b7c566cdb6a9@cs.duke.edu>
On Mon, 17 Dec 2018, Andrew Gallatin wrote:

> On 12/5/18 9:20 AM, Slava Shwartsman wrote:
>> Author: slavash
>> Date: Wed Dec 5 14:20:57 2018
>> New Revision: 341578
>> URL: https://svnweb.freebsd.org/changeset/base/341578
>>
>> Log:
>>   mlx5en: Remove the DRBR and associated logic in the transmit path.
>>   The hardware queues are deep enough currently and using the DRBR and
>>   associated callbacks only leads to more task switching in the TX path.
>>   There is also a race setting the queue_state which can lead to hung
>>   TX rings.
>
> The point of DRBR in the tx path is not simply to provide a software
> ring for queuing excess packets.  Rather it provides a mechanism to
> avoid lock contention by shoving a packet into the software ring, where
> it will later be found & processed, rather than blocking the caller on
> a mtx lock.  I'm concerned you may have introduced a performance
> regression for use cases where you have N:1 or N:M lock contention where
> many threads on different cores are contending for the same tx queue.
> The state of the art for this is no longer DRBR, but mp_ring, as used
> by both cxgbe and iflib.

iflib uses queuing techniques to significantly pessimize em NICs with 1
hardware queue.  On fast machines, it attempts to do 1 context switch per
(small) tx packet and can't keep up.  On slow machines it has a chance of
handling multiple packets per context switch, but since the machine is too
slow it can't keep up and saturates at a slightly different point.
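[To make the soft-ring idea concrete for readers who have not seen the
pattern: below is a minimal sketch, in portable C11, of the contention-
avoidance idea behind drbr/mp_ring.  This is hypothetical illustration
code, not the FreeBSD implementation (the real ones live in
sys/sys/buf_ring.h and sys/net/mp_ring.c), and it omits the
producer-completion handshake a real multi-producer ring needs before the
consumer may safely read a freshly claimed slot.]

```c
#include <stdatomic.h>
#include <stddef.h>

/*
 * Simplified software transmit ring (illustration only).  The point of
 * the pattern: a producer claims a slot with one atomic op and never
 * sleeps on the tx queue mutex; a single consumer (the tx task) drains.
 */
#define SWRING_SIZE 1024		/* must be a power of two */

struct swring {
	_Atomic size_t	prod;		/* next slot a producer claims */
	_Atomic size_t	cons;		/* next slot the consumer drains */
	void		*slot[SWRING_SIZE];
};

/* Enqueue without blocking: 0 on success, -1 if full (caller drops). */
static int
swring_enqueue(struct swring *r, void *pkt)
{
	size_t p = atomic_load(&r->prod);

	for (;;) {
		if (p - atomic_load(&r->cons) >= SWRING_SIZE)
			return (-1);	/* full: fail instead of sleeping */
		/* On CAS failure, p is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&r->prod, &p, p + 1))
			break;
	}
	r->slot[p & (SWRING_SIZE - 1)] = pkt;
	return (0);
}

/* Single-consumer dequeue: NULL when empty. */
static void *
swring_dequeue(struct swring *r)
{
	size_t c = atomic_load(&r->cons);
	void *pkt;

	if (c == atomic_load(&r->prod))
		return (NULL);
	pkt = r->slot[c & (SWRING_SIZE - 1)];
	atomic_store(&r->cons, c + 1);
	return (pkt);
}
```

[The N:1 contention case Gallatin describes is then N cores calling
swring_enqueue() while one tx task dequeues; the enqueuers never block,
which is the property the commit removed.]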
Results for "netblast $lanhost 5001 5 10" (5-byte payload for 10 seconds)
on an I218V on Haswell 4 cores x 2 threads @4.08GHz running i386.

Old results, with no iflib and no EM_MULTIQUEUE except as indicated:

FBSD-10    UP    1377+0
FBSD-11    UP    1326+0
FBSD-11    SMP-1 1484+0
FBSD-11    SMP-8 1395+0
FBSD-12mod SMP-1 1386+0
FBSD-12mod SMP-8 1422+0
FBSD-12mod SMP-1 1270+0   # use iflib (lose 8% performance)
FBSD-12mod SMP-8 1279+0   # use iflib (lose 10% performance using more CPU)

1377+0 means 1377 kpps sent and 0 kpps errors, etc.  SMP-8 means use all
8 CPUs.  SMP-1 means restrict netblast to 1 CPU different from the
taskqueue CPUs using cpuset.

New results:

FBSD-11  SMP-8 1440+0    # no iflib, no EM_MULTIQUEUE
FBSD-11  SMP-8 1486+241  # no iflib, use EM_MULTIQUEUE (now saturate 1Gbps)
FBSD-cur SMP-8  533+0    # use iflib, use i386 with 4G KVA

iflib only decimates performance relative to the FreeBSD-11 version with
no EM_MULTIQUEUE, but EM_MULTIQUEUE gives better queueing using more CPUs.
This gives the extra 10-20% of performance needed to saturate the NIC and
1Gbps ethernet.

The FreeBSD-current version is not directly comparable, since using 4G KVA
on i386 reduces performance by about a factor of 2.5 for all loads with
mostly small i/o's (for 128K disk i/o's the reduction is only 10-20%).
i386 ran at about the same speed as amd64 when it had 1GB KVA, but I don't
have any saved results for amd64 to compare with precisely.  This is all
with security-related things like ibrs unavailable or turned off.

All versions use normal Intel interrupt moderation, which gives an
interrupt rate of 8k/sec.  Old versions of em use a "fast" interrupt
handler and a slow switch to a taskqueue.  This gives a context switch
rate of about 16k/sec.  In the SMP case, netblast normally runs on another
CPU and I think it fills the h/w tx queue(s) synchronously, and the
taskqueue only does minor cleanups.
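[A little back-of-envelope arithmetic makes the difference vivid.  This is
my own illustration, not from the original tests; the 16k and ~1M
switch-rate figures and the 1.1-1.8 usec lmbench latencies are the numbers
quoted elsewhere in this message.]

```c
/*
 * Back-of-envelope helpers (illustration only).
 *
 * pkts_per_switch(): packets handled per context switch, given a send
 * rate in kpps and a context switch rate in switches/sec.
 *
 * max_kpps_one_switch_per_pkt(): ceiling on the send rate if every
 * packet pays one full context switch of the given latency.
 */
static double
pkts_per_switch(double kpps, double switches_per_sec)
{
	return (kpps * 1000.0 / switches_per_sec);
}

static double
max_kpps_one_switch_per_pkt(double usec_per_switch)
{
	/* 1e6 usec/sec => 1e6/usec packets/sec => 1e3/usec kpps. */
	return (1000.0 / usec_per_switch);
}
```

[Old em at 1440 kpps with ~16k switches/sec handles about 90 packets per
switch; iflib at 1279 kpps with ~1M switches/sec handles only about 1.3
per raw switch (roughly 2.5 if switches are counted as wakeup/return
pairs).  And at the 1.1-1.8 usec lmbench switch latencies quoted later in
this message, one switch per packet by itself caps the rate at roughly
556-909 kpps, well below small-packet 1Gbps line rate (~1488 kpps).]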
Old em also has a ping latency about 10% lower than with iflib (73 usec
instead of 80 usec after setting em.x.itr to 0 and other tuning to kill
interrupt moderation, and similar for a bge NIC on the other end).  The
synchronous queue filling probably improves latency, but it is hard to see
how it makes a difference of more than 1 usec.  73 is already too high.
An old PRO1000 Intel NIC has a latency of only 50 usec on the same
network.  The switch costs about 20 usec of this.

iflib uses taskqueues more.  netblast normally runs on another CPU and I
think it only fills the s/w tx queue(s) synchronously, and wakes up the
taskqueues for every packet.  The CPUs are almost fast enough to keep up,
and the system does about 1M context switches for this (in versions other
than i386 with 4G KVA).  That is slightly more than 2 packets per switch
to get the speed of 1279 kpps.  netblast uses 100% of 1 CPU, but the
taskqueues don't saturate their CPUs, although they would have to in order
to do even more context switches.  They still use a lot of CPU (about 50%
of 1 CPU more than in old em).  These context switches lose by doing the
opposite of interrupt moderation.

I can "fix" the extra context switches and restore some of the lost
performance and most of the lost CPU by running netblast on the same CPU
as the main taskqueue (and using my normal configuration of no PREEMPTION
and no IPI_PREEMPTION), or by breaking the scheduler to never preempt to a
higher priority thread.  Non-broken schedulers preempt idle threads to run
higher priority threads even without PREEMPTION; PREEMPTION gives this
preemption for non-idle threads too.  So my "fix" stops the taskqueue
being preempted to on every packet.  netblast gets preempted eventually
and waits for the taskqueue, but it still manages to send more packets
using less CPU.  My "fix" doesn't quite give UP behaviour: PREEMPTION is
necessary with UP, and the "fix" depends on not having it.  I haven't
tested this.
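[Concretely, the "fix" amounts to something like the following.  This is a
sketch under assumptions: the CPU number is this machine's, netblast is
the sender from FreeBSD's tools/tools/netrate, and PREEMPTION and
IPI_PREEMPTION are the stock kernel config options named above.]

```shell
# Kernel side: build without full preemption, i.e. with
#
#   nooptions PREEMPTION
#   nooptions IPI_PREEMPTION
#
# in the kernel config, so a taskqueue wakeup does not preempt the
# sender on every packet.

# Userland side: pin the sender to the main tx taskqueue's CPU (CPU6 on
# this machine) instead of letting it run on a different CPU:
cpuset -l 6 netblast $lanhost 5001 5 10
```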
Scheduling makes little difference for old em, since the taskqueue only
runs for tx interrupts and then does very little.  tx interrupts are very
unimportant for this benchmark on old em and bge.  My bge tuning delays
them for up to 1 second if possible when tuning for throughput over
latency.

The relative effect of this "fix" is shown for the PRO1000 NIC by:

FBSD-cur SMP-1 293+773    # iflib, i386 with 4G KVA, cpuset to taskq CPU
FBSD-cur SMP-1 like SMP-8 # iflib, i386 with 4G KVA, cpuset to non-taskq CPU
FBSD-cur SMP-8 279+525    # iflib, i386 with 4G KVA

This NIC seemed to saturate at 280 kpps on all systems, but the "fix" lets
it reach 293 kpps and leaves enough CPU to spare to generate and drop an
extra 248 kpps.  The dropped packet count is a good test of the
combination of CPU to spare and efficiency of dropping packets.  Old
versions of FreeBSD and em have much more CPU to spare and drop packets
more efficiently by peeking at the ifq high in the network stack.  They
can generate and drop about 2000 kpps on this NIC, but the best iflib
version can only do this for about 1000 kpps.

The Haswell CPU has 4 cores x 2 threads, and sharing CPUs is about 67%
slower for each CPU of an HTT pair.  The main taskq is on CPU6 and the
other taskq is on CPU7.  Running netblast on CPU6 gives the speedup.
Running netblast on CPU7 gives HTT contention, but this makes little
difference.  On the PRO1000, where the NIC saturates first so that the
taskq's don't run so often, their CPU usages are about 35% for CPU6 and 1%
for CPU7 when netblast is run on CPU0.  So there is only about 35% HTT and
netblast contention when netblast is run on CPU7.

> For well behaved workloads (like Netflix's), I don't anticipate
> this being a performance issue.  However, I worry that this will impact
> other workloads and that you should consider running some testing of
> N:1 contention.  Eg, 128 netperfs running in parallel with only
> a few nic tx rings.
For the I218V, before iflib, 2 netblasts got closer to saturating the NIC
but 8 netblasts were slower than 1.  Checking this now with the PRO1000,
the total kpps counts (all with about 280 kpps actually sent) are:

1 netblast:   537
2 netblasts:  949 (high variance from now on, just a higher sample)
3 netblasts: 1123
4 netblasts: 1206
5 netblasts: 1129
6 netblasts: 1094
7 netblasts: 1080
8 netblasts: 1016

So the contention is much the same as before for the dropping-packets part
after the NIC saturates.  Maybe it is all in the network stack.  There is
a lot of slowness there too, so a 4GHz CPU is needed to almost keep up
with the network for small packets sent by any 1Gbps NIC.

Multi-queue NICs obviously need something like taskqueues to avoid
contention with multiple senders, but to be fast the taskqueues need
enough CPUs to dedicate 1 CPU per queue, and must not waste time and
latency context-switching this CPU to the idle thread.  According to
lmbench, the context switch latency on the test system is between 1.1 and
1.8 usec for all cases between 2proc/0K and 16proc/64K.  Context switches
to and from the idle thread are much faster, and they need to be to reach
1M/sec.

Watching context switches more carefully using top -m io shows that for 1
netblast to the PRO1000 they are:

259k/sec for if_io_tqg_6 (names are too verbose and are truncated by top)
259k/sec for idle: cpu<truncated> on same CPU as above
7.9k/sec for if_io_tqg_7
7.9k/sec for idle: cpu<truncated> on same CPU as above

These are much less than 1M/sec because i386 with 4G KVA is several times
slower than i386 with 1G KVA.  I mostly use the PRO1000 because its ping
latency with the best configuration is 50 usec instead of 80 usec, and
only the latency matters for nfs use.

Bruce