Date:      Mon, 17 Dec 2018 14:50:04 -0500
From:      Andrew Gallatin <gallatin@cs.duke.edu>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        Slava Shwartsman <slavash@freebsd.org>, src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en
Message-ID:  <b81d9232-d703-2d4f-eec2-f9b48a0ccd3b@cs.duke.edu>
In-Reply-To: <20181218033137.Q2217@besplex.bde.org>
References:  <201812051420.wB5EKwxr099242@repo.freebsd.org> <9e09a2f8-d9fd-7fde-8e5a-b7c566cdb6a9@cs.duke.edu> <20181218033137.Q2217@besplex.bde.org>

On 12/17/18 2:08 PM, Bruce Evans wrote:
> On Mon, 17 Dec 2018, Andrew Gallatin wrote:
> 
>> On 12/5/18 9:20 AM, Slava Shwartsman wrote:
>>> Author: slavash
>>> Date: Wed Dec  5 14:20:57 2018
>>> New Revision: 341578
>>> URL: https://svnweb.freebsd.org/changeset/base/341578
>>>
>>> Log:
>>>    mlx5en: Remove the DRBR and associated logic in the transmit path.
>>>    The hardware queues are deep enough currently and using the DRBR
>>>    and associated callbacks only leads to more task switching in the
>>>    TX path. There is also a race setting the queue_state which can
>>>    lead to hung TX rings.
>>
>> The point of DRBR in the tx path is not simply to provide a software 
>> ring for queuing excess packets.  Rather it provides a mechanism to
>> avoid lock contention by shoving a packet into the software ring, where
>> it will later be found & processed, rather than blocking the caller on
>> a mtx lock.   I'm concerned you may have introduced a performance
>> regression for use cases where you have N:1  or N:M lock contention 
>> where many threads on different cores are contending for the same tx 
>> queue.  The state of the art for this is no longer DRBR, but mp_ring,
>> as used by both cxgbe and iflib.
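
To make the contention-avoidance point concrete, the classic drbr
if_transmit pattern looks roughly like the sketch below (hypothetical
driver names, not the actual mlx5en code; foo_encap() stands in for
posting one packet to the hardware ring):

/*
 * Sketch of the drbr if_transmit pattern: senders always enqueue to the
 * lock-free software ring first, and only the thread that wins the
 * trylock drains it, so a contending sender never sleeps on the TX mutex.
 */
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>
#include <sys/buf_ring.h>
#include <net/if.h>
#include <net/if_var.h>

struct foo_txq {
	struct mtx	 mtx;	/* protects the hardware descriptor ring */
	struct buf_ring	*br;	/* software (drbr) ring in front of it */
};

static int foo_encap(struct foo_txq *, struct mbuf *);	/* hypothetical */

/* Drain the software ring into the hardware ring; caller holds txq->mtx. */
static void
foo_txq_drain(struct ifnet *ifp, struct foo_txq *txq)
{
	struct mbuf *m;

	while ((m = drbr_peek(ifp, txq->br)) != NULL) {
		if (foo_encap(txq, m) != 0) {
			drbr_putback(ifp, txq->br, m);	/* HW ring full */
			break;
		}
		drbr_advance(ifp, txq->br);
	}
}

static int
foo_transmit_ring(struct ifnet *ifp, struct foo_txq *txq, struct mbuf *m)
{
	int error;

	/* Enqueue first; buf_ring makes this safe without the mutex. */
	error = drbr_enqueue(ifp, txq->br, m);
	if (error != 0)
		return (error);

	/* Whoever wins the trylock drains for everyone else. */
	if (mtx_trylock(&txq->mtx)) {
		foo_txq_drain(ifp, txq);
		mtx_unlock(&txq->mtx);
	}
	return (0);
}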
> 
> iflib uses queuing techniques to significantly pessimize em NICs with 1
> hardware queue.  On fast machines, it attempts to do 1 context switch per

This can happen even w/o contention when "abdicate" is enabled in mp
ring.  I complained about this as well, and the default was changed in
mp ring to not always "abdicate" (e.g., switch to the tq to handle the
packet).  Abdication substantially pessimizes Netflix-style uncontended
web workloads, but it generally helps small-packet forwarding.
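
Conceptually, the knob changes the enqueue path roughly like the sketch
below; the structure and mpr_*() helpers are made up for illustration
and this is not the real sys/net/mp_ring.c:

/*
 * Conceptual sketch only of what "abdicate" changes in the enqueue path.
 */
#include <sys/param.h>
#include <sys/errno.h>
#include <sys/taskqueue.h>

struct mpr_sketch {
	struct taskqueue	*tq;
	struct task		 task;	/* runs the ring's drain function */
	/* ring storage and consumer state omitted */
};

static int  mpr_put(struct mpr_sketch *, void *);		/* hypothetical */
static bool mpr_try_become_consumer(struct mpr_sketch *);	/* hypothetical */
static void mpr_drain(struct mpr_sketch *);			/* hypothetical */

static int
mpr_enqueue_sketch(struct mpr_sketch *r, void *item, bool abdicate)
{
	if (mpr_put(r, item) != 0)
		return (ENOBUFS);

	if (abdicate) {
		/*
		 * Hand the work to the ring's taskqueue and return.  Good
		 * under heavy contention, but a single uncontended sender
		 * pays roughly one context switch per packet.
		 */
		taskqueue_enqueue(r->tq, &r->task);
	} else if (mpr_try_become_consumer(r)) {
		/* Drain inline, much like the drbr trylock pattern. */
		mpr_drain(r);
	} else {
		/* Somebody else is draining; let the taskqueue catch up. */
		taskqueue_enqueue(r->tq, &r->task);
	}
	return (0);
}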

It is interesting that you see the opposite.  I should try benchmarking
with just a single ring.



> (small) tx packet and can't keep up.  On slow machines it has a chance of
> handling multiple packets per context switch, but since the machine is too
> slow it can't keep up and saturates at a slightly different point.  Results
> for netblast $lanhost 5001 5 10 (5-byte payload for 10 seconds) on an I218V
> on Haswell 4 cores x 2 threads @4.08GHz running i386:
> 
> Old results with no iflib and no EM_MULTIQUEUE except as indicated:
> 
> FBSD-10     UP    1377+0
> FBSD-11     UP    1326+0
> FBSD-11     SMP-1 1484+0
> FBSD-11     SMP-8 1395+0
> FBSD-12mod  SMP-1 1386+0
> FBSD-12mod  SMP-8 1422+0
> FBSD-12mod  SMP-1 1270+0   # use iflib (lose 8% performance)
> FBSD-12mod  SMP-8 1279+0   # use iflib (lose 10% performance using more CPU)
> 
> 1377+0 means 1377 kpps sent and 0 kpps errors, etc.  SMP-8 means use all 8
> CPUs.  SMP-1 means restrict netblast to 1 CPU different from the taskqueue
> CPUs using cpuset.
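
For anyone reproducing that kind of pinning: cpuset(1) does it from the
shell, and the same restriction can be applied programmatically with
cpuset_setaffinity(2).  A minimal userland sketch, with the CPU number
chosen arbitrarily:

#include <sys/param.h>
#include <sys/cpuset.h>
#include <err.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	cpuset_t mask;

	if (argc < 2)
		errx(1, "usage: pin1 command [args]");

	CPU_ZERO(&mask);
	CPU_SET(2, &mask);	/* example: a CPU away from the NIC taskqueues */
	if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
	    sizeof(mask), &mask) != 0)
		err(1, "cpuset_setaffinity");

	/* exec the sender (e.g. netblast) so it inherits the restricted set */
	execvp(argv[1], argv + 1);
	err(1, "execvp");
}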
> 
> New results:
> 
> FBSD-11     SMP-8 1440+0   # no iflib, no EM_MULTIQUEUE
> FBSD-11     SMP-8 1486+241 # no iflib, use EM_MULTIQUEUE (now saturate 1Gbps)
> FBSD-cur    SMP-8  533+0   # use iflib, use i386 with 4G KVA
> 
> iflib only decimates performance relative to the FreeBSD-11 version
> with no EM_MULTIQUEUE, but EM_MULTIQUEUE gives better queueing using
> more CPUs.  This gives the extra 10-20% of performance needed to
> saturate the NIC and 1Gbps ethernet.  The FreeBSD-current version is
> not directly comparable since using 4G KVA on i386 reduces performance
> by about a factor of 2.5 for all loads with mostly small i/o's (for
> 128K disk i/o's the reduction is only 10-20%).  i386 ran at about the
> same speed as amd64 when it had 1GB KVA, but I don't have any saved
> results for amd64 to compare with precisely.  This is all with
> security-related things like ibrs unavailable or turned off.
> 
> All versions use normal Intel interrupt moderation which gives an interrupt
> rate of 8k/sec.
> 
> Old versions of em use a "fast" interrupt handler and a slow switch
> to a taskqueue.  This gives a context switch rate of about 16k/sec.
> In the SMP case, netblast normally runs on another CPU and I think it
> fills h/w tx queue(s) synchronously, and the taskqueue only does minor
> cleanups.  Old em also has a ping latency of about 10% smaller than
> with iflib (73 usec instead of 80 usec after setting em.x.itr to 0 and
> other tuning to kill interrupt moderation, and similar for a bge NIC
> on the other end).  The synchronous queue filling probably improves
> latency, but it is hard to see how it makes a difference of more than
> 1 usec.  73 is already too high.  An old PRO1000 Intel NIC has a latency
> of only 50 usec on the same network.  The switch costs about 20 usec
> of this.
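
That "fast interrupt handler + slow switch to a taskqueue" arrangement
looks roughly like the sketch below (hypothetical driver, not the actual
if_em code; taskqueue creation, the task handler that does the cleanup
and re-enables the interrupt, and error handling are all omitted):

#include <sys/param.h>
#include <sys/bus.h>
#include <sys/taskqueue.h>

struct foo_softc {
	device_t		 dev;
	struct resource		*irq_res;
	void			*irq_tag;
	struct task		 irq_task;	/* handler re-enables the interrupt */
	struct taskqueue	*tq;
};

static void foo_disable_intr(struct foo_softc *);	/* hypothetical */

/* Runs in interrupt context: mask cheaply, defer the real work. */
static int
foo_intr_filter(void *arg)
{
	struct foo_softc *sc = arg;

	foo_disable_intr(sc);
	taskqueue_enqueue(sc->tq, &sc->irq_task);
	return (FILTER_HANDLED);
}

static int
foo_setup_intr(struct foo_softc *sc)
{
	/* Filter only, no ithread handler: the taskqueue is the "ithread". */
	return (bus_setup_intr(sc->dev, sc->irq_res,
	    INTR_TYPE_NET | INTR_MPSAFE, foo_intr_filter, NULL, sc,
	    &sc->irq_tag));
}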
> 
> iflib uses taskqueue more.  netblast normally runs on another CPU and
> I think it only fills s/w tx queue(s) synchronously, and wakes up the
> taskqueues for every packet.  The CPUs are almost fast enough to keep
> up, and the system does about 1M context switches for this (in versions
> other than i386 with 4G KVA).  That is slightly more than 2 packets per
> switch to get the speed of 1279 kpps.  netblast uses 100% of 1 CPU, but
> the taskqueues don't saturate their CPUs, although they should so as to
> do even more context switches.  They still use a lot of CPU (about 50%
> of 1 CPU more than in old em).  These context switches lose by doing
> the opposite of interrupt moderation.
> 
> I can "fix" the extra context switches and restore some of the lost
> performance and most of the lost CPU by running netblast on the same
> CPU as the main taskqueue (and using my normal configuration of no
> PREEMPTION and no IPI_PREEMPTION) or by breaking the scheduler to never
> preempt to a higher priority thread.  Non-broken schedulers preempt
> idle threads to run higher priority threads even without PREEMPTION.
> PREEMPTION gives this preemption for non-idle threads too.  So my
> "fix" stops the taskqueue being preempted to on every packet.
> netblast gets preempted eventually and waits for the taskqueue, but
> it still manages to send more packets using less CPU.
> 
> My "fix" doesn't quite give UP behaviour.  PREEMPTION is necessary with
> UP, and the "fix" depends on not having it.  I haven't tested this.
> Scheduling makes little difference for old em since the taskqueue only
> runs for tx interrupts and then does very little.  tx interrupts are
> very unimportant for this benchmark on old em and bge.  My bge tuning
> delays them for up to 1 second if possible when tuning for throughput
> over latency.
> 
> The relative effect of this "fix" is shown for the PRO1000 NIC by:
> 
> FBSD-cur  SMP-1 293+773    # iflib, i386 with 4G KVA, cpuset to taskq CPU
> FBSD-cur  SMP-1 like SMP-8 # iflib, i386 with 4G KVA, cpuset to non-taskq CPU
> FBSD-cur  SMP-8 279+525    # iflib, i386 with 4G KVA
> 
> This NIC seemed to saturate at 280 kpps on all systems, but the "fix"
> lets it reach 293 kpps and leaves enough CPU to spare to generate and
> drop 248 kpps.  The dropped packet count is a good test of the combination
> of CPU to spare and efficiency of dropping packets.  Old versions of
> FreeBSD and em have much more CPU to spare and drop packets more efficiently
> by peeking at the ifq high in the network stack.  They can generate and
> drop about 2000 kpps on this NIC, but the best iflib version can only
> do this for about 1000 kpps.
> 
> The Haswell CPU has 4 cores x 2 threads, and sharing CPUs is about 67%
> slower for each CPU of an HTT pair.  The main taskq is on CPU6 and the
> other taskq is on CPU7.  Running netblast on CPU6 gives the speedup.
> Running netblast on CPU7 gives HTT contention, but this makes little
> difference.  On the PRO1000, where the NIC saturates first so that the
> taskq's don't run so often, their CPU usages are about 35% for CPU6 and
> 1% for CPU7 when netblast is run on CPU0.  So there is only about 35%
> HTT and netblast contention when netblast is run on CPU7.
> 
>> For well behaved workloads (like Netflix's), I don't anticipate
>> this being a performance issue.  However, I worry that this will impact
>> other workloads and that you should consider running some testing of
>> N:1 contention.   Eg, 128 netperfs running in parallel with only
>> a few nic tx rings.
> 
> For the I218V, before iflib 2 netblasts got closer to saturating the NIC
> but 8 netblasts were slower than 1.  Checking this now with the PRO1000,
> the total kpps counts (all with about 280 kpps actually sent) are:
> 
> 1 netblast:   537
> 2 netblasts:  949 (high variance from now on, just a higher sample)
> 3 netblasts: 1123
> 4 netblasts: 1206
> 5 netblasts: 1129
> 6 netblasts: 1094
> 7 netblasts: 1080
> 8 netblasts: 1016
> 
> So the contention is much the same as before for the dropping-packets part
> after the NIC saturates.  Maybe it is all in the network stack.  There is
> a lot of slowness there too, so a 4GHz CPU is needed to almost keep up with
> the network for small packets sent by any 1Gbps NIC.
> 
> Multi-queue NICs obviously need something like taskqueues to avoid contention
> with multiple senders, but for the taskqueues to be fast you have to have
> enough CPUs to dedicate 1 CPU per queue and not waste time and latency context-
> switching this CPU to the idle thread.  According to lmbench, the context
> switch latency on the test system is between 1.1 and 1.8 usec for all cases
> between 2proc/0K and 16proc/64K.  Context switches to and from the idle
> thread are much faster, and they need to be to reach 1M/sec.  Watching
> context switches more carefully using top -m io shows that for 1 netblast
> to the PRO1000 they are:
> 
> 259k/sec for if_io_tqg_6 (names are too verbose and are truncated by top)
> 259k/sec for idle: cpu<truncated> on same CPU as above
> 7.9k/sec for if_io_tqg_7
> 7.9k/sec for idle: cpu<truncated> on same CPU as above
> 
> These are much less than 1M/sec because i386 with 4G KVA is several times
> slower than i386 with 1G KVA.
> 
> I mostly use the PRO1000 because its ping latency with best configuration
> is 50 usec instead of 80 usec and only the latency matters for nfs use.
> 
> Bruce

I think part of the problem with iflib is just its sheer size, but
some of that is hard to avoid due to it being an intermediate shim
layer & having to translate packets.

One thing I think could help is to do the conversion from mbufs
to iflib's packet info data structure at entry into iflib, and not
at xmit time.  This is the stuff that parses
the mbufs, does the virt to phys, and basically will take cache
misses if it runs on a different CPU, but seems less likely to
take cache misses if run just after ether_output() (which has
likely already taken misses).   I've been trying to find the
time to make this change for a while.  It would be interesting
to see if it helps your workload too.
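
To sketch the idea (with a made-up structure rather than iflib's actual
packet info, and ignoring VLAN tags and deeper header parsing): walk the
mbuf once at the entry point, while the headers are still warm in the
sender's cache, and carry the small parsed descriptor along instead of
re-deriving it later on the taskqueue CPU.

#include <sys/param.h>
#include <sys/errno.h>
#include <sys/mbuf.h>
#include <netinet/in.h>
#include <net/ethernet.h>

struct foo_pkt_info {
	struct mbuf	*fpi_m;			/* original chain */
	uint32_t	 fpi_len;		/* total packet length */
	uint32_t	 fpi_csum_flags;	/* requested offloads */
	uint16_t	 fpi_etype;		/* ethertype, host order */
	uint8_t		 fpi_ehdrlen;		/* link-layer header length */
};

/* Called from the transmit entry point, right after ether_output(). */
static int
foo_parse_early(struct mbuf *m, struct foo_pkt_info *pi)
{
	struct ether_header *eh;

	if (m->m_len < (int)sizeof(*eh))
		return (EINVAL);	/* a real version would m_pullup() */

	eh = mtod(m, struct ether_header *);
	pi->fpi_m = m;
	pi->fpi_len = m->m_pkthdr.len;
	pi->fpi_csum_flags = m->m_pkthdr.csum_flags;
	pi->fpi_etype = ntohs(eh->ether_type);
	pi->fpi_ehdrlen = ETHER_HDR_LEN;	/* ignores VLAN tags */
	return (0);
}

The busdma / virt-to-phys work would still happen at ring-fill time, but
the header parsing and its cache misses would move to the CPU that has
already touched the packet.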

Drew



