Date: Thu, 14 Feb 2019 15:32:18 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Marius Strobl
cc: rgrimes@freebsd.org, John Baldwin, src-committers@freebsd.org,
    Patrick Kelsey, svn-src-stable@freebsd.org, svn-src-all@freebsd.org,
    svn-src-stable-12@freebsd.org
Subject: Re: svn commit: r344027 - in stable/12/sys: dev/vmware/vmxnet3
    modules/vmware/vmxnet3 net

On Wed, 13 Feb 2019, Marius Strobl wrote:

> As for the iflib(4) status in head, I'm aware of two remaining
> user-visible regressions I ran myself into when trying to use
> em(4) in production.

I am aware of a few more:

- tx throughput loss for minimal packets of about 10% on my low end/1 queue
  NICs (I218-V, older I2*, and 82541PI).
  This hasn't changed much in the 2+ years since em(4) was converted to
  iflib, except some versions were another 10-20% slower and some of the
  slowness can be recovered using the tx_abdicate sysctl.

- average ping latency loss of about 13% on I218-V.  This has only been
  there for 6-12 months.  Of course this is with tuning for latency by
  turning off interrupt moderation as much as possible.

- errors on rx are recovered from badly in [l]em_isc_rxd_pkt_get() by
  incrementing the dropped packet count and returning EBADMSG.  This
  leaves the hardware queues in a bad state which is recovered from after
  a long time by resetting.  Many more packets are dropped, but the
  dropped packet count is only incremented by 1.  The pre-iflib driver
  handled this by dropping just 1 packet and continuing.  This is now
  hard to do, since iflib wants to build a list of packets and seems to
  have no way of handling bad packets in the list.  I use the quick fix
  of printing a message and putting the bad packet in the list.  I have
  only seen this problem on 82541PI.  I haven't checked whether the
  EBADMSG return is still mishandled by resetting.

- the NIC is not stopped for media changes.  This causes the same lockups
  as not stopping it for resume, but is less often a problem since you
  usually don't change the media for an active NIC.

> 1) TX UDP performance is abysmal even when using multiple queues and,
> thus, MSI-X.  In a quick test with netperf I see ~690 Mbits/s with 9216
> bytes and 282 Mbits/s with 42080 bytes on a Xeon E3-1245V2 and 82574
> with GigE connection (stable/11 e1000 drivers forward-ported to 12+
> achieve 957 Mbit/s in both cases).  2) TX TCP performance is abysmal
> when using MSI or INTx (that's likely also PR 235031).
>
> I have an upcoming iflib(4) fix for 2) but don't have an idea what's
> causing 1) so far.  I've identified two bugs in iflib(4) that likely
> have a minimal (probably more so with ixl(4), though) impact on UDP
> performance but don't explain the huge drop.

I don't see bad performance for large packets (except for the 82541PI --
it is PCI and can't get near saturating the network at any size).

Other problems: I mostly use i386, and its performance is now abysmal due
to its slow syscalls.  Its slowdowns also make comparison with old
benchmark results more difficult.

Typical numbers for netblast tests for I218-V on i386 on a Haswell
i7-4790K at 4.08GHz are:

    1500  kpps (line rate) for tuned FreeBSD-11 using 1.5 CPUs
    1400+ kpps for untuned FreeBSD-11 using 1 CPU
    1400- kpps for -current-before-iflib using 1 CPU
    1300- kpps for -current-after-iflib using 1.5 CPUs

The tuning for FreeBSD-11 is just EM_MULTIQUEUE.  The NIC has only 1
queue, but using another CPU to manage the queue seems to work right.
For iflib, the corresponding tuning seems to be to set the tx_abdicate
sysctl to 1.  This doesn't work so well.  It causes iflib to mostly waste
CPU by trying to do 2 context switches per packet (mostly from an idle
thread to an iflib thread).  The Haswell CPU can only do about 1 context
switch per microsecond, so the context switches are worse than useless
for achieving packet rates above 1000 kpps.  In old versions of iflib,
tx_abdicate is not a sysctl and is always enabled.  This is why iflib
takes an extra 0.5 CPUs in the above benchmark.
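For anyone scripting the comparison, the tx_abdicate knob can also be
flipped from a small C program with sysctlbyname(3) instead of sysctl(8).
This is only a sketch: the OID name dev.em.0.iflib.tx_abdicate is my
assumption for how iflib hangs its knobs off dev.<driver>.<unit>, and the
knob's width is probed instead of assumed:

/*
 * tx_abdicate.c: set an iflib per-interface knob from userland, i.e. the
 * C equivalent of "sysctl dev.em.0.iflib.tx_abdicate=1".
 * Usage: tx_abdicate [value] [oid].  The default OID name is an
 * assumption; the knob's size is probed so a u16 or int knob both work
 * (little-endian x86 assumed, as in the benchmarks above).
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <err.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
	const char *oid = argc > 2 ? argv[2] : "dev.em.0.iflib.tx_abdicate";
	uint64_t val = argc > 1 ? strtoull(argv[1], NULL, 0) : 1;
	size_t len = 0;

	/* Probe the knob's size, then write a value of exactly that size. */
	if (sysctlbyname(oid, NULL, &len, NULL, 0) == -1)
		err(1, "sysctlbyname(%s) size probe", oid);
	if (len == 0 || len > sizeof(val))
		errx(1, "%s has unexpected size %zu", oid, len);
	/* Little-endian: the low-order bytes of val hold the value. */
	if (sysctlbyname(oid, NULL, NULL, &val, len) == -1)
		err(1, "sysctlbyname(%s) set", oid);
	printf("%s set to %ju\n", oid, (uintmax_t)val);
	return (0);
}

Since the OID is taken as an argument, the same program can poke whatever
the interrupt moderation (itr) knob is called on a given version.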
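For reference, netblast (from tools/tools/netrate/netblast in the src
tree) is basically a tight sendto(2) loop.  A stripped-down sketch of the
same idea is below; the sink address, port and payload size are arbitrary
placeholders here, and the real tool takes them as arguments:

/*
 * blast.c: a stripped-down netblast -- a tight sendto(2) loop sending
 * minimal UDP packets and reporting the attempted rate.
 */
#include <sys/socket.h>

#include <netinet/in.h>
#include <arpa/inet.h>

#include <err.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define	NPKTS	10000000UL			/* packets to attempt */

int
main(void)
{
	struct sockaddr_in sin;
	struct timespec t0, t1;
	char payload[18] = { 0 };	/* small payload -> minimal frames */
	unsigned long i, sent = 0, failed = 0;
	double elapsed;
	int s;

	if ((s = socket(AF_INET, SOCK_DGRAM, 0)) == -1)
		err(1, "socket");
	memset(&sin, 0, sizeof(sin));
	sin.sin_len = sizeof(sin);
	sin.sin_family = AF_INET;
	sin.sin_port = htons(9);			/* discard port */
	sin.sin_addr.s_addr = inet_addr("192.168.0.1");	/* test sink */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < NPKTS; i++) {
		if (sendto(s, payload, sizeof(payload), 0,
		    (struct sockaddr *)&sin, sizeof(sin)) == -1)
			failed++;	/* usually ENOBUFS: tx queue full */
		else
			sent++;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	elapsed = (t1.tv_sec - t0.tv_sec) +
	    (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%lu sent, %lu failed, ~%.0f kpps attempted\n",
	    sent, failed, (sent + failed) / elapsed / 1000.0);
	close(s);
	return (0);
}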
Continuing the netblast numbers, for -current after both the iflib and
4+4 address space changes:

    533 kpps   worst ever observed in -current (config unknown)
    800 kpps   typical result before the pae_mode changes

Then for -current now (after the iflib, 4+4 and pae changes):

    500 kpps   pae_mode=1 (default)  tx_abdicate=0 (default)  1 CPU
    780 kpps   pae_mode=0            tx_abdicate=0 (default)  1 CPU
    591 kpps   pae_mode=0            tx_abdicate=1            1.5 CPUs

On amd64, the speed of syscalls hasn't changed much, so it still gets
about 1200 kpps in untuned configurations, and tx_abdicate works better,
so it can almost reach line rate using a bit more CPU than tuned
FreeBSD-11.

The extra context switches can also be avoided by not using SMP or by
binding the netblast thread to the same CPU as the main iflib thread.
This only helps when tx_abdicate=1:

    975 kpps   pae_mode=0            tx_abdicate=1  cpuset -l5  1 CPU

I.e., cpusetting improves the speed from 591 to 975 kpps!  I now seem to
remember that amd64 needed that too to get near line rate.

The context switch counts for some cases are:

- tx_abdicate=1, no cpuset: 1100+ k/sec (1 to and 1 from the iflib thread
  per packet)
- tx_abdicate=0, no cpuset: 8 k/sec (this is from the corrected itr=125)
- tx_abdicate=1, cpuset: 6 k/sec

The iflib thread does 1 switch to and 1 switch from per packet, so the
packet rate is half of its switch rate.  But the switch rate of 1M shown
by systat -v is wrong.  It apparently doesn't include context switches
for the cpu-idle threads.  top -m io shows these.  Context switches to
and from the idle thread are cheaper than most, especially for i386 with
4+4 and pae, but they are still heavyweight so should be counted
normally.

Binding of iflib threads to CPUs is another problem.  It gets in the way
of the scheduler choosing the best CPU dynamically, so it is only
obviously right if you have CPUs to spare.  The 4BSD scheduler handles
bound threads especially badly.  This is fixed in my version, but I
forgot to enable the fix for these tests, and anyway, the fix and
scheduling in general only make much difference on moderately loaded
systems.  (For the light load of running only netblast and some daemons,
there is CPU to spare.  For heavy loads when there is no CPU to spare,
the scheduler can't do much.  My fixes with the Haswell 4x2 CPU topology
reduce to trying to use only 1 CPU out of each HTT pair.  So when iflib
binds to CPU 6, if CPU 6 is running another thread, this thread has to
be kicked off CPU 6 and should not be moved to CPU 7 while iflib is
running on CPU 6.  Even when there is another inactive HTT pair, moving
it is slow.)

iflib has some ifdefs for SCHED_ULE only.  I doubt that static
scheduling like it does can work well.  It seems to do the opposite of
what is right -- preferring threads on the same core makes these threads
run slower when they run concurrently, by competing for resources.  The
slowdown on Haswell for CPUs competing in an HTT pair is to about 2/3 of
full speed (each CPU runs about 1/3 slower, so the speed of 2 CPUs is at
best 4/3 times that of 1 CPU).  Anyway, iflib obviously doesn't
understand scheduling, since its manual scheduling runs 975/591 times
slower than my manual scheduling, without even any HTT contention or
kicking netblast or another user thread off iflib's CPU.  The slowness
is just from kicking an idle thread off iflib's CPU.  If there are
really CPUs to spare, then the iflib thread should not yield to even the
idle thread.  Then it would work a bit like DEVICE_POLLING.
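As an aside on the pinning above: the cpuset -l5 part can also be done
inside the test program itself with cpuset_setaffinity(2), which is
handy when scripting the runs.  A minimal sketch; CPU 5 here just
matches the cpuset -l5 used above:

/*
 * Pin the current process to CPU 5 (the equivalent of "cpuset -l5"),
 * so that e.g. the send loop shares a CPU with the iflib tx thread.
 * CPU 5 is only an example; pick whichever CPU the driver thread uses.
 */
#include <sys/param.h>
#include <sys/cpuset.h>

#include <err.h>
#include <stdio.h>

int
main(void)
{
	cpuset_t set;

	CPU_ZERO(&set);
	CPU_SET(5, &set);
	if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
	    sizeof(set), &set) == -1)
		err(1, "cpuset_setaffinity");
	printf("pinned to CPU 5\n");
	/* ... run the send loop here ... */
	return (0);
}

Calling this before the send loop has the same effect as wrapping the
whole program in cpuset(1).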
I don't like polling, though, and would want to do this with something
like halt waiting for an interrupt or, better, monitor waiting for a
network event.  cpu_idle() already does suitable things.  DEVICE_POLLING
somehow reduced ping latency by a lot (from 60+ usec to 30 usec) on
older systems and NICs, at the cost of a lot of power for spinning in
idle and not actually helping if the system is not idle.  I don't see
how it can do this.  The interrupt latency with interrupt moderation
turned off should be only about 1 usec.

Summary: using unobvious tuning and small fixes, I can get iflib'ed em
to work almost as well as FreeBSD-11 em.

Bruce