Date:      Sat, 23 Jan 2016 11:27:39 -0800
From:      Adrian Chadd <adrian@freebsd.org>
To:        Marcus Cenzatti <cenzatti@hush.com>
Cc:        Pavel Odintsov <pavel.odintsov@gmail.com>, FreeBSD Net <freebsd-net@freebsd.org>,  Eduardo Meyer <dudu.meyer@gmail.com>
Subject:   Re: netmap design question - accessing netmap:X-n individual queues on FreeBSD
Message-ID:  <CAJ-VmonV2=LaEYtihJLRjKgTXQTiPzXJ9BY_py_08B0+byC9Vw@mail.gmail.com>
In-Reply-To: <20160123184320.CA903A0126@smtp.hushmail.com>
References:  <CAEqdE_4ANVrGP2hKA4nT=AJqJ5M80A+Hy2srjoe8wfugvmbypg@mail.gmail.com> <CALgsdbe7W2fPEWMnobXWebn63J9kYhupE-C=JM2xfQKBfnQwaw@mail.gmail.com> <CAJ-VmonCizOxfsV7kH_7GTHYXW8FgvjEV-8zt7qxE6b5tUadxg@mail.gmail.com> <CALgsdbdZOkpX-hzLVkLxz=1kGOjnNCNSZz4B1HfJy2hART4+0w@mail.gmail.com> <CAJ-VmokP2f5Uv5ubn2Cico_iXN20jL9tPuYHE2gomGLWkPLsUA@mail.gmail.com> <20160123184320.CA903A0126@smtp.hushmail.com>

ok, so it's .. a little more complicated than that.

The chelsio hardware (thanks jim!) and intel hardware (thanks
sean/limelight!) do support various kinds of traffic hashing into
different queues. The common subset of behaviour is what the Microsoft
RSS requirement spec defines: you can hash on the v4/v6 L3 headers, or
on v4+(tcp,udp) ports and v6+(tcp,udp) ports, depending on the traffic
and the RSS config.
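
Roughly, the hash the spec calls for is a Toeplitz hash over that
tuple, keyed with a secret key. A minimal sketch (illustration only,
not lifted from any driver):

#include <stdint.h>
#include <stddef.h>

/*
 * Toeplitz (Microsoft RSS) hash sketch.  'key' is the RSS secret key
 * and must be at least datalen + 4 bytes long (40 bytes covers all the
 * spec'd tuples); 'data' is the concatenated tuple in network byte
 * order, e.g. for TCP/IPv4: src IP, dst IP, src port, dst port (12 bytes).
 */
static uint32_t
rss_toeplitz_hash(const uint8_t *key, const uint8_t *data, size_t datalen)
{
	/* 32-bit window into the key, slid left one bit per input bit. */
	uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
	    ((uint32_t)key[2] << 8) | key[3];
	uint32_t hash = 0;
	size_t i;
	int b;

	for (i = 0; i < datalen; i++) {
		for (b = 7; b >= 0; b--) {
			if (data[i] & (1 << b))
				hash ^= window;
			window = (window << 1) | ((key[i + 4] >> b) & 1);
		}
	}
	return (hash);
}

Two NICs with the same key and the same tuple selection hash a given
flow identically; change either and the distribution changes.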

Now, each NIC and driver has different defaults. This meant that yes,
each NIC and driver would distribute traffic differently, leading to
some pretty amusing differences in workload behaviour that were purely
due to how the driver defaults worked.

This is one of the motivations behind me pushing the freebsd rss stuff
along - I wanted the RX queue config code and the RSS hashing config
code to be done explicitly in the driver, matching the system
configuration, so that we /have/ that code. Otherwise it's arbitrary -
some drivers hash on just L3 headers, some hash on L3/L4 headers,
some hash fragments, some may not; some drivers populate the RSS hash
id in the mbuf flowid (cxgbe); some just put the queue id in there
(the intel drivers).
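
So anything consuming those mbufs has to check what the flowid
actually means before trusting it; roughly (M_HASHTYPE_GET() and the
M_HASHTYPE_* values are in sys/mbuf.h):

#include <sys/param.h>
#include <sys/mbuf.h>

/* Returns 1 if m_pkthdr.flowid is a real RSS hash, 0 if it's only opaque. */
static int
flowid_is_rss_hash(struct mbuf *m)
{
	switch (M_HASHTYPE_GET(m)) {
	case M_HASHTYPE_RSS_IPV4:
	case M_HASHTYPE_RSS_TCP_IPV4:
	case M_HASHTYPE_RSS_IPV6:
	case M_HASHTYPE_RSS_TCP_IPV6:
		return (1);	/* e.g. cxgbe: flowid is the RSS hash itself */
	case M_HASHTYPE_OPAQUE:
	default:
		return (0);	/* e.g. a bare queue id: fine for binning, not as a hash */
	}
}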

So when you use "RSS" in -HEAD, the NICs that support it hopefully
obey the config in sys/net/rss_config.c - which is to hash on TCP
ports for v4/v6 traffic, and just the L3 v4/v6 headers for everything
else. The NICs support hashing on UDP ports too, but the challenge is
that fragments hash differently from non-fragments, so you have to
reconstitute the fragments back into a normal packet and rehash it in
software. Now, for TCP, fragments are infrequent, but for UDP they're
quite frequent in some workloads. So I defaulted UDP RSS to off and
just let UDP be hashed on the L3 addresses.
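
Spelled out as hash-type bits, that default looks something like this
(the RSS_HASHTYPE_* names are from sys/net/rss_config.h; the variable
here is made up just for illustration):

#include <sys/types.h>
#include <net/rss_config.h>

/*
 * Illustrative default: 4-tuple hashing for TCP over v4/v6, plain L3
 * 2-tuple hashing for everything else.  Note the UDP bits are left
 * out, so UDP falls back to the L3 hash.
 */
static const uint32_t example_rss_hashconfig =
    RSS_HASHTYPE_RSS_IPV4 | RSS_HASHTYPE_RSS_TCP_IPV4 |
    RSS_HASHTYPE_RSS_IPV6 | RSS_HASHTYPE_RSS_TCP_IPV6;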

This means that if you run with RSS enabled, and you're using a
supported NIC (igb, ixgbe, cxgbe; ixl is currently broken, and I think
the mellanox driver does RSS now?), then it'll distribute like this
(there's a sketch of the hash-to-queue step after the list):

* TCP traffic: hashed on L3 (src, dst) and L4 (src/dst ports);
* UDP traffic: hashed on L3 (src, dst) only;
* other (eg ICMP, GRE, etc): hashed on L3 (src, dst) only;
* non-IP traffic: currently hashed however the NIC does it; this
isn't currently expressed via RSS.
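
The hash-to-queue step is the same in all of those cases: the low bits
of the 32-bit hash index an indirection table of RX queue ids.
Roughly (the table size is illustrative; 128 entries is a common choice):

#include <stdint.h>

/* Sketch of the final step: pick the RX queue from the hash's low bits. */
#define	RSS_TABLE_SIZE	128

static int
hash_to_rx_queue(uint32_t rss_hash, const uint8_t table[RSS_TABLE_SIZE])
{
	return (table[rss_hash & (RSS_TABLE_SIZE - 1)]);
}

Which is also why a single flow only ever exercises a single RX queue.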

If you use RSS on a non-supported NIC, then it'll re-hash the packet
in software and distribute work to multiple netisr queues as
appropriate.

Now, my eventual (!) plan is to expose the RSS queue/key/hash config
per NIC rather than global, and then users can select what they want
for things like netmap, and let the kernel TCP/UDP stack code rehash
things as appropriate. But it's all a spare-time project for me at the
moment, and right now I'm debugging the ixl RSS code and preparing
some UDP stack changes to cope with uhm, "behaviour."

Ok, so with all of that - if you don't use RSS in HEAD, then the
traffic distribution will depend purely upon the driver defaults.
intel and chelsio should be hashing on TCP/UDP headers for those
packets, and just straight L3 only for others. There's no shared RSS
key - sometimes it's random (older intel drivers), sometimes it's
fixed (cxgbe), so you can end up with reboot-to-reboot variations with
intel drivers. If you're doing straight iperf testing, with one or a
small number of connections, it's quite likely you'll end up with only
a small subset of the RX queues being used. If you want to see the NIC
really distribute things, you should use pkt-gen with the random
port/IP options that I added in -HEAD about a year ago now.
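
And since the thread started with netmap:X-n - the easy way to see
what actually lands on one hardware ring is to open just that ring and
count. A rough sketch using the nm_open() helpers from
net/netmap_user.h (the interface and ring number are just examples):

#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>

#include <poll.h>
#include <stdio.h>

/* Attach to hardware RX ring 3 of ix0 and count what the NIC steers to it. */
int
main(void)
{
	struct nm_desc *d;
	struct nm_pkthdr h;
	struct pollfd pfd;
	unsigned long count = 0;

	d = nm_open("netmap:ix0-3", NULL, 0, NULL);
	if (d == NULL) {
		perror("nm_open");
		return (1);
	}
	pfd.fd = NETMAP_FD(d);
	pfd.events = POLLIN;
	for (;;) {
		if (poll(&pfd, 1, 1000) <= 0)
			continue;
		while (nm_nextpkt(d, &h) != NULL)
			count++;
		printf("ring 3: %lu packets so far\n", count);
	}
	/* not reached */
	nm_close(d);
	return (0);
}

Run one of these per ring (netmap:ix0-0 through -N) and you can see
directly how the NIC's hashing spreads your test traffic.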

Next - how well does it scale across multiple queues? I noticed on
the cxgbe hardware that adding more RX queues actually caused the
aggregate RX throughput to drop by a few million pps. After Navdeep
and I poked at it, we and the chelsio hardware people concluded that
because netmap is doing one-buffer-is-one-packet on receive, the RX
DMA engine may not be keeping up with feeding it descriptors, and we
really should be supporting RX buffer batching. It doesn't show up at
10g line rate,
only when you're trying to hit the NIC theoretical max on 40g. Yeah,
we hit the NIC theoretical max on 40g with one RX queue, but then we
couldn't do any work - that's why I was trying to farm it out to
multiple queues via hardware.

Finally - some drivers had some very .. silly defaults for interrupt
handling. The chelsio driver was generating a lot of notifications in
netmap mode, which navdeep heavily tuned and fixed whilst we were
digging into 40g behaviour. The intel adaptive interrupt moderation
(AIM) code assumes you're not using netmap, and the netmap code
doesn't increment the right counters, so AIM is just plainly broken
and you end up with crazily high, varying interrupt rates which really
slow things down.

(there are also NUMA-related things at high pps rates, but I'm going
to ignore those for now.)

I hope this helps.



-adrian


