Date: Sat, 23 Jan 2016 11:27:39 -0800
Subject: Re: netmap design question - accessing netmap:X-n individual queues on FreeBSD
From: Adrian Chadd <adrian.chadd@gmail.com>
To: Marcus Cenzatti
Cc: Pavel Odintsov, FreeBSD Net <freebsd-net@freebsd.org>, Eduardo Meyer

ok, so it's .. a little more complicated than that.

The chelsio hardware (thanks jim!) and intel hardware (thanks sean/limelight!) do support various kinds of traffic hashing into different queues. The common subset of behaviour is the microsoft RSS requirement spec: you can hash on the v4/v6 headers, as well as v4+(tcp,udp) ports and v6+(tcp,udp) ports, and which of those applies depends on the traffic and the RSS config. Now, each NIC and driver has different defaults.
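To make "hash on the v4/v6 headers plus ports" concrete, here's a rough userland sketch of the Toeplitz hash that the microsoft RSS spec calls for. Treat it as an illustration only - it's not the kernel code (the in-tree RSS bits live around sys/net/rss_config.c), and the key/tuple layout shown is just the usual convention:

#include <stdint.h>
#include <stddef.h>

/*
 * Toeplitz hash sketch.  'key' is the RSS secret key (40 bytes covers
 * any v4/v6 tuple); 'data' is the concatenated tuple in network byte
 * order, e.g. for TCP/IPv4: src addr (4) | dst addr (4) | src port (2)
 * | dst port (2).
 */
static uint32_t
toeplitz_hash(const uint8_t *key, const uint8_t *data, size_t datalen)
{
    uint32_t hash = 0;
    /* Sliding 32-bit window over the key bits. */
    uint32_t window = ((uint32_t)key[0] << 24) | (key[1] << 16) |
        (key[2] << 8) | key[3];
    size_t i;
    int bit;

    for (i = 0; i < datalen; i++) {
        for (bit = 0; bit < 8; bit++) {
            /* XOR in the window for every set input bit... */
            if (data[i] & (0x80 >> bit))
                hash ^= window;
            /* ...then slide the window one key bit to the left. */
            window <<= 1;
            if (key[i + 4] & (0x80 >> bit))
                window |= 1;
        }
    }
    return (hash);
}

The NIC (or the software fallback) then takes some of the low-order bits of that 32-bit result, looks them up in an indirection table, and that picks the RX queue the packet lands on - so everything below about "distribution" is really about which inputs feed this hash and how that table is set up.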
So yes: because those defaults differ, each NIC and driver would distribute traffic differently, leading to some pretty amusing differences in workload behaviour that were purely due to how the driver defaults worked. This is one of the motivations behind me pushing the freebsd rss stuff along - I wanted the RX queue config and the RSS hashing config to be done explicitly in each driver, matching the system-wide configuration, so that we /have/ that code. Otherwise it's arbitrary - some drivers hash on just the L3 headers, some hash on L3/L4 headers, some hash fragments, some may not; some drivers populate the RSS hash id in the mbuf flowid (cxgbe), some just put the queue id in there (the intel drivers).

So when you use "RSS" in -HEAD, the NICs that support it hopefully obey the config in sys/net/rss_config.c - which is to hash on TCP ports for v4/v6 traffic, and on just the L3 v4/v6 headers for everything else. The NICs can hash on UDP too, but the challenge is that fragments hash differently to non-fragments, so you have to reconstitute the fragments back into a normal packet and rehash it in software. For TCP, fragments are infrequent, but for UDP they're quite frequent in some workloads. So I defaulted UDP RSS to off and just let UDP be hashed on the L3 addresses.

This means that if you have RSS enabled and you're using a supported NIC (igb, ixgbe, cxgbe; ixl is currently broken, and I think the mellanox driver does RSS now?) then it'll distribute like this:

* TCP traffic: hashed on L3 (src, dst) and L4 (ports);
* UDP traffic: hashed on L3 (src, dst) only;
* other IP traffic (eg ICMP, GRE, etc): hashed on L3 (src, dst) only;
* non-IP traffic: currently hashed however the NIC does it; this isn't expressed via RSS yet.

If you use RSS on a non-supported NIC, it'll re-hash the packet in software and distribute work to multiple netisr queues as appropriate.

Now, my eventual (!) plan is to expose the RSS queue/key/hash config per NIC rather than globally, so users can select what they want for things like netmap and let the kernel TCP/UDP stack code rehash things as appropriate. But it's all a spare-time project for me at the moment, and right now I'm debugging the ixl RSS code and preparing some UDP stack changes to cope with, uhm, "behaviour."

Ok, so with all of that: if you don't use RSS in -HEAD, the traffic distribution will depend purely upon the driver defaults. intel and chelsio should be hashing on the TCP/UDP headers for those packets, and on just straight L3 for everything else. There's no shared RSS key - sometimes it's random (older intel drivers), sometimes it's fixed (cxgbe) - so you can end up with reboot-to-reboot variations with the intel drivers.

If you're doing straight iperf testing with one or a small number of connections, it's quite likely you'll end up with only a small subset of the RX queues being used. If you want to see the NIC really distribute things, use pkt-gen with the random port/IP options that I added to -HEAD about a year ago now.

Next - how well does it scale across multiple queues? I noticed on the cxgbe hardware that adding more RX queues actually caused the aggregate RX throughput to drop by a few million pps. After Navdeep and I poked at it, we and the chelsio hardware people concluded that because netmap does one-buffer-is-one-packet on receive, the RX DMA engine may not be keeping up with feeding it descriptors, and we really should be supporting RX buffer batching.
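Since the subject line is about netmap:X-n: splitting load across hardware rings from userland is exactly what that syntax does. As a hedged sketch (the interface name and ring number are placeholders, and it assumes the nm_open()/nm_nextpkt() helpers that net/netmap_user.h exposes when NETMAP_WITH_LIBS is defined), attaching to a single RX ring looks roughly like this:

#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>

#include <poll.h>
#include <stdio.h>

int
main(void)
{
    /*
     * "netmap:ix0-3" attaches to hardware ring 3 of ix0 only;
     * plain "netmap:ix0" would grab all of the rings at once.
     */
    struct nm_desc *d = nm_open("netmap:ix0-3", NULL, 0, NULL);
    struct nm_pkthdr h;
    struct pollfd pfd;
    const u_char *buf;

    if (d == NULL) {
        perror("nm_open");
        return (1);
    }
    pfd.fd = NETMAP_FD(d);
    pfd.events = POLLIN;
    for (;;) {
        poll(&pfd, 1, -1);
        /* Drain whatever the ring currently holds. */
        while ((buf = nm_nextpkt(d, &h)) != NULL) {
            /* h.len bytes of frame at 'buf'; count/process it here. */
        }
    }
    /* not reached in this endless sketch */
    nm_close(d);
    return (0);
}

The usual pattern is one process (or thread, pinned to a core) per ring - "netmap:ix0-0" through "netmap:ix0-3" for a 4-queue config - and then the RSS/driver hashing described above decides which of those rings each flow ends up on.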
That throughput drop doesn't show up at 10g line rate - only when you're trying to hit the NIC's theoretical max at 40g. Yeah, we hit the NIC's theoretical max at 40g with one RX queue, but then we couldn't do any work - that's why I was trying to farm it out to multiple queues via hardware.

Finally - some drivers had some very .. silly defaults for interrupt handling. The chelsio driver was generating a lot of notifications in netmap mode, which navdeep heavily tuned and fixed whilst we were digging into the 40g behaviour. The intel interrupt moderation code assumes you're not using netmap, and the netmap code doesn't increment the right counters, so AIM is just plainly broken and you end up with crazy-high, varying interrupt rates which really slow things down.

(There are also NUMA-related things at high pps rates, but I'm going to ignore those for now.)

I hope this helps.

-adrian
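ps: for anyone who wants to actually try the RSS bits, the kernel config knobs should be roughly the following - double-check sys/conf/NOTES though, this is a sketch rather than a recipe:

options 	RSS
options 	PCBGROUP

Rebuild with those, and the supported drivers will pull the key/bucket setup from sys/net/rss_config.c at attach time instead of using their own defaults.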