From: "Alexander V. Chernikov" <melifaro@FreeBSD.org>
Date: Wed, 06 Feb 2013 18:19:27 +0400
To: net@freebsd.org, freebsd-hackers@FreeBSD.org
Subject: Make kernel aware of NIC queues
Message-ID: <5112666F.3050904@FreeBSD.org>

Hello list!

Today more and more NICs are capable of splitting traffic into different RX/TX rings, permitting the OS to dispatch this traffic on different CPU cores. However, some problems arise when using multi-NIC (or even single multi-port NIC) configurations.

Typical questions the OS has to answer are:
* how many queues should we allocate per port?
* how should we mark packets received on a given queue?
* what traffic pattern is the NIC used for: should we bind queues to CPU cores and, if so, to which ones?

Currently, there is some AI implemented in the Intel drivers, like:
* use the maximum number of available queues if the CPU has a large number of cores
* bind every queue to a CPU core sequentially

The problems with this (and probably any) AI are:

* Which NICs (ports) will _actually_ be used?
E.g., I have an 8-core system with a dual-port Intel 82576 NIC (capable of using 8 RX queues per port). If only one port is used, I can allocate 8 (or 7) queues and bind them to the given cores, which is generally good for forwarding traffic. For 2-port setups it is probably better to set up 4 queues per port to make sure the ithreads of the different ports do not interfere with each other.

* How exactly should we mark packets?
There are traffic flows which are not hashed properly by the NIC (mostly non-IP/IPv6 traffic; PPPoE and various tunnels are good examples), so the driver receives all such packets on q0 and marks them with flowid 0, which can be unhandy in some situations. It would be better if we could instruct the NIC not to mark such packets with any ID, permitting the OS to recalculate the hash via a probably more powerful netisr hash function.

* Traffic flow inside the OS / flowid marking
Smarter flowid marking may be needed in some cases: for example, if we are using lagg over two NICs for traffic forwarding, the current marking results in increased contention on the transmit side. From the previous example:
  port 0 has q0-q3 bound to cores 0-3
  port 1 has q0-q3 bound to cores 4-7
  flow IDs are the same as the core numbers
lagg picks the egress port as (flowid % number_of_ports), and the TX queue on that port is then chosen as (flowid % number_of_queues), which leads to TX contention; the sketch below walks through the arithmetic.
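A minimal standalone sketch of that selection arithmetic (illustrative only: NPORTS/NQUEUES are just the numbers from the example above, and the real selection is of course done inside if_lagg and the drivers' multiqueue transmit paths):

#include <stdio.h>

/* Illustrative constants from the example: 2 lagg ports, 4 TX queues each. */
#define NPORTS  2
#define NQUEUES 4

int
main(void)
{
        unsigned int flowid;

        for (flowid = 0; flowid < 8; flowid++) {
                unsigned int port = flowid % NPORTS;   /* egress NIC picked by lagg */
                unsigned int txq = flowid % NQUEUES;   /* TX queue on that NIC */

                printf("flowid %u -> port %u, txq %u\n", flowid, port, txq);
        }
        /*
         * The output shows flow IDs 0/4, 1/5, 2/6 and 3/7 landing on the
         * same TX queue of the same port - hence the contention.
         */
        return (0);
}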
Spelled out for flow IDs 0-7:

  flowid 0: (0 % 2) = port 0, (0 % 4) = queue 0
  flowid 1: (1 % 2) = port 1, (1 % 4) = queue 1
  flowid 2: (2 % 2) = port 0, (2 % 4) = queue 2
  flowid 3: (3 % 2) = port 1, (3 % 4) = queue 3
  flowid 4: (4 % 2) = port 0, (4 % 4) = queue 0
  flowid 5: (5 % 2) = port 1, (5 % 4) = queue 1
  flowid 6: (6 % 2) = port 0, (6 % 4) = queue 2
  flowid 7: (7 % 2) = port 1, (7 % 4) = queue 3

Flow IDs 0 and 4, 1 and 5, 2 and 6, and 3 and 7 use the same TX queue on the same egress NIC. This can be minimized by choosing a configuration with GCD(queues, ports) = 1 (3 queues per port should do the trick in this case), but that leads to suboptimal CPU usage.

We internally use patched igb/ix drivers which permit setting flow IDs manually (and I have heard of other people using hacks to enable/disable setting M_FLOWID).

I propose implementing a common API which permits drivers to:
* read the user-supplied number of queues and other queue options (e.g. ...)
* notify the kernel of each RX/TX queue being created/destroyed
* bind queues to cores via the given API
* export data to userland (for example, via sysctl) to permit users to:
  a) quickly see the current configuration
  b) change CPU bindings on the fly
  c) change flowid numbers on the fly (with the possibility to 1) use the NIC-supplied hash, 2) use a manually supplied value, or 3) disable setting M_FLOWID)

Having a common interface will make network stack tuning easier for users and puts us one step further towards a (probably userland) AI which can auto-tune the system according to a template ("router", "webserver") and the rc.conf configuration (lagg presence, etc.).

What do you guys think?
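P.S. To make the proposal a bit more concrete, here is a very rough sketch of what the driver-facing part of such an API could look like. All names and types below are purely illustrative (nothing like this exists in the tree today); it is only meant to show the kind of per-queue information the kernel would track.

#include <sys/types.h>

struct ifnet;                           /* owning network interface */

enum nic_queue_type { NIC_QUEUE_RX, NIC_QUEUE_TX, NIC_QUEUE_RXTX };

enum nic_flowid_mode {
        NIC_FLOWID_HASH,                /* mark mbufs with the NIC-supplied hash */
        NIC_FLOWID_FIXED,               /* mark mbufs with a manually supplied value */
        NIC_FLOWID_NONE                 /* do not set M_FLOWID at all */
};

struct nic_queue {
        struct ifnet            *nq_ifp;        /* owning NIC */
        int                      nq_idx;        /* queue index within the NIC */
        enum nic_queue_type      nq_type;
        int                      nq_cpu;        /* current CPU binding, -1 = none */
        enum nic_flowid_mode     nq_flowid_mode;
        uint32_t                 nq_flowid;     /* value used in FIXED mode */
};

/* Driver side: announce queues as they are created and destroyed. */
int     nic_queue_register(struct nic_queue *nq);
void    nic_queue_unregister(struct nic_queue *nq);

/*
 * Stack/userland side (e.g. backing a sysctl tree): rebind a queue to a
 * CPU or change its flowid policy on the fly.
 */
int     nic_queue_set_cpu(struct nic_queue *nq, int cpu);
int     nic_queue_set_flowid(struct nic_queue *nq, enum nic_flowid_mode mode,
            uint32_t flowid);

Userland visibility could then simply be one writable sysctl node per registered queue (again, only as an example) so that CPU bindings and flowid policy can be inspected and changed at runtime.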