From owner-freebsd-net@FreeBSD.ORG Wed Oct 30 21:30:35 2013
Message-ID: <52717A62.7040600@freebsd.org>
Date: Wed, 30 Oct 2013 22:30:10 +0100
From: Andre Oppermann <andre@freebsd.org>
To: Luigi Rizzo, Adrian Chadd, Navdeep Parhar, Randall Stewart,
 "freebsd-net@freebsd.org"
Subject: Re: [long] Network stack -> NIC flow (was Re: MQ Patch.)
In-Reply-To: <20131030050056.GA84368@onelab2.iet.unipi.it>
List-Id: Networking and TCP/IP with FreeBSD

On 30.10.2013 06:00, Luigi Rizzo wrote:
> On Tue, Oct 29, 2013 at 06:43:21PM -0700, Adrian Chadd wrote:
>> Hi,
>>
>> We can't assume the hardware has deep queues _and_ we can't just hand
>> packets to the DMA engine.
>> [Adrian explains why]
>
> i have the feeling that the various folks who stepped into this
> discussion seem to have completely different (and orthogonal) goals
> and as such these goals should be discussed separately.

It looks like it, and it is great to have this discussion. :)

> Below is the architecture i have in mind and how i would implement it
> (and it would be extremely simple since we have most of the pieces
> in place).

[Omitted: the good and thorough QoS description further down, to be
replied to separately]

> It would be useful if people could discuss what problem they are
> addressing before coming up with patches.

Right now Glebius and I are working on the struct ifnet abstraction,
which has become severely bloated and blurred over the years. The goal
of the first step is to make it opaque to the drivers for better
API/ABI stability.
When looking at struct ifnet and its place in the kernel, it becomes
evident that its actual purpose is to serve as the abstraction of a
logical layer 3 protocol interface towards the layer 2 mapping and
encapsulation, and eventually, somewhat tangentially, the real
hardware.

Now ifnet has become very complex and large, and it should be brought
back to its original purpose of being the logical layer 3 interface
abstraction. There isn't necessarily a 1:1 mapping from one ifnet
instance to one hardware interface. In fact there are purely logical
ifnets (gre, tun, ...), direct hardware ifnets (simple network
interfaces like fxp(4)), and multiple logical interfaces on top of a
single hardware interface (vlan, lagg, ...). Depending on the ifnet's
purpose the backend can be very different.

Thus I want to decouple the current implicit notion of
ifnet == hardware, with its associated queuing and such. Instead ifnet
should become a layer 3 abstraction inside the kernel again and
delegate all lower layers to appropriate protocol-, layer 2-, and
hardware-specific implementations.

From this comes the following *rough* implementation approach to be
tested (ignore the naming for now):

 /* Function pointers for packets descending into layer 2 */
 (*if_l2map)(ifnet, mbuf, sockaddr, [route]);  /* from upper stack */
 (*if_tx)(ifnet, mbuf);                        /* to driver or qos */
 (*if_txframe)(ifnet, mbuf);                   /* to driver */
 (*if_txframedone)(ifnet);                     /* callback to qos */

 /* Function pointers for packets coming up from layer 1 */
 (*if_l2demap)(ifnet, mbuf);                   /* l2/l3 unmapping */

When a packet comes down the stack, (*if_l2map) gets called to map and
encapsulate a layer 3 packet into an appropriate layer 2 frame. For IP
this would be ether_output() together with ARP and so on. The result
of that step is the ethernet header in front of the IP packet.
ether_output() then calls (*if_tx) to have the frame sent out on the
wire(less); this is the driver handoff point for DMA ring addition.
Normally (*if_tx) and (*if_txframe) are the same and the job is done.
When software QoS is active, (*if_tx) points into the soft-QoS enqueue
implementation, which will eventually use (*if_txframe) to push out
onto the wire those packets it sees fit.

In addition the drivers have to expose functions to manage the number
and depth of their DMA rings, or rather the number/size of packets
that can be enqueued onto them, plus the (*if_txframedone) callback to
clock out packets from a soft queue or QoS discipline. When QoS is
active it probably wants to make the DMA rings small and the software
queue(s) large to be effective.

In the default setup, and when running a server, no QoS will be active
or inserted. No, or only very small, software queues exist to handle
concurrency (except for ieee80211, which does sophisticated frame
management inside (*if_txframe)). Whenever the DMA ring is full there
is no point in queuing up more packets. Instead the socket buffers act
as buffers and also provide flow control and backpressure up to
userspace, limiting kernel memory usage from double and triple
buffering.

How the packets are efficiently pushed out onto the wire is up to the
drivers and depends on the hardware capabilities. It can be multiple
hardware DMA rings, or just a single ring with an efficient concurrent
access method.

-- 
Andre