From owner-freebsd-net@FreeBSD.ORG Wed Oct 30 21:30:35 2013
Message-ID: <52717A62.7040600@freebsd.org>
Date: Wed, 30 Oct 2013 22:30:10 +0100
From: Andre Oppermann <andre@freebsd.org>
To: Luigi Rizzo, Adrian Chadd, Navdeep Parhar, Randall Stewart,
 "freebsd-net@freebsd.org"
Subject: Re: [long] Network stack -> NIC flow (was Re: MQ Patch.)
In-Reply-To: <20131030050056.GA84368@onelab2.iet.unipi.it>
List-Id: Networking and TCP/IP with FreeBSD

On 30.10.2013 06:00, Luigi Rizzo wrote:
> On Tue, Oct 29, 2013 at 06:43:21PM -0700, Adrian Chadd wrote:
>> Hi,
>>
>> We can't assume the hardware has deep queues _and_ we can't just hand
>> packets to the DMA engine.
>> [Adrian explains why]
>
> i have the feeling that the various folks who stepped into this
> discussion seem to have completely different (and orthogonal) goals
> and as such these goals should be discussed separately.

It looks like it, and it is great to have this discussion. :)

> Below is the architecture i have in mind and how i would implement it
> (and it would be extremely simple since we have most of the pieces
> in place).

[Omitted: the good and thorough QoS description further down, to be
replied to separately]

> It would be useful if people could discuss what problem they are
> addressing before coming up with patches.

Right now Glebius and I are working on the struct ifnet abstraction,
which has become severely bloated and blurred over the years. The goal
of the first step is to make it opaque to the drivers for better
API/ABI stability.
When looking at struct ifnet and its place in the kernel, it becomes
evident that its actual purpose is to serve as the abstraction of a
logical layer 3 protocol interface towards the layer 2 mapping and
encapsulation, and eventually, somewhat tangentially, the real
hardware.

Now ifnet has become very complex and large, and it should be brought
back to its original purpose of being the logical layer 3 interface
abstraction. There isn't necessarily a 1:1 mapping from one ifnet
instance to one hardware interface. In fact there are purely logical
ifnets (gre, tun, ...), direct hardware ifnets (simple network
interfaces like fxp(4)), and multiple logical interfaces on top of a
single hardware interface (vlan, lagg, ...). Depending on the ifnet's
purpose the backend can be very different.

Thus I want to decouple the current implicit notion of
ifnet == hardware, with its associated queuing and such. Instead ifnet
should become a layer 3 abstraction inside the kernel again and
delegate all lower layers to appropriate protocol-, layer 2-, and
hardware-specific implementations.

From this comes the following *rough* implementation approach to be
tested (ignore the naming for now):

 /* Function pointers for packets descending into layer 2 */
 (*if_l2map)(ifnet, mbuf, sockaddr, [route]);  /* from upper stack */
 (*if_tx)(ifnet, mbuf);                        /* to driver or qos */
 (*if_txframe)(ifnet, mbuf);                   /* to driver */
 (*if_txframedone)(ifnet);                     /* callback to qos */

 /* Function pointers for packets coming up from layer 1 */
 (*if_l2demap)(ifnet, mbuf);                   /* l2/l3 unmapping */

When a packet comes down the stack, (*if_l2map) gets called to map and
encapsulate a layer 3 packet into an appropriate layer 2 frame. For IP
this would be ether_output() together with ARP and so on. The result
of that step is the ethernet header in front of the IP packet.
ether_output() then calls (*if_tx) to have the frame sent out on the
wire(less); this is the driver handoff point for DMA ring addition.
Normally (*if_tx) and (*if_txframe) are the same and the job is done.
When software QoS is active, (*if_tx) points into the soft-QoS enqueue
implementation, which will eventually use (*if_txframe) to push out
onto the wire those packets it sees fit.

In addition the drivers have to expose functions to manage the number
and depth of their DMA rings, or rather the number/size of packets
that can be enqueued onto them, plus the (*if_txframedone) callback to
clock out packets from a soft queue or QoS discipline. When QoS is
active it probably wants to make the DMA rings small and the software
queue(s) large to be effective.

In the default setup, and when running a server, no QoS will be active
or inserted. No, or only very small, software queues exist to handle
concurrency (except for ieee80211, which does sophisticated frame
management inside (*if_txframe)). Whenever the DMA ring is full there
is no point in queuing up more packets. Instead the socket buffers act
as buffers and also provide flow control and backpressure up to
userspace, limiting kernel memory usage from double and triple
buffering.

How the packets are efficiently pushed out onto the wire is up to the
drivers and depends on the hardware capabilities. It can be multiple
hardware DMA rings, or just a single ring with an efficient concurrent
access method.

-- 
Andre