From owner-freebsd-net@FreeBSD.ORG Thu Oct 31 00:32:55 2013
Date: Thu, 31 Oct 2013 01:34:38 +0100
From: Luigi Rizzo
To: Andre Oppermann
Cc: Adrian Chadd, "freebsd-net@freebsd.org", Navdeep Parhar, Randall Stewart
Subject: Re: [long] Network stack -> NIC flow (was Re: MQ Patch.)
Message-ID: <20131031003438.GA10518@onelab2.iet.unipi.it>
In-Reply-To: <52718556.9010808@freebsd.org>

On Wed, Oct 30, 2013 at 11:16:54PM +0100, Andre Oppermann wrote:
> On 30.10.2013 06:00, Luigi Rizzo wrote:
...
> [skipping things replied to in other email]

Likewise, and let me thank you for the detailed comments.
I am adding a few comments myself below.

> > The architecture i think we should pursue is this (which happens to be
> > what linux implements, and also what dummynet implements, except
> > that the output is to a dummynet pipe or to ether_output() or to
> > ip_output() depending on the configuration):
> >
> > 1. multiple (one per core) concurrent transmitters t_c
>
> That's simply the number of cores that in theory could try to send
> a packet at the time? Or is it supposed to be an actual structure?

It is just the number of cores that could potentially compete at any
time for the use of one scheduler.

> > which use ether_output_frame() to send to
> >
> > 2. multiple disjoint queues q_j
> >    (one per traffic group, can be *a lot*, say 10^6)
>
> Whooo, that looks a bit excessive. So many traffic groups would
> effectively be one per flow?

It depends on what you define as "flow", and I explicitly did not use
the term as it is ambiguous. For me a traffic group is whatever a
classifier decides to put together.

The point of aiming for a large number of classes is to avoid making
assumptions that will limit us in the future, e.g. reserving too small
a field to represent the queue id, statically allocating queues, and
the like. Most schedulers in dummynet scale as O(1) with the number of
classes, so the only issue is having enough memory; in any case the
actual maximum number of classes depends on the output of your
classifier.

A lot of dummynet configurations (driving the upstream link for a leaf
network, so right in front of the bottleneck) use a handful of groups
_per host_: say one for voip, one for dns/ssh, one for bulk traffic,
assigning different weights.
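As a purely illustrative example of points 1 and 2 above and of the
per-class weights just mentioned, a minimal C sketch could look like
the following. All structure and function names are made up for this
example; it is not existing FreeBSD or dummynet code, and locking is
omitted.

/*
 * Rough sketch of the layering described above: many per-traffic-group
 * queues (q_j), each with a weight, drained by one scheduler instance
 * that feeds a single NIC ring.
 */
#include <stddef.h>
#include <stdint.h>

struct pkt {                     /* stand-in for an mbuf */
	struct pkt *next;
	uint32_t    class_id;    /* filled in by the classifier */
	uint32_t    len;
};

struct class_queue {             /* q_j: one per traffic group */
	struct pkt *head, *tail;
	uint32_t    weight;      /* relative share, as in WF2Q+/QFQ */
	uint32_t    backlog;     /* queued bytes */
};

struct sched_instance {          /* feeds exactly one NIC ring */
	struct class_queue *classes;   /* can be very many */
	size_t              nclasses;
	void              (*ring_xmit)(struct pkt *);  /* hand to hw ring */
};

/* Called concurrently by the per-core transmitters t_c (step 1 -> 2). */
void
class_enqueue(struct sched_instance *si, struct pkt *p)
{
	struct class_queue *q = &si->classes[p->class_id % si->nclasses];

	p->next = NULL;
	if (q->tail != NULL)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
	q->backlog += p->len;
}

/*
 * One weighted round-robin pass over the backlogged classes; a real
 * scheduler is more sophisticated but has the same interface.
 */
void
sched_drain(struct sched_instance *si)
{
	for (size_t i = 0; i < si->nclasses; i++) {
		struct class_queue *q = &si->classes[i];
		uint32_t budget = q->weight * 1500;   /* bytes this round */

		while (q->head != NULL && q->head->len <= budget) {
			struct pkt *p = q->head;

			q->head = p->next;
			if (q->head == NULL)
				q->tail = NULL;
			q->backlog -= p->len;
			budget -= p->len;
			si->ring_xmit(p);     /* packet is now "on the wire" */
		}
	}
}

A real scheduler such as QFQ replaces the linear scan in sched_drain()
with an O(1) selection of the next class, which is what makes very
large numbers of classes practical.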
A QFQ scheduler can easily end up with a few thousand queues and still
efficiently achieve fair sharing of bandwidth.

> Most of the time traffic is distributed into 4-8 classes with
> strict priority for the highest class (VoIP) and some sort of
> proportional WFQ for the others. At least that's the standard
> setup for carrier/ISP networks.

This is for two reasons:

- the ISP does not need to care about individual hosts within the
  customer's network, but only (possibly) about the coarse
  classification that the customer has made via TOS/COS bits;

- boxes that only have a handful of queues handled with priority cost
  infinitely less than decent ones, so ISPs have an incentive not to
  separate individual customers (which they should do), especially if
  the SLA is "your upstream bandwidth is 1 Mbit/s, but the guaranteed
  bandwidth is 30 Kbit/s" (typical ADSL in Italy).

But again, it is important that we support large sets of classes; we
do not necessarily have to use them.

> > Once a packet reaches R_j, for all practical purpose
> > is on the wire. We cannot intercept extractions,
> > we cannot interfere with the scheduler in the NIC in
> > case of multiqueue NICs. The most we can do (and should,
> > as in Linux) is notify the owner of the packet once its
> > transmission is complete.
>
> Per packet notification probably has a high overhead on high pps
> systems. The coalesced TX complete interrupt should do for QoS
> purposes as well to keep the DMA ring fed. We do not track who
> generated the packet and thus can't have the notification bubble
> up to the PCB (if any).

I know we don't do it now, but Linux does and performance is not
impacted badly. Notifications can easily be batched, and in the end
they only cause a selwakeup(). Anyway, this can be retrofitted if we
have a reference from the mbuf to the owner/socket, and a pointer to
a callback.

> > The decision on which queue and ring (Q_i and R_j) to use should be made
> > by a classifier at the beginning of step 2 (or once per iteration,
> > if using a hierarchical scheduler). Of course they can be precomputed
> > (e.g. with annotations in the mbuf coming from the socket).
>
> IMHO that is the job of a packet filter, or in simple cases can be
> transposed into the mbuf header from vlan header cos or IP header
> tos fields.

We are in sync here, just terminology differs. A classifier is the
first half of a packet filter (which first classifies and then applies
an action). And yes, the classification info can come from the headers.

cheers
luigi
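As a footnote to the last exchange above, the "simple case" of
transposing the VLAN CoS or IP TOS bits into an annotation carried
with the packet could look roughly like the sketch below. The
structure and field names are hypothetical, not the actual mbuf
packet-header fields.

/*
 * Hypothetical sketch: copy the VLAN CoS / IP DSCP bits into a class
 * annotation kept with the packet, so the scheduler never has to parse
 * headers again.
 */
#include <stdint.h>

struct pkt_meta {                /* would live in the mbuf packet header */
	uint8_t  cos;            /* 802.1p priority, 0..7 */
	uint8_t  dscp;           /* top 6 bits of the IP TOS byte */
	uint32_t class_id;       /* queue q_j chosen for this packet */
};

/*
 * Map CoS/DSCP to one of nclasses queues; a full packet filter could
 * return anything here, e.g. one class per customer or per host.
 */
uint32_t
classify_simple(const struct pkt_meta *m, uint32_t nclasses)
{
	uint32_t coarse = m->cos & 0x7;          /* 8 coarse classes */
	uint32_t fine = (m->dscp >> 3) & 0x7;    /* refined by DSCP class */

	return ((coarse * 8 + fine) % nclasses);
}

Computing the class once, before the packet is enqueued into its q_j,
matches the "precomputed annotations in the mbuf" idea quoted above.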