Date:      Wed, 30 Oct 2013 23:16:54 +0100
From:      Andre Oppermann <andre@freebsd.org>
To:        Luigi Rizzo <rizzo@iet.unipi.it>, Adrian Chadd <adrian@freebsd.org>,  Navdeep Parhar <np@freebsd.org>, Randall Stewart <rrs@lakerest.net>,  "freebsd-net@freebsd.org" <net@freebsd.org>
Subject:   Re: [long] Network stack -> NIC flow (was Re: MQ Patch.)
Message-ID:  <52718556.9010808@freebsd.org>
In-Reply-To: <20131030050056.GA84368@onelab2.iet.unipi.it>
References:  <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org> <5270462B.8050305@freebsd.org> <CAJ-Vmo=6thETmTDrPYRjwNTEQaTWkSTKdRYN3eRw5xVhsvr5RQ@mail.gmail.com> <20131030050056.GA84368@onelab2.iet.unipi.it>

On 30.10.2013 06:00, Luigi Rizzo wrote:
> On Tue, Oct 29, 2013 at 06:43:21PM -0700, Adrian Chadd wrote:
>> Hi,
>>
>> We can't assume the hardware has deep queues _and_ we can't just hand
>> packets to the DMA engine.
>> [Adrian explains why]

[skipping things replied to in other email]

> The architecture i think we should pursue is this (which happens to be
> what linux implements, and also what dummynet implements, except
> that the output is to a dummynet pipe or to ether_output() or to
> ip_output() depending on the configuration):
>
>     1. multiple (one per core) concurrent transmitters t_c

That's simply the number of cores that in theory could try to send
a packet at the same time?  Or is it supposed to be an actual structure?

> 	which use ether_output_frame() to send to
>
>     2. multiple disjoint queues q_j
> 	(one per traffic group, can be *a lot*, say 10^6)

Whooo, that looks a bit excessive.  So many traffic groups would
effectively be one per flow?

Most of the time traffic is distributed into 4-8 classes with
strict priority for the highest class (VoIP) and some sort of
proportional WFQ for the others.  At least that's the standard
setup for carrier/ISP networks.

> 	which are scheduled with a scheduler S
>          (iterate step 2 for hierarchical schedulers)
> 	and

Makes sense.

>     3. eventually feed ONE transmit ring R_j on the NIC.

Agreed, more than one wouldn't work because the NIC would then do
poor man's RR among the queues.

> 	Once a packet reaches R_j, for all practical purposes
> 	it is on the wire. We cannot intercept extractions,
> 	we cannot interfere with the scheduler in the NIC in
> 	case of multiqueue NICs. The most we can do (and should,
> 	as in Linux) is notify the owner of the packet once its
> 	transmission is complete.

Per-packet notification probably has too high an overhead on high-pps
systems.  The coalesced TX-complete interrupt should be sufficient for
QoS purposes as well as for keeping the DMA ring fed.  We do not track
who generated a packet and thus can't have the notification bubble up
to the PCB (if any).

> Just to set the terminology:
> QUEUE MANAGEMENT POLICY refers to decisions that we make when we INSERT
> 	or EXTRACT packets from a queue WITHOUT LOOKING AT OTHER QUEUES .
> 	This is what implements DROPTAIL (also improperly called FIFO),
> 	RED, CODEL. Note that for CODEL you need to intercept extractions
> 	from the queue, whereas DROPTAIL and RED only act on
> 	insertions.

Ack.
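
In code the distinction boils down to whether a policy needs a dequeue
hook at all.  Rough standalone sketch, all names made up (not the
dummynet or ALTQ API):

struct pkt;                             /* stand-in for an mbuf */

struct qm_policy {
        /*
         * Called on insertion; may drop the packet (return nonzero)
         * without looking at any other queue.  DROPTAIL and RED only
         * need this hook.
         */
        int     (*qm_enqueue)(struct qm_policy *, struct pkt *);
        /*
         * Called on extraction; only policies like CODEL that decide
         * to drop at dequeue time need it, otherwise NULL.
         */
        struct pkt *(*qm_dequeue)(struct qm_policy *);
        void    *qm_state;              /* per-queue policy state */
};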

> SCHEDULER is the entity which decides which queue to serve among
> 	the many possible ones. It is called on INSERTIONS and
> 	EXTRACTIONS from a queue, and passes packets to the NIC's queue.

Ack.

> The decision on which queue and ring (Q_i and R_j) to use should be made
> by a classifier at the beginning of step 2 (or once per iteration,
> if using a hierarchical scheduler). Of course they can be precomputed
> (e.g. with annotations in the mbuf coming from the socket).

IMHO that is the job of a packet filter, or in simple cases the class
can be transposed into the mbuf header from the VLAN header CoS or IP
header ToS fields.
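
For the simple transposition something like this should be enough
(standalone sketch; the 3-bit precedence/PCP mapping is my assumption,
a real classifier would be table driven and store the result in the
mbuf packet header):

#include <stdint.h>

/*
 * Map the top 3 bits of the IP ToS/DSCP field (the old IP precedence)
 * or the VLAN PCP bits to one of 8 traffic classes.
 */
static inline unsigned int
tos_to_class(uint8_t tos)
{
        return ((tos >> 5) & 0x7);
}

static inline unsigned int
vlan_pcp_to_class(uint16_t tci)
{
        return ((tci >> 13) & 0x7);     /* PCP is the top 3 TCI bits */
}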

> Now when it comes to implementing the above, we have three
> cases (or different optimization levels, if you like)

-- 0. THE NO QOS CASE ---

No QoS is done and multiple DMA rings are selected based on the flowid
to reduce lock contention while avoiding packet reordering.
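
That is nothing more than this (standalone sketch of what the
multiqueue drivers do today with m_pkthdr.flowid):

#include <stdint.h>

/*
 * Pick a TX ring from the flow hash: packets of one flow always hit
 * the same ring (no reordering), different flows spread over the
 * rings (less lock contention).
 */
static inline unsigned int
select_tx_ring(uint32_t flowid, unsigned int nrings)
{
        return (flowid % nrings);
}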

> -- 1. THE SIMPLE CASE ---
>
> In the simplest possible case we can let the NIC do everything.
> Necessary conditions are:
> - queue management policies acting only on insertions
>    (e.g. DROPTAIL or RED or similar);
> - # of traffic classes <= # of NIC rings
> - scheduling policy S equal to the one implemented in the NIC
>    (trivial case: one queue, one ring, no scheduler)
>
> All these cases match exactly what the hardware provides, so we can just
> use the NIC ring(s) without extra queue(s), and possibly use something
> like buf_ring to manage insertions (but note that insertions in
> an empty queue will end up requiring a lock; and i think the
> same happens even now with the extra drbr queue in front of the ring).

Agreed.  A lock on the DMA ring is always required to protect the ring
structure and the NIC doorbell.  Software queuing or buf_ring shouldn't
be necessary at all.  Only some mechanism to make concurrent access/backoff
to the same DMA ring cheaper might be useful, for example having one
packet slot per core instead of spinning on the lock.
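
To illustrate the per-core slot idea (very rough standalone sketch with
C11 atomics, all names made up; a real version has to handle a still
occupied slot and re-check the slots after dropping the lock, the point
is only that the loser doesn't spin):

#include <stdatomic.h>
#include <stddef.h>

#define MAXCPU  64

struct pkt;
struct txring {
        atomic_flag     lock;                   /* the ring lock */
        struct pkt * _Atomic slot[MAXCPU];      /* one parked pkt/core */
};

/* Stands for the actual descriptor write + doorbell. */
void    ring_put(struct txring *, struct pkt *);

static void
tx_one(struct txring *r, struct pkt *p, unsigned int cpu)
{
        if (atomic_flag_test_and_set(&r->lock)) {
                /* Lock busy: park the packet instead of spinning. */
                atomic_store(&r->slot[cpu], p);
                return;
        }
        ring_put(r, p);
        /* Drain what other cores parked while we held the lock. */
        for (unsigned int i = 0; i < MAXCPU; i++) {
                struct pkt *q = atomic_exchange(&r->slot[i], NULL);
                if (q != NULL)
                        ring_put(r, q);
        }
        atomic_flag_clear(&r->lock);
}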

> -- 2. THE INTERMEDIATE CASE ---
>
> If we do not care about a scheduler but want a more complex QUEUE
> MANAGEMENT, such as CODEL, that acts on extractions, we _must_
> implement an intermediate queue Q_i before the NIC ring.  This is
> our only chance to act on extractions from the queue (which CODEL
> requires).  Note that we DO NOT NEED to create multiple queues for
> each ring.

As long as the NIC doesn't implement fair RR or interleaving among
multiple DMA rings, any sort of queue management is futile.  Whenever
queue management is active, only one DMA ring may be used and it should
be as small as possible to give maximum decision latitude to the queue
management.
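
The dequeue side of that intermediate queue would look roughly like
this (sketch, not actual CODEL, all names made up):

struct pkt;
struct softq;
struct dmaring;

struct pkt     *softq_remove(struct softq *);           /* dequeue    */
int             dmaring_space(struct dmaring *);        /* free slots */
void            dmaring_put(struct dmaring *, struct pkt *);
void            pkt_drop(struct pkt *);
/*
 * Extraction-time decision, e.g. CODEL's sojourn-time check: nonzero
 * means drop the packet instead of sending it.
 */
int             policy_drop_on_dequeue(struct softq *, struct pkt *);

static void
softq_drain(struct softq *q, struct dmaring *r)
{
        struct pkt *p;

        while (dmaring_space(r) > 0 && (p = softq_remove(q)) != NULL) {
                if (policy_drop_on_dequeue(q, p)) {
                        pkt_drop(p);
                        continue;
                }
                dmaring_put(r, p);
        }
}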

> -- 3. THE COMPLETE CASE ---
>
> This is when the scheduler we want (DRR, WFQ variants, PRIORITY...)
> is not implemented in the NIC, or we have more queues than those
> available in the NIC. In this case we need to invoke this extra
> block before passing packets to the NIC.

Again the same as in 2. applies, just with a more complex soft queue
and scheduler.
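
For reference, one deficit round robin round over the soft queues would
look roughly like this (heavily simplified standalone sketch, names
made up; dummynet's DRR is the real thing):

struct pkt;
struct drr_queue {
        int               quantum;      /* bytes per round, the weight */
        int               deficit;      /* accumulated allowance       */
        struct drr_queue *next;         /* circular list of active queues */
};

struct pkt     *queue_head(struct drr_queue *);         /* peek, or NULL */
struct pkt     *queue_remove(struct drr_queue *);
int             pkt_len(struct pkt *);
void            ring_transmit(struct pkt *);

static void
drr_round(struct drr_queue *active)
{
        struct drr_queue *q = active;
        struct pkt *p;

        do {
                q->deficit += q->quantum;
                while ((p = queue_head(q)) != NULL &&
                    pkt_len(p) <= q->deficit) {
                        q->deficit -= pkt_len(p);
                        ring_transmit(queue_remove(q));
                }
                if (queue_head(q) == NULL)
                        q->deficit = 0; /* an idled queue loses its credit */
                q = q->next;
        } while (q != active);
}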

> Remember that dummynet implements exactly #3, and it comes with a
> set of pretty efficient schedulers (i have made extensive measurements
> on them, see links to papers on my research page
> http://info.iet.unipi.it/~luigi/research.html ).
> They are by no means a performance bottleneck (scheduling takes
> 50..200ns depending on the circumstances) in the cases where
> it matters to have a scheduler (which is, when the sender is
> faster than the NIC, which in turn only happens with large packets
> which take 1..30us to get through at the very least).

Thanks for the information.

> --- IMPLEMENTATION ---
>
> Apart from ALTQ (which is very slow and has inefficient schedulers
> and i don't think anybody wants to maintain), and with the exception
> of dummynet which I'll discuss later, at the moment FreeBSD does not
> support schedulers in the tx path of the device driver.

I haven't really dug into ALTQ/dummynet yet, however from looking it
over you seem to be very much right.

The basis for a fresh generic QoS implementation should be dummynet
(developed in parallel to keep the existing one intact).

> So we can only deal with cases 1 and 2, and for them the software
> queue + ring suffices to implement any QUEUE MANAGEMENT policy
> (but we don't implement anything).
>
> If we want to support the generic case (#3), we should do the following:
>
> 1. device drivers export a function to transmit on an individual ring,
>    basically the current if_transmit(), and a hook to play with the
>    corresponding queue lock (the scheduler needs to run under lock,
>    and we can as well use the ring lock for that).
>    Note that the ether_output_frame does not always need to
>    call the scheduler: if a packet enters a non-empty queue, we are done.

OK.
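
Something like a per-ring handle the driver hands up to the stack, say
(sketch, all names made up; struct mtx only referenced by pointer):

struct mbuf;
struct mtx;

/*
 * What a driver could export per TX ring: a transmit function
 * (basically today's if_transmit() scoped to one ring), the ring
 * lock so the scheduler can run under it, and advisory status.
 */
struct txring_handle {
        int             (*tr_transmit)(struct txring_handle *,
                            struct mbuf *);
        struct mtx      *tr_lock;       /* scheduler runs under this   */
        int               tr_ndesc;     /* ring size (advisory)        */
        int               tr_avail;     /* free descriptors (advisory) */
        void             *tr_softc;     /* driver private              */
};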

> 2. device drivers also export the number of tx queues, and
>    some (advisory) information on queue status

OK.

> 3. ether_output_frame() runs the classifier (if needed), invokes
>    the scheduler (if needed) and possibly falls through into if_transmit()
>    for the specific ring.

OK.
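
So the hot path is just: classify, enqueue, and only kick the scheduler
if the queue was empty.  Sketch with made-up names, reusing the handle
from above:

struct mbuf;
struct softq;
struct txring_handle;

unsigned int    classify(struct mbuf *);                /* -> queue index */
struct softq   *queue_lookup(unsigned int);
/* Returns nonzero if the queue was empty before the insertion. */
int             softq_enqueue(struct softq *, struct mbuf *);
void            sched_run(struct softq *, struct txring_handle *);

static int
output_frame(struct mbuf *m, struct txring_handle *tr)
{
        struct softq *q = queue_lookup(classify(m));

        /*
         * Packet entering a non-empty queue: nothing more to do,
         * a later dequeue will pick it up.
         */
        if (softq_enqueue(q, m))
                sched_run(q, tr);
        return (0);
}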

> 4. on transmit completions (*_txeof(), typically), a callback invokes
>    the scheduler to feed the NIC ring with more packets

Ack.
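
That is roughly this at the end of *_txeof() (sketch, names made up):

struct txring_handle;                   /* from the sketch above */

int     txring_avail(struct txring_handle *);
int     txring_size(struct txring_handle *);
void    sched_refill(struct txring_handle *);   /* scheduler pulls more */

static void
txeof_notify(struct txring_handle *tr)
{
        /* Only bother the scheduler when the ring runs half empty. */
        if (txring_avail(tr) >= txring_size(tr) / 2)
                sched_refill(tr);
}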

> I mentioned dummynet: it already implements ALL of this,
> including the completion callback in #4. There is a hook
> in ether_output_frame(), and the hook was called (up to 8.0
> i believe) if_tx_rdy(). You can see what it does in
> RELENG_4, sys/netinet/ip_dummynet.c :: if_tx_rdy()
>
> http://svnweb.freebsd.org/base/stable/4/sys/netinet/ip_dummynet.c?revision=123994&view=markup
>
> if_tx_rdy() does not exist anymore because almost nobody used it,
> but it is trivial to reimplement, and can be called by device drivers
> when *_txeof() finds that it is running low on packets _and_ the
> specific NIC needs to implement the "complete" scheduling.

Yup.

> The way it worked in dummynet (I think i used it on 'tun' and 'ed')
> is also documented in the manpage:
> define a pipe whose bandwidth is set as the device name instead
> of a number. Then you can attach a scheduler to the pipe, queues
> to the scheduler, and you are done.  Example:
>
>      // this is the scheduler's configuration
> 	ipfw pipe 10 config bw 'em2' sched
> 	ipfw sched 10 config type drr // deficit round robin
> 	ipfw queue 1 config weight 30 sched 10 // important
> 	ipfw queue 2 config weight 5 sched 10 // less important
> 	ipfw queue 3 config weight 1 sched 10 // who cares...
>
>      // and this is the classifier, which you can skip if the
>      // packets are already pre-classified.
>      // The infrastructure is already there to implement per-interface
>      // configurations.
> 	ipfw add queue 1 src-port 53
> 	ipfw add queue 2 src-port 22
> 	ipfw add queue 2 ip from any to any
>
> Now, surely we can replace the implementation of packet queues in dummynet
> from the current TAILQ to something resembling buf_ring to improve
> write parallelism; and a bit of glue code is needed to attach
> per-interface ipfw instances to each interface, and some smarts in
> the configuration commands are needed to figure out when we can
> bypass everything or not.

I'll experiment with variations thereof.

> But this seems to me a much more viable approach to achieve proper QoS
> support in our architecture.

Indeed.  Let me get some code and prototypes going in the next weeks
and then pick up the discussion from there again.

-- 
Andre



