Date: Wed, 30 Oct 2013 06:00:56 +0100
From: Luigi Rizzo <rizzo@iet.unipi.it>
To: Adrian Chadd <adrian@freebsd.org>, Andre Oppermann <andre@freebsd.org>, Navdeep Parhar <np@freebsd.org>, Randall Stewart <rrs@lakerest.net>, "freebsd-net@freebsd.org" <net@freebsd.org>
Subject: [long] Network stack -> NIC flow (was Re: MQ Patch.)
Message-ID: <20131030050056.GA84368@onelab2.iet.unipi.it>
In-Reply-To: <CAJ-Vmo=6thETmTDrPYRjwNTEQaTWkSTKdRYN3eRw5xVhsvr5RQ@mail.gmail.com>
References: <40948D79-E890-4360-A3F2-BEC34A389C7E@lakerest.net> <526FFED9.1070704@freebsd.org> <CA+hQ2+gTc87M0f5pvFeW_GCZDogrLkT_1S2bKHngNcDEBUeZYQ@mail.gmail.com> <52701D8B.8050907@freebsd.org> <527022AC.4030502@FreeBSD.org> <527027CE.5040806@freebsd.org> <5270309E.5090403@FreeBSD.org> <5270462B.8050305@freebsd.org> <CAJ-Vmo=6thETmTDrPYRjwNTEQaTWkSTKdRYN3eRw5xVhsvr5RQ@mail.gmail.com>
On Tue, Oct 29, 2013 at 06:43:21PM -0700, Adrian Chadd wrote:
> Hi,
>
> We can't assume the hardware has deep queues _and_ we can't just hand
> packets to the DMA engine.
> [Adrian explains why]

I have the feeling that the various folks who stepped into this
discussion have completely different (and orthogonal) goals, and as
such these goals should be discussed separately. Below is the
architecture I have in mind and how I would implement it (and it would
be extremely simple, since we have most of the pieces in place).
It would be useful if people could state which problem they are
addressing before coming up with patches.

---

The architecture I think we should pursue is this (which happens to be
what Linux implements, and also what dummynet implements, except that
dummynet's output goes to a dummynet pipe or to ether_output() or to
ip_output() depending on the configuration):

1. multiple (one per core) concurrent transmitters t_c, which use
   ether_output_frame() to send to

2. multiple disjoint queues Q_i (one per traffic group; there can be
   *a lot*, say 10^6), which are scheduled with a scheduler S
   (iterate step 2 for hierarchical schedulers) and

3. eventually feed ONE transmit ring R_j on the NIC.

Once a packet reaches R_j, for all practical purposes it is on the
wire: we cannot intercept extractions, and we cannot interfere with
the scheduler in the NIC in the case of multiqueue NICs. The most we
can do (and should, as in Linux) is notify the owner of the packet
once its transmission is complete.

Just to set the terminology:

QUEUE MANAGEMENT POLICY refers to decisions that we make when we
INSERT or EXTRACT packets from a queue WITHOUT LOOKING AT OTHER
QUEUES. This is what implements DROPTAIL (also improperly called
FIFO), RED, CODEL. Note that for CODEL you need to intercept
extractions from the queue, whereas DROPTAIL and RED only act on
insertions.

SCHEDULER is the entity which decides which queue to serve among the
many possible ones. It is called on INSERTIONS and EXTRACTIONS from a
queue, and passes packets to the NIC's queue.

The decision on which queue and ring (Q_i and R_j) to use should be
made by a classifier at the beginning of step 2 (or once per
iteration, if using a hierarchical scheduler). Of course the decisions
can be precomputed (e.g. with annotations in the mbuf coming from the
socket).

Now, when it comes to implementing the above, we have three cases (or
different optimization levels, if you like).

-- 1. THE SIMPLE CASE ---

In the simplest possible case we can let the NIC do everything.
Necessary conditions are:

- queue management policies acting only on insertions (e.g. DROPTAIL
  or RED or similar);
- # of traffic classes <= # of NIC rings;
- scheduling policy S equal to the one implemented in the NIC
  (trivial case: one queue, one ring, no scheduler).

All these cases match exactly what the hardware provides, so we can
just use the NIC ring(s) without extra queue(s), and possibly use
something like buf_ring to manage insertions (but note that insertion
into an empty queue will end up requiring a lock; and I think the same
happens even now with the extra drbr queue in front of the ring).

-- 2. THE INTERMEDIATE CASE ---

If we do not care about a scheduler but want a more complex QUEUE
MANAGEMENT policy, such as CODEL, that acts on extractions, we _must_
implement an intermediate queue Q_i before the NIC ring. This is our
only chance to act on extractions from the queue (which CODEL
requires). Note that we DO NOT NEED to create multiple queues for
each ring.
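To make the insertion-vs-extraction distinction concrete, here is a
minimal sketch of such an intermediate queue in plain C (not kernel
code; all names here, swq, swq_enqueue, swq_dequeue, target_ns, are
made up for illustration): the insertion side applies a DROPTAIL-style
drop, and the extraction side is the hook that a bare NIC ring cannot
give us, where a CODEL-like decision would run.

	#include <stdint.h>
	#include <stdlib.h>
	#include <time.h>

	struct pkt {			/* stand-in for an mbuf */
		void	 *data;
		uint64_t  enq_ns;	/* timestamp taken at enqueue */
	};

	struct swq {
		struct pkt **slots;	/* circular buffer of packet pointers */
		unsigned     size, head, tail, count;
		uint64_t     target_ns;	/* max acceptable time spent in the queue */
	};

	static uint64_t
	now_ns(void)
	{
		struct timespec ts;

		clock_gettime(CLOCK_MONOTONIC, &ts);
		return ((uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec);
	}

	/* Insertion-side policy: plain DROPTAIL.  RED would also live here. */
	static int
	swq_enqueue(struct swq *q, struct pkt *p)
	{
		if (q->count == q->size)
			return (-1);		/* queue full: drop on insertion */
		p->enq_ns = now_ns();
		q->slots[q->tail] = p;
		q->tail = (q->tail + 1) % q->size;
		q->count++;
		return (0);
	}

	/*
	 * Extraction-side policy: this hook is what the NIC ring alone
	 * cannot provide.  A CODEL-like policy looks at how long the
	 * packet sat in the queue and may drop it here instead of
	 * handing it to the ring; this sketch simply drops anything
	 * older than target_ns.
	 */
	static struct pkt *
	swq_dequeue(struct swq *q)
	{
		struct pkt *p;

		while (q->count > 0) {
			p = q->slots[q->head];
			q->head = (q->head + 1) % q->size;
			q->count--;
			if (now_ns() - p->enq_ns > q->target_ns) {
				free(p);	/* stale: drop on extraction */
				continue;
			}
			return (p);		/* this one goes to the NIC ring */
		}
		return (NULL);
	}

The real code would of course operate on mbufs under the ring lock and
implement the actual CODEL control law rather than a fixed sojourn-time
threshold; the sketch only shows where the two hooks have to sit.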
-- 3. THE COMPLETE CASE ---

This is when the scheduler we want (DRR, WFQ variants, PRIORITY...) is
not implemented in the NIC, or we have more queues than those
available in the NIC. In this case we need to invoke this extra
scheduling block before passing packets to the NIC.

Remember that dummynet implements exactly #3, and it comes with a set
of pretty efficient schedulers (I have made extensive measurements on
them; see the links to papers on my research page,
http://info.iet.unipi.it/~luigi/research.html ). They are by no means
a performance bottleneck (scheduling takes 50..200ns depending on the
circumstances) in the cases where it matters to have a scheduler,
which is when the sender is faster than the NIC, which in turn only
happens with large packets that take at the very least 1..30us to get
through.

--- IMPLEMENTATION ---

Apart from ALTQ (which is very slow, has inefficient schedulers, and I
don't think anybody wants to maintain), and with the exception of
dummynet, which I'll discuss later, at the moment FreeBSD does not
support schedulers in the tx path of the device driver. So we can only
deal with cases 1 and 2, and for them the software queue + ring
suffices to implement any QUEUE MANAGEMENT policy (but we don't
implement anything).

If we want to support the generic case (#3), we should do the
following (a rough sketch of these hooks follows the list):

1. device drivers export a function to transmit on an individual ring,
   basically the current if_transmit(), and a hook to play with the
   corresponding queue lock (the scheduler needs to run under a lock,
   and we may as well use the ring lock for that). Note that
   ether_output_frame() does not always need to call the scheduler: if
   a packet enters a non-empty queue, we are done;

2. device drivers also export the number of tx queues, and some
   (advisory) information on queue status;

3. ether_output_frame() runs the classifier (if needed), invokes the
   scheduler (if needed) and possibly falls through into if_transmit()
   for the specific ring;

4. on transmit completions (*_txeof(), typically), a callback invokes
   the scheduler to feed the NIC ring with more packets.
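Purely to illustrate how these pieces would fit together, here is a
rough C sketch; every name in it (nic_txring_ops, sched_next_packet,
ring_refill) is made up, not an existing or proposed KPI:

	/*
	 * Sketch only: made-up names, just to show how points 1, 2 and 4
	 * above fit together.
	 */
	#include <stddef.h>

	struct ifnet;			/* opaque, as in the kernel */
	struct mbuf;

	/* points 1 and 2: what a driver would export, per interface */
	struct nic_txring_ops {
		int	(*ring_transmit)(struct ifnet *, int ring, struct mbuf *);
		void	(*ring_lock)(struct ifnet *, int ring);	/* scheduler runs under this */
		void	(*ring_unlock)(struct ifnet *, int ring);
		int	(*ring_count)(struct ifnet *);		/* number of tx rings */
		int	(*ring_avail)(struct ifnet *, int ring); /* advisory free slots */
	};

	/* the software scheduler (DRR, WFQ, ...) sitting above the driver */
	struct mbuf *sched_next_packet(struct ifnet *, int ring);

	/*
	 * point 4: called from *_txeof() when the ring runs low, much
	 * like dummynet's old if_tx_rdy() hook (discussed below): drain
	 * the scheduler into the ring until the ring fills up or the
	 * scheduler has nothing more to send.
	 */
	static void
	ring_refill(struct ifnet *ifp, int ring, const struct nic_txring_ops *ops)
	{
		struct mbuf *m;

		ops->ring_lock(ifp, ring);
		while (ops->ring_avail(ifp, ring) > 0 &&
		    (m = sched_next_packet(ifp, ring)) != NULL) {
			if (ops->ring_transmit(ifp, ring, m) != 0)
				break;		/* ring filled up under us */
		}
		ops->ring_unlock(ifp, ring);
	}

The lock hook from point 1 is what lets the scheduler state and the
ring share one lock, so the completion path can refill the ring
without taking an extra queue lock.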
I mentioned dummynet: it already implements ALL of this, including the
completion callback in #4. There is a hook in ether_output_frame(),
and the hook was called (up to 8.0, I believe) if_tx_rdy(). You can
see what it does in RELENG_4, sys/netinet/ip_dummynet.c :: if_tx_rdy():

http://svnweb.freebsd.org/base/stable/4/sys/netinet/ip_dummynet.c?revision=123994&view=markup

if_tx_rdy() does not exist anymore because almost nobody used it, but
it is trivial to reimplement, and can be called by device drivers when
*_txeof() finds that it is running low on packets _and_ the specific
NIC needs to implement the "complete" scheduling.

The way it worked in dummynet (I think I used it on 'tun' and 'ed') is
also documented in the manpage: define a pipe whose bandwidth is set
as the device name instead of a number. Then you can attach a
scheduler to the pipe, queues to the scheduler, and you are done.
Example:

	// this is the scheduler's configuration
	ipfw pipe 10 config bw 'em2' sched
	ipfw sched 10 config type drr		// deficit round robin
	ipfw queue 1 config weight 30 sched 10	// important
	ipfw queue 2 config weight 5 sched 10	// less important
	ipfw queue 3 config weight 1 sched 10	// who cares...

	// and this is the classifier, which you can skip if the
	// packets are already pre-classified.
	// The infrastructure is already there to implement
	// per-interface configurations.
	ipfw add queue 1 src-port 53
	ipfw add queue 2 src-port 22
	ipfw add queue 2 ip from any to any

Now, surely we can replace the implementation of packet queues in
dummynet, from the current TAILQ to something resembling buf_ring, to
improve write parallelism; a bit of glue code is needed to attach
per-interface ipfw instances to each interface, and some smarts are
needed in the configuration commands to figure out when we can bypass
everything or not. But this seems to me a much more viable approach to
achieve proper QoS support in our architecture.

cheers
luigi