From owner-svn-src-user@FreeBSD.ORG Thu Jan 7 14:05:41 2010
From: Luigi Rizzo
Date: Thu, 7 Jan 2010 14:05:41 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-user@freebsd.org
Subject: svn commit: r201746 - user/luigi/ipfw3-head/sys/netinet/ipfw

Author: luigi
Date: Thu Jan 7 14:05:40 2010
New Revision: 201746
URL: http://svn.freebsd.org/changeset/base/201746

Log:
  snapshot -- split I/O from configuration functions.

Added:
  user/luigi/ipfw3-head/sys/netinet/ipfw/ip_dn_io.c   (contents, props changed)
Modified:
  user/luigi/ipfw3-head/sys/netinet/ipfw/dummynet.txt
  user/luigi/ipfw3-head/sys/netinet/ipfw/ip_dummynet.c

Modified: user/luigi/ipfw3-head/sys/netinet/ipfw/dummynet.txt
==============================================================================
--- user/luigi/ipfw3-head/sys/netinet/ipfw/dummynet.txt Thu Jan 7 13:53:47 2010 (r201745)
+++ user/luigi/ipfw3-head/sys/netinet/ipfw/dummynet.txt Thu Jan 7 14:05:40 2010 (r201746)
@@ -131,15 +131,32 @@ can use its own struct to store its queu
 Three global data structures (hash tables) contain all
 pipes, schedulers and flowsets.
 - pipehash[x]: contains all pipes in the system
+   not needed to be efficient - we never do a lookup
+   in a critical section
+ - schedulerhash[x]: contains all scheduler templates in the system
+   this needs to be a hash table because we may have to do
+   a lookup upon arrival of a packet.
+   We have one entry per 'sched X config' command
+   (plus one for each 'pipe X config').
+ - flowsethash[x]: contains all flowset linked with a scheduler (or pipe).
+   we do a lookup on this for each packet.
+   We have one entry for each 'queue X config'
+   (plus one for each 'pipe X config').
+ Additionally, a list that contains all unlinked flowset:
- unlinkedflowset: contains flowset that are not linked with any scheduler
-flowset are put in this list when they refer to a non
-existing scheduler or pipe.
+   flowset are put in this list when they refer to a non
+   existing scheduler or pipe.
+   We keep them out of the main hash table because we do not
+   want to send packets to those flowsets.
+   We don't need an efficient data structure as we never search
+   here on a packet arrival -- at most we put here a flowset
+   for which a scheduler does not exist anymore.
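+
+   As an example (illustrative): 'pipe 10 config bw 1Mbit/s' creates
+   one entry in pipehash, one scheduler template in schedulerhash and
+   one flowset in flowsethash; a later 'queue 5 config pipe 10' adds
+   a second flowsethash entry, while 'queue 6 config pipe 77' (with
+   pipe 77 not yet configured) would leave flowset 6 on the unlinked
+   list until pipe 77 is created.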
-Scheduler instances and the delay lines associated with pipes -need to be woken up at certain times. Because we have many +Scheduler instances and the delay lines associated with each scheduler +instance need to be woken up at certain times. Because we have many such objects, we keep them in a priority heap (system_heap). Almost all objects in this implementation are preceded Added: user/luigi/ipfw3-head/sys/netinet/ipfw/ip_dn_io.c ============================================================================== --- /dev/null 00:00:00 1970 (empty, because file is newly added) +++ user/luigi/ipfw3-head/sys/netinet/ipfw/ip_dn_io.c Thu Jan 7 14:05:40 2010 (r201746) @@ -0,0 +1,1327 @@ +/*- + * Copyright (c) 2010 Luigi Rizzo, Universita` di Pisa + * All rights reserved + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ + +/* + * Dummynet portions related to packet handling. + */ +#include +__FBSDID("$FreeBSD$"); + +#define DUMMYNET_DEBUG + +#include "opt_inet6.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +//#include +#include +#include +//#include +#include /* IFNAMSIZ, struct ifaddr, ifq head, lock.h mutex.h */ +#include +#include +#include /* ip_len, ip_off */ +#include /* ip_output(), IP_FORWARDING */ +#include +#include +#include +#include +#include + +#include /* various ether_* routines */ + +#include /* for ip6_input, ip6_output prototypes */ +#include + +/* + * We keep a private variable for the simulation time, but we could + * probably use an existing one ("softticks" in sys/kern/kern_timeout.c) + */ +static dn_key curr_time = 0 ; /* current simulation time */ + +/* statistics on number of queue searches and search steps */ +static long searches, search_steps ; +static int pipe_expire = 1 ; /* expire queue if empty */ +static int dn_max_ratio = 16 ; /* max queues/buckets ratio */ + +struct dn_parms dn_cfg = { + .pipe_slot_limit = 100, /* Foot shooting limit for pipe queues. 
*/ + .pipe_byte_limit = 1024 * 1024, + + .hash_size = 64, /* default hash size */ + .red_lookup_depth = 256, /* RED - default lookup table depth */ + .red_avg_pkt_size = 512, /* RED - default medium packet size */ + .red_max_pkt_size = 1500, /* RED - default max packet size */ +}; + +//static struct timeval t; +static long tick_last; /* Last tick duration (usec). */ +static long tick_delta; /* Last vs standard tick diff (usec). */ +static long tick_delta_sum; /* Accumulated tick difference (usec).*/ +static long tick_adjustment; /* Tick adjustments done. */ +static long tick_lost; /* Lost(coalesced) ticks number. */ +/* Adjusted vs non-adjusted curr_time difference (ticks). */ +static long tick_diff; + +static unsigned long io_pkt; +static unsigned long io_pkt_fast; +static unsigned long io_pkt_drop; + +/* + * Three heaps contain queues and pipes that the scheduler handles: + * + * ready_heap contains all dn_flow_queue related to fixed-rate pipes. + * + * wfq_ready_heap contains the pipes associated with WF2Q flows + * + * extract_heap contains pipes associated with delay lines. + * + * The key for the heap is used for two different values: + * + * 1. timer ticks- max 10K/second, so 32 bits are enough; + * + * 2. virtual times. These increase in steps of len/x, where len is the + * packet length, and x is either the weight of the flow, or the + * sum of all weights. + * If we limit to max 1000 flows and a max weight of 100, then + * x needs 17 bits. The packet size is 16 bits, so we can easily + * overflow if we do not allow errors. + * So we use a key "dn_key" which is 64 bits. Some macros are used to + * compare key values and handle wraparounds. + * MAX64 returns the largest of two key values. + * MY_M is used as a shift count when doing fixed point arithmetic + * (a better name would be useful...). + */ +#define MAX64(x,y) (( (int64_t) ( (y)-(x) )) > 0 ) ? (y) : (x) +#define MY_M 16 /* number of left shift to obtain a larger precision */ + +/* + * XXX With this scaling, max 1000 flows, max weight 100, 1Gbit/s, the + * virtual time wraps every 15 days. 
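+ * As a sanity check on the wraparound handling (illustrative values):
+ * MAX64(0xffffffffffffffffULL, 2) yields 2, because the signed
+ * difference (int64_t)(2 - 0xffffffffffffffffULL) == 3 is positive,
+ * so key comparisons keep working across the 64-bit wrap.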
+ */ + +MALLOC_DEFINE(M_DUMMYNET, "dummynet", "dummynet heap"); + +static void transmit_event(struct dn_pipe *pipe, struct mbuf **head, + struct mbuf **tail); +static void ready_event(struct dn_flow_queue *q, struct mbuf **head, + struct mbuf **tail); +static void ready_event_wfq(struct dn_pipe *p, struct mbuf **head, + struct mbuf **tail); + +struct dn_heap ready_heap, extract_heap, wfq_ready_heap ; +struct dn_pipe_head pipehash[DN_HASHSIZE]; /* all pipes */ +struct dn_flow_set_head flowsethash[DN_HASHSIZE]; /* all flowsets */ + +extern void (*bridge_dn_p)(struct mbuf *, struct ifnet *); + +#ifdef SYSCTL_NODE +SYSCTL_DECL(_net_inet); +SYSCTL_DECL(_net_inet_ip); + +SYSCTL_NODE(_net_inet_ip, OID_AUTO, dummynet, CTLFLAG_RW, 0, "Dummynet"); +SYSCTL_INT(_net_inet_ip_dummynet, OID_AUTO, hash_size, + CTLFLAG_RW, &dn_cfg.hash_size, 0, "Default hash table size"); +#if 0 /* curr_time is 64 bit */ +SYSCTL_LONG(_net_inet_ip_dummynet, OID_AUTO, curr_time, + CTLFLAG_RD, &curr_time, 0, "Current tick"); +#endif +SYSCTL_INT(_net_inet_ip_dummynet, OID_AUTO, ready_heap, + CTLFLAG_RD, &ready_heap.size, 0, "Size of ready heap"); +SYSCTL_INT(_net_inet_ip_dummynet, OID_AUTO, extract_heap, + CTLFLAG_RD, &extract_heap.size, 0, "Size of extract heap"); +SYSCTL_LONG(_net_inet_ip_dummynet, OID_AUTO, searches, + CTLFLAG_RD, &searches, 0, "Number of queue searches"); +SYSCTL_LONG(_net_inet_ip_dummynet, OID_AUTO, search_steps, + CTLFLAG_RD, &search_steps, 0, "Number of queue search steps"); +SYSCTL_INT(_net_inet_ip_dummynet, OID_AUTO, expire, + CTLFLAG_RW, &pipe_expire, 0, "Expire queue if empty"); +SYSCTL_INT(_net_inet_ip_dummynet, OID_AUTO, max_chain_len, + CTLFLAG_RW, &dn_max_ratio, 0, + "Max ratio between dynamic queues and buckets"); +SYSCTL_INT(_net_inet_ip_dummynet, OID_AUTO, red_lookup_depth, + CTLFLAG_RD, &dn_cfg.red_lookup_depth, 0, "Depth of RED lookup table"); +SYSCTL_INT(_net_inet_ip_dummynet, OID_AUTO, red_avg_pkt_size, + CTLFLAG_RD, &dn_cfg.red_avg_pkt_size, 0, "RED Medium packet size"); +SYSCTL_INT(_net_inet_ip_dummynet, OID_AUTO, red_max_pkt_size, + CTLFLAG_RD, &dn_cfg.red_max_pkt_size, 0, "RED Max packet size"); +SYSCTL_LONG(_net_inet_ip_dummynet, OID_AUTO, tick_delta, + CTLFLAG_RD, &tick_delta, 0, "Last vs standard tick difference (usec)."); +SYSCTL_LONG(_net_inet_ip_dummynet, OID_AUTO, tick_delta_sum, + CTLFLAG_RD, &tick_delta_sum, 0, "Accumulated tick difference (usec)."); +SYSCTL_LONG(_net_inet_ip_dummynet, OID_AUTO, tick_adjustment, + CTLFLAG_RD, &tick_adjustment, 0, "Tick adjustments done."); +SYSCTL_LONG(_net_inet_ip_dummynet, OID_AUTO, tick_diff, + CTLFLAG_RD, &tick_diff, 0, + "Adjusted vs non-adjusted curr_time difference (ticks)."); +SYSCTL_LONG(_net_inet_ip_dummynet, OID_AUTO, tick_lost, + CTLFLAG_RD, &tick_lost, 0, + "Number of ticks coalesced by dummynet taskqueue."); +SYSCTL_INT(_net_inet_ip_dummynet, OID_AUTO, io_fast, + CTLFLAG_RW, &dn_cfg.io_fast, 0, "Enable fast dummynet io."); +SYSCTL_ULONG(_net_inet_ip_dummynet, OID_AUTO, io_pkt, + CTLFLAG_RD, &io_pkt, 0, + "Number of packets passed to dummynet."); +SYSCTL_ULONG(_net_inet_ip_dummynet, OID_AUTO, io_pkt_fast, + CTLFLAG_RD, &io_pkt_fast, 0, + "Number of packets bypassed dummynet scheduler."); +SYSCTL_ULONG(_net_inet_ip_dummynet, OID_AUTO, io_pkt_drop, + CTLFLAG_RD, &io_pkt_drop, 0, + "Number of packets dropped by dummynet."); +SYSCTL_LONG(_net_inet_ip_dummynet, OID_AUTO, pipe_slot_limit, + CTLFLAG_RW, &dn_cfg.pipe_slot_limit, 0, "Upper limit in slots for pipe queue."); +SYSCTL_LONG(_net_inet_ip_dummynet, OID_AUTO, pipe_byte_limit, + 
CTLFLAG_RW, &dn_cfg.pipe_byte_limit, 0, "Upper limit in bytes for pipe queue.");
+#endif
+
+#ifdef DUMMYNET_DEBUG
+int dummynet_debug = 0;
+#ifdef SYSCTL_NODE
+SYSCTL_INT(_net_inet_ip_dummynet, OID_AUTO, debug, CTLFLAG_RW, &dummynet_debug,
+    0, "control debugging printfs");
+#endif
+#define DPRINTF(X) if (dummynet_debug) printf X
+#else
+#define DPRINTF(X)
+#endif
+
+struct mtx dummynet_mtx;
+
+static void dummynet_send(struct mbuf *);
+
+/*
+ * Flow queue is idle if:
+ *    1) it's empty for at least 1 tick
+ *    2) it has invalid timestamp (WF2Q case)
+ *    3) parent pipe has no 'exhausted' burst.
+ */
+#define QUEUE_IS_IDLE(q) ((q)->head == NULL && (q)->S == (q)->F + 1 && \
+    curr_time > (q)->idle_time + 1 && \
+    ((q)->numbytes + (curr_time - (q)->idle_time - 1) * \
+    (q)->fs->pipe->bandwidth >= (q)->fs->pipe->burst))
+
+/*
+ * Packets processed by dummynet have an mbuf tag associated with
+ * them that carries their dummynet state. This is used within
+ * the dummynet code as well as outside when checking for special
+ * processing requirements.
+ * Note that the first part is the reinject info and is common to
+ * other forms of packet reinjection.
+ */
+struct dn_pkt_tag {
+    struct ipfw_rule_ref rule;    /* matching rule */
+
+    /* second part, dummynet specific */
+    int dn_dir;            /* action when packet comes out. */
+                           /* see ip_fw_private.h */
+    dn_key output_time;    /* when the pkt is due for delivery */
+    struct ifnet *ifp;     /* interface, for ip_output */
+    struct _ip6dn_args ip6opt;    /* XXX ipv6 options */
+};
+
+/*
+ * Return the mbuf tag holding the dummynet state. As an optimization
+ * this is assumed to be the first tag on the list. If this turns out
+ * wrong we'll need to search the list.
+ */
+static struct dn_pkt_tag *
+dn_tag_get(struct mbuf *m)
+{
+    struct m_tag *mtag = m_tag_first(m);
+    KASSERT(mtag != NULL &&
+        mtag->m_tag_cookie == MTAG_ABI_COMPAT &&
+        mtag->m_tag_id == PACKET_TAG_DUMMYNET,
+        ("packet on dummynet queue w/o dummynet tag!"));
+    return (struct dn_pkt_tag *)(mtag+1);
+}
+
+/*
+ * Scheduler functions:
+ *
+ * transmit_event() is called when the delay-line needs to enter
+ * the scheduler, either because of existing pkts getting ready,
+ * or new packets entering the queue. The event handled is the delivery
+ * time of the packet.
+ *
+ * ready_event() does something similar with fixed-rate queues, and the
+ * event handled is the finish time of the head pkt.
+ *
+ * ready_event_wfq() does something similar with WF2Q queues, and the
+ * event handled is the start time of the head pkt.
+ *
+ * In all cases, we make sure that the data structures are consistent
+ * before passing pkts out, because this might trigger recursive
+ * invocations of the procedures.
+ */
+static void
+transmit_event(struct dn_pipe *pipe, struct mbuf **head, struct mbuf **tail)
+{
+    struct mbuf *m;
+    struct dn_pkt_tag *pkt;
+
+    DUMMYNET_LOCK_ASSERT();
+
+    while ((m = pipe->head) != NULL) {
+        pkt = dn_tag_get(m);
+        if (!DN_KEY_LEQ(pkt->output_time, curr_time))
+            break;
+
+        pipe->head = m->m_nextpkt;
+        if (*tail != NULL)
+            (*tail)->m_nextpkt = m;
+        else
+            *head = m;
+        *tail = m;
+    }
+    if (*tail != NULL)
+        (*tail)->m_nextpkt = NULL;
+
+    /* If there are leftover packets, put into the heap for next event. */
+    if ((m = pipe->head) != NULL) {
+        pkt = dn_tag_get(m);
+        /*
+         * XXX Should check errors on heap_insert, by draining the
+         * whole pipe p and hoping in the future we are more successful.
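+         * (heap_insert() presumably fails only when it must grow the
+         * heap and that allocation fails, so errors here are rare.)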
+         */
+        heap_insert(&extract_heap, pkt->output_time, pipe);
+    }
+}
+
+#define div64(a, b) ((int64_t)(a) / (int64_t)(b))
+/*
+ * Compute how many ticks we have to wait before being able to send
+ * a packet. This is computed as the "wire time" for the packet
+ * (length + extra bits), minus the credit available, scaled to ticks.
+ * Check that the result is not negative (it could be if we have
+ * too much leftover credit in q->numbytes).
+ */
+static inline dn_key
+set_ticks(struct mbuf *m, struct dn_flow_queue *q, struct dn_pipe *p)
+{
+    int64_t ret;
+
+    ret = div64( (m->m_pkthdr.len * 8 + q->extra_bits) * hz
+        - q->numbytes + p->bandwidth - 1 , p->bandwidth);
+    if (ret < 0)
+        ret = 0;
+    return ret;
+}
+
+/*
+ * Convert the additional MAC overheads/delays into an equivalent
+ * number of bits for the given data rate. The samples are in milliseconds
+ * so we need to divide by 1000.
+ */
+static dn_key
+compute_extra_bits(struct mbuf *pkt, struct dn_pipe *p)
+{
+    int index;
+    dn_key extra_bits;
+
+    if (!p->samples || p->samples_no == 0)
+        return 0;
+    index = random() % p->samples_no;
+    extra_bits = div64((dn_key)p->samples[index] * p->bandwidth, 1000);
+    if (index >= p->loss_level) {
+        struct dn_pkt_tag *dt = dn_tag_get(pkt);
+        if (dt)
+            dt->dn_dir = DIR_DROP;
+    }
+    return extra_bits;
+}
+
+/*
+ * extract pkt from queue, compute output time (could be now)
+ * and put into delay line (p_queue)
+ */
+static void
+move_pkt(struct mbuf *pkt, struct dn_flow_queue *q, struct dn_pipe *p,
+    int len)
+{
+    struct dn_pkt_tag *dt = dn_tag_get(pkt);
+
+    q->head = pkt->m_nextpkt ;
+    q->len-- ;
+    q->len_bytes -= len ;
+
+    dt->output_time = curr_time + p->delay ;
+
+    if (p->head == NULL)
+        p->head = pkt;
+    else
+        p->tail->m_nextpkt = pkt;
+    p->tail = pkt;
+    p->tail->m_nextpkt = NULL;
+}
+
+/*
+ * ready_event() is invoked every time the queue must enter the
+ * scheduler, either because the first packet arrives, or because
+ * a previously scheduled event fired.
+ * On invocation, drain as many pkts as possible (could be 0) and then
+ * if there are leftover packets reinsert the pkt in the scheduler.
+ */
+static void
+ready_event(struct dn_flow_queue *q, struct mbuf **head, struct mbuf **tail)
+{
+    struct mbuf *pkt;
+    struct dn_pipe *p = q->fs->pipe;
+    int p_was_empty;
+
+    DUMMYNET_LOCK_ASSERT();
+
+    if (p == NULL) {
+        printf("dummynet: ready_event- pipe is gone\n");
+        return;
+    }
+    p_was_empty = (p->head == NULL);
+
+    /*
+     * Schedule fixed-rate queues linked to this pipe:
+     * account for the bw accumulated since last scheduling, then
+     * drain as many pkts as allowed by q->numbytes and move to
+     * the delay line (in p) computing output time.
+     * bandwidth==0 (no limit) means we can drain the whole queue,
+     * setting len_scaled = 0 does the job.
+     */
+    q->numbytes += (curr_time - q->sched_time) * p->bandwidth;
+    while ((pkt = q->head) != NULL) {
+        int len = pkt->m_pkthdr.len;
+        dn_key len_scaled = p->bandwidth ? len*8*hz
+            + q->extra_bits*hz
+            : 0;
+
+        if (DN_KEY_GT(len_scaled, q->numbytes))
+            break;
+        q->numbytes -= len_scaled;
+        move_pkt(pkt, q, p, len);
+        if (q->head)
+            q->extra_bits = compute_extra_bits(q->head, p);
+    }
+    /*
+     * If we have more packets queued, schedule next ready event
+     * (can only occur when bandwidth != 0, otherwise we would have
+     * flushed the whole queue in the previous loop).
+     * To this purpose we record the current time and compute how many
+     * ticks to go for the finish time of the packet.
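+     * A quick worked example (hypothetical numbers): with hz = 1000,
+     * a 1500-byte packet on a 1 Mbit/s pipe, no extra_bits and no
+     * leftover credit, set_ticks() returns
+     * ceil(1500 * 8 * hz / bandwidth) = 12 ticks, i.e. the 12 ms of
+     * wire time the packet needs at that rate.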
+ */ + if ((pkt = q->head) != NULL) { /* this implies bandwidth != 0 */ + dn_key t = set_ticks(pkt, q, p); /* ticks i have to wait */ + + q->sched_time = curr_time; + heap_insert(&ready_heap, curr_time + t, (void *)q); + /* + * XXX Should check errors on heap_insert, and drain the whole + * queue on error hoping next time we are luckier. + */ + } else /* RED needs to know when the queue becomes empty. */ + q->idle_time = curr_time; + + /* + * If the delay line was empty call transmit_event() now. + * Otherwise, the scheduler will take care of it. + */ + if (p_was_empty) + transmit_event(p, head, tail); +} + +/* callback to clean the idle heap */ +static int +clean_fq(void *_q, uintptr_t arg) +{ + struct dn_flow_queue *q = _q; + + q->F = 0; + q->S = q->F + 1; + return HEAP_SCAN_DEL; +} + +/* + * Called when we can transmit packets on WF2Q queues. Take pkts out of + * the queues at their start time, and enqueue into the delay line. + * Packets are drained until p->numbytes < 0. As long as + * len_scaled >= p->numbytes, the packet goes into the delay line + * with a deadline p->delay. For the last packet, if p->numbytes < 0, + * there is an additional delay. + */ +static void +ready_event_wfq(struct dn_pipe *p, struct mbuf **head, struct mbuf **tail) +{ + int p_was_empty = (p->head == NULL); + struct dn_heap *sch = p->scheduler_heap; + struct dn_heap *neh = p->not_eligible_heap; + int64_t p_numbytes = p->numbytes; + + /* + * p->numbytes is only 32bits in FBSD7, but we might need 64 bits. + * Use a local variable for the computations, and write back the + * results when done, saturating if needed. + * The local variable has no impact on performance and helps + * reducing diffs between the various branches. + */ + + DUMMYNET_LOCK_ASSERT(); + + if (p->if_name[0] == 0) /* tx clock is simulated */ + p_numbytes += (curr_time - p->sched_time) * p->bandwidth; + else { /* + * tx clock is for real, + * the ifq must be empty or this is a NOP. + */ + if (p->ifp && p->ifp->if_snd.ifq_head != NULL) + return; + else { + DPRINTF(("dummynet: pipe %d ready from %s --\n", + p->pipe_nr, p->if_name)); + } + } + + /* + * While we have backlogged traffic AND credit, we need to do + * something on the queue. + */ + while (p_numbytes >= 0 && (sch->elements > 0 || neh->elements > 0)) { + if (sch->elements > 0) { + /* Have some eligible pkts to send out. */ + struct dn_flow_queue *q = HEAP_TOP(sch)->object; + struct mbuf *pkt = q->head; + struct dn_flow_set *fs = q->fs; + uint64_t len = pkt->m_pkthdr.len; + int len_scaled = p->bandwidth ? len * 8 * hz : 0; + + heap_extract(sch, NULL); /* Remove queue from heap. */ + p_numbytes -= len_scaled; + move_pkt(pkt, q, p, len); + + p->V += div64((len << MY_M), p->sum); /* Update V. */ + q->S = q->F; /* Update start time. */ + if (q->len == 0) { + /* Flow not backlogged any more. */ + fs->backlogged--; + heap_insert(p->idle_heap, q->F, q); + } else { + /* Still backlogged. */ + + /* + * Update F and position in backlogged queue, + * then put flow in not_eligible_heap + * (we will fix this later). + */ + len = (q->head)->m_pkthdr.len; + q->F += div64((len << MY_M), fs->weight); + if (DN_KEY_LEQ(q->S, p->V)) + heap_insert(neh, q->S, q); + else + heap_insert(sch, q->F, q); + } + } + /* + * Now compute V = max(V, min(S_i)). Remember that all elements + * in sch have by definition S_i <= V so if sch is not empty, + * V is surely the max and we must not update it. Conversely, + * if sch is empty we only need to look at neh. 
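+         * To see the weighting at work (illustrative numbers): with
+         * two backlogged flows of weights w1 = 1 and w2 = 2 sending
+         * equal-length packets, flow 1's F advances by (len << MY_M)
+         * per packet and flow 2's by half that, so flow 2 is served
+         * about twice as often.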
+         */
+        if (sch->elements == 0 && neh->elements > 0)
+            p->V = MAX64(p->V, HEAP_TOP(neh)->key);
+        /* Move from neh to sch any packets that have become eligible */
+        while (neh->elements > 0 && DN_KEY_LEQ(HEAP_TOP(neh)->key, p->V)) {
+            struct dn_flow_queue *q = HEAP_TOP(neh)->object;
+            heap_extract(neh, NULL);
+            heap_insert(sch, q->F, q);
+        }
+
+        if (p->if_name[0] != '\0') { /* Tx clock is from a real thing */
+            p_numbytes = -1;    /* Mark not ready for I/O. */
+            break;
+        }
+    }
+    if (sch->elements == 0 && neh->elements == 0 && p_numbytes >= 0) {
+        p->idle_time = curr_time;
+        /*
+         * No traffic and no events scheduled.
+         * We can get rid of idle-heap.
+         */
+        if (p->idle_heap->elements > 0) {
+            heap_scan(p->idle_heap, clean_fq, 0);
+            p->sum = 0;
+            p->V = 0;
+            p->idle_heap->elements = 0;
+        }
+    }
+    /*
+     * If we are getting clocks from dummynet (not a real interface) and
+     * if we are under credit, schedule the next ready event.
+     * Also fix the delivery time of the last packet.
+     */
+    if (p->if_name[0]==0 && p_numbytes < 0) { /* This implies bw > 0. */
+        dn_key t = 0;    /* Number of ticks i have to wait. */
+
+        if (p->bandwidth > 0)
+            t = div64(p->bandwidth - 1 - p_numbytes, p->bandwidth);
+        dn_tag_get(p->tail)->output_time += t;
+        p->sched_time = curr_time;
+        heap_insert(&wfq_ready_heap, curr_time + t, (void *)p);
+        /*
+         * XXX Should check errors on heap_insert, and drain the whole
+         * queue on error hoping next time we are luckier.
+         */
+    }
+
+    /* Write back p_numbytes (adjust 64->32bit if necessary). */
+    p->numbytes = p_numbytes;
+
+    /*
+     * If the delay line was empty call transmit_event() now.
+     * Otherwise, the scheduler will take care of it.
+     */
+    if (p_was_empty)
+        transmit_event(p, head, tail);
+}
+
+/*
+ * The timer handler for dummynet. Time is computed in ticks, but
+ * the code is tolerant to the actual rate at which this is called.
+ * Once complete, the function reschedules itself for the next tick.
+ */
+void
+dummynet_task(void *context, int pending)
+{
+    struct mbuf *head = NULL, *tail = NULL;
+    struct dn_pipe *pipe;
+    struct dn_heap *heaps[3];
+    struct dn_heap *h;
+    void *p;    /* generic parameter to handler */
+    int i;
+    struct timeval t;
+
+    DUMMYNET_LOCK();
+
+    heaps[0] = &ready_heap;        /* fixed-rate queues */
+    heaps[1] = &wfq_ready_heap;    /* wfq queues */
+    heaps[2] = &extract_heap;    /* delay line */
+
+    /* Update number of lost(coalesced) ticks. */
+    tick_lost += pending - 1;
+
+    getmicrouptime(&t);
+    /* Last tick duration (usec). */
+    tick_last = (t.tv_sec - dn_cfg.prev_t.tv_sec) * 1000000 +
+        (t.tv_usec - dn_cfg.prev_t.tv_usec);
+    /* Last tick vs standard tick difference (usec). */
+    tick_delta = (tick_last * hz - 1000000) / hz;
+    /* Accumulated tick difference (usec). */
+    tick_delta_sum += tick_delta;
+
+    dn_cfg.prev_t = t;
+
+    /*
+     * Adjust curr_time if the accumulated tick difference is greater
+     * than the 'standard' tick. Since curr_time should be monotonically
+     * increasing, we do positive adjustment as required and throttle
+     * curr_time in case of negative adjustment.
+     */
+    curr_time++;
+    if (tick_delta_sum - tick >= 0) {
+        int diff = tick_delta_sum / tick;
+
+        curr_time += diff;
+        tick_diff += diff;
+        tick_delta_sum %= tick;
+        tick_adjustment++;
+    } else if (tick_delta_sum + tick <= 0) {
+        curr_time--;
+        tick_diff--;
+        tick_delta_sum += tick;
+        tick_adjustment++;
+    }
+
+    for (i = 0; i < 3; i++) {
+        h = heaps[i];
+        while (h->elements > 0 && DN_KEY_LEQ(HEAP_TOP(h)->key, curr_time)) {
+            // XXX can this happen ?
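+            // (presumably only if the 64-bit key wrapped around, where
+            // the plain unsigned '>' below can disagree with
+            // DN_KEY_LEQ()'s signed, wraparound-safe comparison)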
+ if (HEAP_TOP(h)->key > curr_time) + printf("dummynet: warning, " + "heap %d is %d ticks late\n", + i, (int)(curr_time - HEAP_TOP(h)->key)); + /* store a copy before heap_extract */ + p = HEAP_TOP(h)->object; + /* need to extract before processing */ + heap_extract(h, NULL); + if (i == 0) + ready_event(p, &head, &tail); + else if (i == 1) { + struct dn_pipe *pipe = p; + if (pipe->if_name[0] != '\0') + printf("dummynet: bad ready_event_wfq " + "for pipe %s\n", pipe->if_name); + else + ready_event_wfq(p, &head, &tail); + } else + transmit_event(p, &head, &tail); + } + } + + /* Sweep pipes trying to expire idle flow_queues. */ + for (i = 0; i < DN_HASHSIZE; i++) { + SLIST_FOREACH(pipe, &pipehash[i], next) { + if (pipe->idle_heap->elements > 0 && + DN_KEY_LT(HEAP_TOP(pipe->idle_heap)->key, pipe->V)) { + struct dn_flow_queue *q = + HEAP_TOP(pipe->idle_heap)->object; + + heap_extract(pipe->idle_heap, NULL); + /* Mark timestamp as invalid. */ + q->S = q->F + 1; + pipe->sum -= q->fs->weight; + } + } + } + + DUMMYNET_UNLOCK(); + + if (head != NULL) + dummynet_send(head); + + dn_reschedule(); +} + +static void +dummynet_send(struct mbuf *m) +{ + struct mbuf *n; + + for (; m != NULL; m = n) { + struct ifnet *ifp; + int dst; + struct m_tag *tag; + + n = m->m_nextpkt; + m->m_nextpkt = NULL; + tag = m_tag_first(m); + if (tag == NULL) { + dst = DIR_DROP; + } else { + struct dn_pkt_tag *pkt = dn_tag_get(m); + /* extract the dummynet info, rename the tag */ + dst = pkt->dn_dir; + ifp = pkt->ifp; + /* rename the tag so it carries reinject info */ + tag->m_tag_cookie = MTAG_IPFW_RULE; + tag->m_tag_id = 0; + } + + switch (dst) { + case DIR_OUT: + SET_HOST_IPLEN(mtod(m, struct ip *)); + ip_output(m, NULL, NULL, IP_FORWARDING, NULL, NULL); + break ; + case DIR_IN : + /* put header in network format for ip_input() */ + //SET_NET_IPLEN(mtod(m, struct ip *)); + netisr_dispatch(NETISR_IP, m); + break; +#ifdef INET6 + case DIR_IN | PROTO_IPV6: + netisr_dispatch(NETISR_IPV6, m); + break; + + case DIR_OUT | PROTO_IPV6: + SET_HOST_IPLEN(mtod(m, struct ip *)); + ip6_output(m, NULL, NULL, IPV6_FORWARDING, NULL, NULL, NULL); + break; +#endif + case DIR_FWD | PROTO_IFB: /* DN_TO_IFB_FWD: */ + if (bridge_dn_p != NULL) + ((*bridge_dn_p)(m, ifp)); + else + printf("dummynet: if_bridge not loaded\n"); + + break; + case DIR_IN | PROTO_LAYER2: /* DN_TO_ETH_DEMUX: */ + /* + * The Ethernet code assumes the Ethernet header is + * contiguous in the first mbuf header. + * Insure this is true. + */ + if (m->m_len < ETHER_HDR_LEN && + (m = m_pullup(m, ETHER_HDR_LEN)) == NULL) { + printf("dummynet/ether: pullup failed, " + "dropping packet\n"); + break; + } + ether_demux(m->m_pkthdr.rcvif, m); + break; + case DIR_OUT | PROTO_LAYER2: /* N_TO_ETH_OUT: */ + ether_output_frame(ifp, m); + break; + + case DIR_DROP: + /* drop the packet after some time */ + FREE_PKT(m); + break; + + default: + printf("dummynet: bad switch %d!\n", dst); + FREE_PKT(m); + break; + } + } +} + +/* + * Unconditionally expire empty queues in case of shortage. + * Returns the number of queues freed. 
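+ * The fs->last_expired check below rate-limits the sweep to at most
+ * one pass per flow_set per second (time_uptime has 1-second
+ * resolution).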
+ */ +static int +expire_queues(struct dn_flow_set *fs) +{ + struct dn_flow_queue *q, *prev ; + int i, initial_elements = fs->rq_elements ; + + if (fs->last_expired == time_uptime) + return 0 ; + fs->last_expired = time_uptime ; + for (i = 0 ; i <= fs->rq_size ; i++) { /* last one is overflow */ + for (prev=NULL, q = fs->rq[i] ; q != NULL ; ) { + if (!QUEUE_IS_IDLE(q)) { + prev = q ; + q = q->next ; + } else { /* entry is idle, expire it */ + struct dn_flow_queue *old_q = q ; + + if (prev != NULL) + prev->next = q = q->next ; + else + fs->rq[i] = q = q->next ; + fs->rq_elements-- ; + free(old_q, M_DUMMYNET); + } + } + } + return initial_elements - fs->rq_elements ; +} + +/* + * If room, create a new queue and put at head of slot i; + * otherwise, create or use the default queue. + */ +static struct dn_flow_queue * +create_queue(struct dn_flow_set *fs, int i) +{ + struct dn_flow_queue *q; + + if (fs->rq_elements > fs->rq_size * dn_max_ratio && + expire_queues(fs) == 0) { + /* No way to get room, use or create overflow queue. */ + i = fs->rq_size; + if (fs->rq[i] != NULL) + return fs->rq[i]; + } + q = malloc(sizeof(*q), M_DUMMYNET, M_NOWAIT | M_ZERO); + if (q == NULL) { + printf("dummynet: sorry, cannot allocate queue for new flow\n"); + return (NULL); + } + q->fs = fs; + q->hash_slot = i; + q->next = fs->rq[i]; + q->S = q->F + 1; /* hack - mark timestamp as invalid. */ + q->numbytes = fs->pipe->burst + (dn_cfg.io_fast ? fs->pipe->bandwidth : 0); + fs->rq[i] = q; + fs->rq_elements++; + return (q); +} + +/* + * Given a flow_set and a pkt in last_pkt, find a matching queue + * after appropriate masking. The queue is moved to front + * so that further searches take less time. + */ +static struct dn_flow_queue * +find_queue(struct dn_flow_set *fs, struct ipfw_flow_id *id) +{ + int i = 0 ; /* we need i and q for new allocations */ + struct dn_flow_queue *q, *prev; + int is_v6 = IS_IP6_FLOW_ID(id); + + if ( !(fs->flags_fs & DN_HAVE_FLOW_MASK) ) + q = fs->rq[0] ; + else { + /* first, do the masking, then hash */ + id->dst_port &= fs->flow_mask.dst_port ; + id->src_port &= fs->flow_mask.src_port ; + id->proto &= fs->flow_mask.proto ; + id->flags = 0 ; /* we don't care about this one */ + if (is_v6) { + APPLY_MASK(&id->dst_ip6, &fs->flow_mask.dst_ip6); + APPLY_MASK(&id->src_ip6, &fs->flow_mask.src_ip6); + id->flow_id6 &= fs->flow_mask.flow_id6; + + i = ((id->dst_ip6.__u6_addr.__u6_addr32[0]) & 0xffff)^ + ((id->dst_ip6.__u6_addr.__u6_addr32[1]) & 0xffff)^ + ((id->dst_ip6.__u6_addr.__u6_addr32[2]) & 0xffff)^ + ((id->dst_ip6.__u6_addr.__u6_addr32[3]) & 0xffff)^ + + ((id->dst_ip6.__u6_addr.__u6_addr32[0] >> 15) & 0xffff)^ + ((id->dst_ip6.__u6_addr.__u6_addr32[1] >> 15) & 0xffff)^ + ((id->dst_ip6.__u6_addr.__u6_addr32[2] >> 15) & 0xffff)^ + ((id->dst_ip6.__u6_addr.__u6_addr32[3] >> 15) & 0xffff)^ + + ((id->src_ip6.__u6_addr.__u6_addr32[0] << 1) & 0xfffff)^ + ((id->src_ip6.__u6_addr.__u6_addr32[1] << 1) & 0xfffff)^ + ((id->src_ip6.__u6_addr.__u6_addr32[2] << 1) & 0xfffff)^ + ((id->src_ip6.__u6_addr.__u6_addr32[3] << 1) & 0xfffff)^ + + ((id->src_ip6.__u6_addr.__u6_addr32[0] << 16) & 0xffff)^ + ((id->src_ip6.__u6_addr.__u6_addr32[1] << 16) & 0xffff)^ + ((id->src_ip6.__u6_addr.__u6_addr32[2] << 16) & 0xffff)^ + ((id->src_ip6.__u6_addr.__u6_addr32[3] << 16) & 0xffff)^ + + (id->dst_port << 1) ^ (id->src_port) ^ + (id->proto ) ^ + (id->flow_id6); + } else { + id->dst_ip &= fs->flow_mask.dst_ip ; + id->src_ip &= fs->flow_mask.src_ip ; + + i = ( (id->dst_ip) & 0xffff ) ^ + ( (id->dst_ip >> 15) & 0xffff ) ^ + ( 
(id->src_ip << 1) & 0xffff ) ^ + ( (id->src_ip >> 16 ) & 0xffff ) ^ + (id->dst_port << 1) ^ (id->src_port) ^ + (id->proto ); + } + i = i % fs->rq_size ; + /* finally, scan the current list for a match */ + searches++ ; + for (prev=NULL, q = fs->rq[i] ; q ; ) { + search_steps++; + if (is_v6 && + IN6_ARE_ADDR_EQUAL(&id->dst_ip6,&q->id.dst_ip6) && + IN6_ARE_ADDR_EQUAL(&id->src_ip6,&q->id.src_ip6) && + id->dst_port == q->id.dst_port && + id->src_port == q->id.src_port && + id->proto == q->id.proto && + id->flags == q->id.flags && + id->flow_id6 == q->id.flow_id6) + break ; /* found */ + + if (!is_v6 && id->dst_ip == q->id.dst_ip && + id->src_ip == q->id.src_ip && + id->dst_port == q->id.dst_port && + id->src_port == q->id.src_port && + id->proto == q->id.proto && + id->flags == q->id.flags) + break ; /* found */ + + /* No match. Check if we can expire the entry */ + if (pipe_expire && QUEUE_IS_IDLE(q)) { + /* entry is idle and not in any heap, expire it */ + struct dn_flow_queue *old_q = q ; + + if (prev != NULL) + prev->next = q = q->next ; + else + fs->rq[i] = q = q->next ; + fs->rq_elements-- ; *** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
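
As a footnote for readers following the code above: the mask-then-hash
lookup that find_queue() performs can be exercised in isolation. The
sketch below is a simplified userland rendition (IPv4 only; struct
flow_id, hash_flow() and NBUCKETS are hypothetical names invented here,
not part of the commit), reproducing the same masking and hash steps:

#include <stdint.h>
#include <stdio.h>

/* simplified IPv4 5-tuple, standing in for struct ipfw_flow_id */
struct flow_id {
    uint32_t dst_ip, src_ip;
    uint16_t dst_port, src_port;
    uint8_t proto;
};

#define NBUCKETS 64    /* plays the role of fs->rq_size */

/* Apply the flowset mask in place, then hash as find_queue() does. */
static int
hash_flow(struct flow_id *id, const struct flow_id *mask)
{
    int i;

    id->dst_ip &= mask->dst_ip;
    id->src_ip &= mask->src_ip;
    id->dst_port &= mask->dst_port;
    id->src_port &= mask->src_port;
    id->proto &= mask->proto;

    i = ((id->dst_ip) & 0xffff) ^
        ((id->dst_ip >> 15) & 0xffff) ^
        ((id->src_ip << 1) & 0xffff) ^
        ((id->src_ip >> 16) & 0xffff) ^
        (id->dst_port << 1) ^ (id->src_port) ^
        (id->proto);
    return i % NBUCKETS;
}

int
main(void)
{
    /* mask keeping only src_ip: one dynamic queue per source host */
    struct flow_id mask = { .src_ip = 0xffffffff };
    struct flow_id a = { .dst_ip = 0x0a000001, .src_ip = 0xc0a80001,
        .dst_port = 80, .src_port = 1025, .proto = 6 };
    struct flow_id b = a;

    b.dst_port = 443;    /* different service, same source host */
    printf("bucket(a)=%d bucket(b)=%d\n",
        hash_flow(&a, &mask), hash_flow(&b, &mask));
    /*
     * Both calls print the same bucket: the mask zeroes every field
     * except src_ip, so the two packets belong to one dummynet queue.
     */
    return 0;
}

Run standalone, both packets land in the same bucket, which is exactly
why masked fields let many flows share one queue in the kernel code.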