Date:      Wed, 19 Feb 2014 09:18:11 +0000
From:      "Eggert, Lars" <lars@netapp.com>
To:        "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
Cc:        Midori Kato <katoon@sfc.wide.ad.jp>
Subject:   DCTCP for FreeBSD
Message-ID:  <BE8726B3-7AC9-4CB8-8D12-E05F54AB59AB@netapp.com>

Hi,

Midori Kato has implemented Microsoft's/Stanford's Datacenter TCP
(DCTCP) for FreeBSD as part of her MS thesis with me. Find a patch
attached.

Also note that we're documenting a specification for DCTCP in an IETF
draft: http://tools.ietf.org/html/draft-bensley-tcpm-dctcp

Microsoft has made a licensing statement (RAND-Z) on the technology to
the IETF: https://datatracker.ietf.org/ipr/2319/ (I'm not sure what this
means for an eventual inclusion in FreeBSD.)

Roughly, Midori's patch consists of an extension of the modular
congestion control framework to expose ECN information to the modules, a
module to implement DCTCP, and a few experimental variants. See Midori's
explanation:

> [1] A change to the modular congestion control framework (see Section
> 4.1 if needed)
> DCTCP processes ECN differently from RFC3168. We need three functions
> to do the following ECN processing:
>  a) The kernel decides whether an ECE flag should be set in the next
>     outgoing TCP segment by snooping the ECN bits in the IP and TCP
>     headers. (tcp_input.c)
>  b) The kernel reacts to congestion if an ECE flag is set in an
>     arriving TCP segment. (tcp_input.c)
>  c) After the outgoing TCP segment is generated, the kernel decides
>     whether an ECT bit should be set in the ECN field of the IP header
>     of the outgoing packet. (tcp_output.c)
> The current framework has no hook functions for (a) and (b).
> Therefore, I add two functions to the modular cc framework:
> ecnpkt_handler() and ect_handler().
>
> - ecnpkt_handler() lets the kernel do additional ECN processing by
>   snooping the ECN field in the IP and TCP headers. As an option, this
>   function takes a flag that tells whether the ACK for this segment
>   would be delayed. It returns an integer value; when the return value
>   is set (non-zero), the kernel is forced to disable the delayed ACK.
> - ect_handler() lets the kernel use a rule different from RFC3168 for
>   ECT marking of the outgoing segment. It returns an integer value; if
>   the value is set (non-zero), the ECT bit is set in the outgoing
>   segment.
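
To make the shape of the two hooks concrete, here is a minimal,
self-contained sketch in plain C. It is not the kernel code: the
*_sketch types, call_ect_handler(), always_ect() and the two example
algorithm tables are names made up for illustration, and the real
definitions are the ones in the attached patch. It only shows the idea
that a hook left NULL keeps the RFC3168 behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for the kernel's struct cc_var. */
struct cc_var_sketch {
	int	ce_prev;	/* CE state of the previous segment */
};

/* Hypothetical stand-in for struct cc_algo with only the two new hooks. */
struct cc_algo_sketch {
	const char *name;
	/* Non-zero return: the stack should disable the delayed ACK. */
	int	(*ecnpkt_handler)(struct cc_var_sketch *ccv, uint8_t iptos,
		    int cwr, int is_delayack);
	/* Non-zero return: mark the outgoing segment ECT. */
	int	(*ect_handler)(struct cc_var_sketch *ccv);
};

/* Wrapper: a module without the hook gets the default answer (0). */
static int
call_ect_handler(const struct cc_algo_sketch *algo, struct cc_var_sketch *ccv)
{
	assert(algo != NULL);
	if (algo->ect_handler != NULL)
		return (algo->ect_handler(ccv) != 0);
	return (0);
}

/* A DCTCP-like module asks for ECT on every segment. */
static int
always_ect(struct cc_var_sketch *ccv)
{
	(void)ccv;
	return (1);
}

/* Example tables: one module without hooks, one with ect_handler set. */
static const struct cc_algo_sketch newreno_like = { .name = "newreno-like" };
static const struct cc_algo_sketch dctcp_like = {
	.name = "dctcp-like",
	.ect_handler = always_ect,
};
```

With these tables, call_ect_handler() answers 0 for the module that
leaves the hook unset and 1 for the DCTCP-like module.
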
>
>
> [2] Five changes from the original DCTCP algorithm
> To reflect the motivation behind DCTCP, I modified the following
> processing. The first four modifications are for senders and the last
> modification is for receivers.
>
> (1) No congestion recovery on receipt of ECE flags (see section
> 4.2.1 if needed)
> FreeBSD handles ECN as a congestion event, but that does not hold for
> DCTCP senders. A DCTCP sender uses ECN as a means to estimate the
> extent of congestion. Therefore, I remove congestion recovery mode in
> all situations for DCTCP senders.
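
As an illustration of "extent of congestion" rather than a fixed
halving, the sketch below shrinks the window in proportion to alpha
(the fraction of marked bytes). It assumes the fixed-point convention
used in the attached patch, where alpha runs from 0 to 1024 and 1024
stands for 1.0. The function name dctcp_reduce_cwnd() and the macro
are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#define	ALPHA_SCALE_SHIFT	10	/* assumed: alpha 1024 == 1.0 */

/*
 * Window after an ECE, in bytes: cwnd * (1 - alpha / 2), rounded down
 * to whole segments and never below two segments.
 */
static uint32_t
dctcp_reduce_cwnd(uint32_t cwnd, uint32_t alpha, uint32_t mss)
{
	uint64_t cut;
	uint32_t segs;

	assert(mss > 0);
	assert(alpha <= (1u << ALPHA_SCALE_SHIFT));

	/* cut = cwnd * (alpha / 1024) / 2, computed without overflow */
	cut = ((uint64_t)cwnd * alpha) >> (ALPHA_SCALE_SHIFT + 1);
	segs = (uint32_t)((cwnd - cut) / mss);
	if (segs < 2)
		segs = 2;
	return (segs * mss);
}
```

With alpha at 1024 this degenerates to the usual halving; with alpha at
0 the window is left alone, which is why a lightly marked flow barely
slows down.
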
>
> (2) Selectable initial alpha value (see section 4.2.2 if needed)
> DCTCP defines alpha as a parameter that reflects the depth of
> congestion. When the alpha value is large, it gives a DCTCP sender a
> saw-toothed CWND behavior.
> A problem is that the alpha value is not reliable during the first
> dozen or so RTTs, because there is no way to know the depth of
> congestion in a network from the beginning. Considering the
> reliability of alpha, I think the initial alpha should be selectable
> by users according to the application. When a user chooses DCTCP for
> latency-sensitive applications, a non-zero initial alpha is
> preferred. Otherwise, DCTCP senders had better set the initial alpha
> value to zero, according to my experimental results (see section 7.2
> of the attached file).
> The default alpha value is set to zero in my implementation.
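
For readers who want to see why alpha needs several RTTs to become
meaningful, here is a small sketch of the per-window update
alpha = (1 - g) * alpha + g * F, with g = 1 / 2^shift_g and F the
fraction of marked bytes. It assumes the same fixed-point scale as the
attached patch (0 to 1024); dctcp_next_alpha() and the two macros are
names chosen for this example only.

```c
#include <assert.h>
#include <stdint.h>

#define	ALPHA_SHIFT	10			/* assumed scale: 1024 == 1.0 */
#define	ALPHA_MAX	(1u << ALPHA_SHIFT)

/* One EWMA step over a window of bytes_total bytes, bytes_ecn marked. */
static uint32_t
dctcp_next_alpha(uint32_t alpha, uint32_t bytes_ecn, uint32_t bytes_total,
    unsigned shift_g)
{
	uint64_t next;

	assert(shift_g <= ALPHA_SHIFT);
	assert(alpha <= ALPHA_MAX);

	if (bytes_total == 0)
		bytes_total = 1;	/* avoid division by zero */

	/* (1 - g) * alpha  +  g * F, with F scaled to 0..1024 */
	next = alpha - (alpha >> shift_g) +
	    (((uint64_t)bytes_ecn << (ALPHA_SHIFT - shift_g)) / bytes_total);
	return (next > ALPHA_MAX ? ALPHA_MAX : (uint32_t)next);
}
```

Starting from alpha = 0 with shift_g = 4, a fully marked window only
raises alpha to 64 (about 0.06), so the estimate trails the real
marking rate for many windows. That is the unreliability the paragraph
above refers to.
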
>
> (3) Alpha value initialization after an idle period (see section 4.2.3
> if needed)
> How long an idle period lasts is not predictable. Therefore, for a
> DCTCP sender, using the outdated alpha after an idle period is not a
> good idea. A DCTCP sender resets alpha to the initial value when an
> idle period occurs.
>
> The following changes are applied to eliminate a compatibility issue
> with standard ECN as defined in RFC3168. DCTCP and standard ECN servers
> have no way to identify which mechanism is in use on the peer. Thus,
> we need to avoid the worst case in a network that mixes DCTCP
> senders/receivers and standard ECN senders/receivers.
> (4) Using the CWR flag when the ECE flag is found for an RTT (see
> section 5.1 if needed)
> This change applies to the situation where the sender uses DCTCP and
> the receiver uses standard ECN.
> In that situation, I find that a DCTCP sender minimizes CWND. The
> detailed technical reason is described in section 4.2 of my attached
> file. Fortunately, the current tcp_input() function already covers
> this change, so there is no modification for it in my patch.
>
> (5) Enabling delayed ACK on receipt of the CWR flag (see section
> 5.2 if needed)
> This change applies to the situation where the sender uses standard
> ECN and the receiver uses DCTCP. In that situation, I find that,
> without this change, a standard ECN sender increases CWND by less
> than expected. The detailed technical reason is described in section
> 5.2 of my attached file.
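
The receiver-side rule can be sketched as a tiny state machine. This is
a simplified reading of the description above and of the handler in the
attached patch, not a copy of it: the is_delayack input is left out,
and struct dctcp_rx, dctcp_ack_now() and the ECN_* macros are
illustrative names. The codepoint values are the standard ones from
RFC3168 (Not-ECT 0, ECT(1) 1, ECT(0) 2, CE 3).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	ECN_MASK	0x03	/* ECN field in the IP TOS byte */
#define	ECN_NOTECT	0x00
#define	ECN_CE		0x03

/* Hypothetical per-connection receiver state. */
struct dctcp_rx {
	int	ce_prev;	/* was the previous ECT segment CE-marked? */
};

/*
 * Returns 1 if the ACK should go out immediately, 0 if it may be
 * delayed. An immediate ACK is wanted when the CE state flips, except
 * that a segment carrying CWR keeps the delayed ACK (change (5)).
 */
static int
dctcp_ack_now(struct dctcp_rx *rx, uint8_t iptos, int cwr)
{
	int ce, changed;

	assert(rx != NULL);

	/* Not-ECT segments leave the remembered CE state untouched. */
	if ((iptos & ECN_MASK) == ECN_NOTECT)
		return (0);

	ce = ((iptos & ECN_MASK) == ECN_CE);
	changed = (ce != rx->ce_prev);
	rx->ce_prev = ce;

	if (cwr)
		return (0);
	return (changed);
}
```

Feeding it ECT(0), CE, CE, ECT(0) yields 0, 1, 0, 1: only the
transitions trigger an immediate ACK, and a CE segment that also
carries CWR is still allowed to be delayed.
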


The patch is attached and should apply to a recent -CURRENT. Midori's
thesis (which she refers to in the quoted text above) is at
https://eggert.org/students/kato-thesis.pdf

Lars


diff --git a/sys/modules/cc/Makefile b/sys/modules/cc/Makefile
index 7b851f5..7f4e94e 100644
--- a/sys/modules/cc/Makefile
+++ b/sys/modules/cc/Makefile
@@ -3,6 +3,7 @@
 SUBDIR=	cc_cdg \
 	cc_chd \
 	cc_cubic \
+	cc_dctcp \
 	cc_hd \
 	cc_htcp \
 	cc_vegas
diff --git a/sys/modules/cc/cc_dctcp/Makefile b/sys/modules/cc/cc_dctcp/Makefile
new file mode 100644
index 0000000..32919cd
--- /dev/null
+++ b/sys/modules/cc/cc_dctcp/Makefile
@@ -0,0 +1,9 @@
+# $FreeBSD$
+
+.include <bsd.own.mk>
+
+.PATH: ${.CURDIR}/../../../netinet/cc
+KMOD=	cc_dctcp
+SRCS=	cc_dctcp.c
+
+.include <bsd.kmod.mk>
diff --git a/sys/netinet/cc.h b/sys/netinet/cc.h
index 14b4a9d..381f94e 100644
--- a/sys/netinet/cc.h
+++ b/sys/netinet/cc.h
@@ -143,6 +143,13 @@ struct cc_algo {
 	/* Called when data transfer resumes after an idle period. */
 	void	(*after_idle)(struct cc_var *ccv);
 
+	/* Called for an additional ECN processing apart from RFC3168. */
+	int	(*ecnpkt_handler)(struct cc_var *ccv, uint8_t iptos, int cwr,
+		    int is_delayack);
+
+	/* Called when the host marks ECN capable transmission (ECT). */
+	int	(*ect_handler)(struct cc_var *ccv);
+
 	STAILQ_ENTRY (cc_algo) entries;
 };
 
diff --git a/sys/netinet/cc/cc_dctcp.c b/sys/netinet/cc/cc_dctcp.c
new file mode 100644
index 0000000..d8cd166
--- /dev/null
+++ b/sys/netinet/cc/cc_dctcp.c
@@ -0,0 +1,442 @@
+/*-
+ * Copyright (c) 2007-2008
+ * 	Swinburne University of Technology, Melbourne, Australia
+ * Copyright (c) 2009-2010 Lawrence Stewart <lstewart@freebsd.org>
+ * Copyright (c) 2014 Midori Kato <katoon@sfc.wide.ad.jp>
+ * Copyright (c) 2014 The FreeBSD Foundation
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/*
+ * An implementation of the DCTCP algorithm for FreeBSD, based on
+ * "Data Center TCP (DCTCP)" by M. Alizadeh, A. Greenberg, D. A. Maltz,
+ * J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan.,
+ * in ACM Conference on SIGCOMM 2010, New York, USA,
+ * Originally released as the contribution of Microsoft Research project.
+ */
+
+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD$");
+
+#include <sys/param.h>
+#include <sys/kernel.h>
+#include <sys/malloc.h>
+#include <sys/module.h>
+#include <sys/socket.h>
+#include <sys/socketvar.h>
+#include <sys/sysctl.h>
+#include <sys/systm.h>
+
+#include <net/vnet.h>
+
+#include <netinet/in.h>
+#include <netinet/ip.h>
+#include <netinet/cc.h>
+#include <netinet/tcp_seq.h>
+#include <netinet/tcp_var.h>
+
+#include <netinet/cc/cc_module.h>
+
+#define	CAST_PTR_INT(X)	(*((int*)(X)))
+
+static VNET_DEFINE(uint32_t, dctcp_shift_g) = 4;
+static VNET_DEFINE(uint32_t, dctcp_slowstart) = 0;
+#define V_dctcp_shift_g		VNET(dctcp_shift_g)
+#define	V_dctcp_slowstart	VNET(dctcp_slowstart)
+
+struct dctcp {
+	/* # of marked bytes during a RTT */
+	int     bytes_ecn;
+	/* # of acked bytes during a RTT */
+	int     bytes_total;
+	/* the fraction of marked bytes */
+	int     alpha;
+	/* CE state of the last segment */
+	int     ce_prev;
+	/* end sequence number of the current window */
+	int     save_sndnxt;
+	/* ECE flag in this segment */
+	int	is_ece;
+	/* ECE flag in the last segment */
+	int	ece_prev;
+	/* # of congestion events */
+	uint32_t	num_cong_events;
+};
+
+static MALLOC_DEFINE(M_dctcp, "dctcp data",
+    "Per connection data required for the dctcp algorithm");
+
+static void	dctcp_ack_received(struct cc_var *ccv, uint16_t type);
+static void	dctcp_after_idle(struct cc_var *ccv);
+static void	dctcp_cb_destroy(struct cc_var *ccv);
+static int	dctcp_cb_init(struct cc_var *ccv);
+static void	dctcp_cong_signal(struct cc_var *ccv, uint32_t type);
+static void	dctcp_conn_init(struct cc_var *ccv);
+static void	dctcp_post_recovery(struct cc_var *ccv);
+static int	dctcp_ecnpkt_handler(struct cc_var *ccv, uint8_t iptos, int cwr,
+		    int is_delayack);
+static int	dctcp_ecthandler(struct cc_var *ccv);
+static void	dctcp_update_alpha(struct cc_var *ccv);
+
+struct cc_algo dctcp_cc_algo = {
+	.name = "dctcp",
+	.ack_received = dctcp_ack_received,
+	.cb_destroy = dctcp_cb_destroy,
+	.cb_init = dctcp_cb_init,
+	.cong_signal = dctcp_cong_signal,
+	.conn_init = dctcp_conn_init,
+	.post_recovery = dctcp_post_recovery,
+	.ecnpkt_handler = dctcp_ecnpkt_handler,
+	.after_idle = dctcp_after_idle,
+	.ect_handler = dctcp_ecthandler,
+};
+
+static void
+dctcp_ack_received(struct cc_var *ccv, uint16_t type)
+{
+	struct dctcp *dctcp_data;
+	int bytes_acked = 0;
+
+	dctcp_data = ccv->cc_data;
+
+	/*
+	 * DCTCP does not treat ECN as a congestion event.
+	 * Thus, DCTCP always runs the ACK processing outside
+	 * of congestion recovery.
+	 */
+	if (IN_CONGRECOVERY(CCV(ccv, t_flags))) {
+		EXIT_CONGRECOVERY(CCV(ccv, t_flags));
+		newreno_cc_algo.ack_received(ccv, type);
+		ENTER_CONGRECOVERY(CCV(ccv, t_flags));
+	} else
+		newreno_cc_algo.ack_received(ccv, type);
+
+	/* Updates the fraction of marked bytes. */
+	if (CCV(ccv, t_flags) & TF_ECN_PERMIT) {
+
+		if (type == CC_DUPACK)
+			bytes_acked = CCV(ccv, t_maxseg);
+
+		if (type == CC_ACK)
+			bytes_acked = ccv->bytes_this_ack;
+
+		/* Update total bytes. */
+		dctcp_data->bytes_total += bytes_acked;
+
+		/* Update total marked bytes. */
+		if (dctcp_data->is_ece) {
+			if (!dctcp_data->ece_prev
+			    && bytes_acked > CCV(ccv, t_maxseg)) {
+				dctcp_data->bytes_ecn +=
+				    (bytes_acked - CCV(ccv, t_maxseg));
+			} else
+				dctcp_data->bytes_ecn += bytes_acked;
+			dctcp_data->ece_prev = 1;
+		} else {
+			if (dctcp_data->ece_prev
+			    && bytes_acked > CCV(ccv, t_maxseg))
+				dctcp_data->bytes_ecn += CCV(ccv, t_maxseg);
+			dctcp_data->ece_prev = 0;
+		}
+		dctcp_data->is_ece = 0;
+
+		/*
+		 * Update the fraction of marked bytes at the end of
+		 * current window size.
+		 */
+		if ((IN_FASTRECOVERY(CCV(ccv, t_flags)) &&
+		    SEQ_GEQ(ccv->curack, CCV(ccv, snd_recover))) ||
+		    (!IN_FASTRECOVERY(CCV(ccv, t_flags)) &&
+		    SEQ_GT(ccv->curack, dctcp_data->save_sndnxt)))
+			dctcp_update_alpha(ccv);
+	}
+}
+
+static void
+dctcp_after_idle(struct cc_var *ccv)
+{
+	struct dctcp *dctcp_data;
+
+	dctcp_data = ccv->cc_data;
+
+	/* Initialize internal parameters after idle time */
+	dctcp_data->bytes_ecn = 0;
+	dctcp_data->bytes_total = 0;
+	dctcp_data->save_sndnxt = CCV(ccv, snd_nxt);
+	dctcp_data->alpha = 0;
+	dctcp_data->is_ece = 0;
+	dctcp_data->ece_prev = 0;
+	dctcp_data->num_cong_events = 0;
+
+	newreno_cc_algo.after_idle(ccv);
+}
+
+static void
+dctcp_cb_destroy(struct cc_var *ccv)
+{
+	if (ccv->cc_data != NULL)
+		free(ccv->cc_data, M_dctcp);
+}
+
+static int
+dctcp_cb_init(struct cc_var *ccv)
+{
+	struct dctcp *dctcp_data;
+
+	dctcp_data = malloc(sizeof(struct dctcp), M_dctcp, M_NOWAIT|M_ZERO);
+
+	if (dctcp_data == NULL)
+		return (ENOMEM);
+
+	/* Initialize some key variables with sensible defaults. */
+	dctcp_data->bytes_ecn = 0;
+	dctcp_data->bytes_total = 0;
+	dctcp_data->alpha = 0;
+	dctcp_data->save_sndnxt = 0;
+	dctcp_data->ce_prev = 0;
+	dctcp_data->is_ece = 0;
+	dctcp_data->ece_prev = 0;
+	dctcp_data->num_cong_events = 0;
+
+	ccv->cc_data = dctcp_data;
+	return (0);
+}
+
+/*
+ * Perform any necessary tasks before we enter congestion recovery.
+ */
+static void
+dctcp_cong_signal(struct cc_var *ccv, uint32_t type)
+{
+	struct dctcp *dctcp_data;
+	u_int win, mss;
+
+	dctcp_data = ccv->cc_data;
+	win = CCV(ccv, snd_cwnd);
+	mss = CCV(ccv, t_maxseg);
+
+	switch (type) {
+	case CC_NDUPACK:
+		if (!IN_FASTRECOVERY(CCV(ccv, t_flags))) {
+			if (!IN_CONGRECOVERY(CCV(ccv, t_flags))) {
+				CCV(ccv, snd_ssthresh) = mss *
+				    max(win / 2 / mss, 2);
+				dctcp_data->num_cong_events++;
+			} else {
+				/* cwnd has already been reduced by congestion
+				 * recovery. Restore the cwnd value saved in
+				 * snd_cwnd_prev and recalculate snd_ssthresh.
+				 */
+				win = CCV(ccv, snd_cwnd_prev);
+				CCV(ccv, snd_ssthresh) =
+				    max(win / 2 / mss, 2) * mss;
+			}
+			ENTER_RECOVERY(CCV(ccv, t_flags));
+		}
+		break;
+	case CC_ECN:
+		/*
+		 * Save current snd_cwnd when the host encounters both
+		 * congestion recovery and fast recovery.
+		 */
+		CCV(ccv, snd_cwnd_prev) = win;
+		if (!IN_CONGRECOVERY(CCV(ccv, t_flags))) {
+			if (V_dctcp_slowstart &&
+			    dctcp_data->num_cong_events++ == 0) {
+				CCV(ccv, snd_ssthresh) =
+				    mss * max(win / 2 / mss, 2);
+				dctcp_data->alpha = 1024;
+				dctcp_data->bytes_ecn = 0;
+				dctcp_data->bytes_total = 0;
+				dctcp_data->save_sndnxt = CCV(ccv, snd_nxt);
+			} else
+				CCV(ccv, snd_ssthresh) = max((win - ((win *
+				    dctcp_data->alpha) >> 11)) / mss, 2) * mss;
+			CCV(ccv, snd_cwnd) = CCV(ccv, snd_ssthresh);
+			ENTER_CONGRECOVERY(CCV(ccv, t_flags));
+		}
+		dctcp_data->is_ece = 1;
+		break;
+	case CC_RTO:
+		if (CCV(ccv, t_flags) & TF_ECN_PERMIT) {
+			CCV(ccv, t_flags) |= TF_ECN_SND_CWR;
+			dctcp_update_alpha(ccv);
+			dctcp_data->save_sndnxt += CCV(ccv, t_maxseg);
+			dctcp_data->num_cong_events++;
+		}
+		break;
+	}
+}
+
+static void
+dctcp_conn_init(struct cc_var *ccv)
+{
+	struct dctcp *dctcp_data;
+
+	dctcp_data = ccv->cc_data;
+
+	if (CCV(ccv, t_flags) & TF_ECN_PERMIT)
+		dctcp_data->save_sndnxt = CCV(ccv, snd_nxt);
+}
+
+/*
+ * Perform any necessary tasks before we exit congestion recovery.
+ */
+static void
+dctcp_post_recovery(struct cc_var *ccv)
+{
+	newreno_cc_algo.post_recovery(ccv);
+
+	if (CCV(ccv, t_flags) & TF_ECN_PERMIT)
+		dctcp_update_alpha(ccv);
+}
+
+static int
+dctcp_ecnpkt_handler(struct cc_var *ccv, uint8_t iptos, int cwr, int is_delayack)
+{
+	struct dctcp *dctcp_data;
+	int ret = 0;
+
+	dctcp_data = ccv->cc_data;
+	/*
+	 * DCTCP sends an ACK immediately when the CE state of
+	 * this segment differs from that of the last segment.
+	 * A segment that carries the CWR flag keeps the delayed
+	 * ACK (see the check after the switch).
+	 */
+	switch (iptos & IPTOS_ECN_MASK) {
+	case IPTOS_ECN_CE:
+		if (!dctcp_data->ce_prev && is_delayack)
+			ret = 1;
+		dctcp_data->ce_prev = 1;
+		CCV(ccv, t_flags) |= TF_ECN_SND_ECE;
+		break;
+	case IPTOS_ECN_ECT0:
+		if (dctcp_data->ce_prev && is_delayack)
+			ret = 1;
+		CCV(ccv, t_flags) &= ~TF_ECN_SND_ECE;
+		dctcp_data->ce_prev = 0;
+		break;
+	case IPTOS_ECN_ECT1:
+		if (dctcp_data->ce_prev && is_delayack)
+			ret = 1;
+		CCV(ccv, t_flags) &= ~TF_ECN_SND_ECE;
+		dctcp_data->ce_prev = 0;
+		break;
+	}
+	if (cwr && is_delayack)
+		ret = 0;
+
+	return (ret);
+}
+
+static int
+dctcp_ecthandler(struct cc_var *ccv)
+{
+	/* DCTCP always marks ECT */
+	return (1);
+}
+
+/*
+ * Update the fraction of marked bytes named alpha. Then, initialize
+ * several internal parameters at the end of this function.
+ */
+static void
+dctcp_update_alpha(struct cc_var *ccv)
+{
+	struct dctcp *dctcp_data;
+	int alpha_prev;
+
+	dctcp_data = ccv->cc_data;
+
+	alpha_prev = dctcp_data->alpha;
+
+	dctcp_data->bytes_total = max(dctcp_data->bytes_total, 1);
+
+	/*
+	 * Update alpha: alpha = (1 - g) * alpha + g * F.
+	 * Alpha is kept within 0 - 1024 (1024 represents 1.0).
+	 * XXXMIDORI Is more fine-grained alpha necessary?
+	 */
+	dctcp_data->alpha = min(alpha_prev - (alpha_prev >> V_dctcp_shift_g) +
+	    (dctcp_data->bytes_ecn << (10 - V_dctcp_shift_g)) /
+	    dctcp_data->bytes_total, 1024);
+
+	/* Initialize internal parameters for next alpha calculation */
+	dctcp_data->bytes_ecn = 0;
+	dctcp_data->bytes_total = 0;
+	dctcp_data->save_sndnxt = CCV(ccv, snd_nxt);
+}
+
+static int
+dctcp_shift_g_handler(SYSCTL_HANDLER_ARGS)
+{
+	int error;
+	uint32_t new;
+
+	new = V_dctcp_shift_g ;
+	error = sysctl_handle_int(oidp, &new, 0, req);
+	if (error == 0 && req->newptr != NULL) {
+		if (new > 10)	/* alpha has 10 fractional bits */
+			error = EINVAL;
+		else
+			V_dctcp_shift_g = new;
+	}
+
+	return (error);
+}
+
+static int
+dctcp_slowstart_handler(SYSCTL_HANDLER_ARGS)
+{
+	int error;
+	uint32_t new;
+
+	new = V_dctcp_slowstart;
+	error = sysctl_handle_int(oidp, &new, 0, req);
+	if (error == 0 && req->newptr != NULL) {
+		if (new > 1)
+			error = EINVAL;
+		else
+			V_dctcp_slowstart = new;
+	}
+
+	return (error);
+}
+
+SYSCTL_DECL(_net_inet_tcp_cc_dctcp);
+SYSCTL_NODE(_net_inet_tcp_cc, OID_AUTO, dctcp, CTLFLAG_RW, NULL,
+    "dctcp congestion control related settings");
+
+SYSCTL_VNET_PROC(_net_inet_tcp_cc_dctcp, OID_AUTO, shift_g,
+    CTLTYPE_UINT|CTLFLAG_RW, &VNET_NAME(dctcp_shift_g), 4,
+    &dctcp_shift_g_handler,
+    "IU", "dctcp shift parameter");
+
+SYSCTL_VNET_PROC(_net_inet_tcp_cc_dctcp, OID_AUTO, slowstart,
+    CTLTYPE_UINT|CTLFLAG_RW, &VNET_NAME(dctcp_slowstart), 0,
+    &dctcp_slowstart_handler,
+    "IU", "half CWND reduction after the first slow start");
+
+DECLARE_CC_MODULE(dctcp, &dctcp_cc_algo);
diff --git a/sys/netinet/tcp_input.c b/sys/netinet/tcp_input.c
index 20c22ed..2822248 100644
--- a/sys/netinet/tcp_input.c
+++ b/sys/netinet/tcp_input.c
@@ -455,6 +455,32 @@ cc_post_recovery(struct tcpcb *tp, struct tcphdr *th)
 	tp->t_bytes_acked = 0;
 }
 
+/*
+ * Indicate whether this ack should be delayed.  We can delay the ack if
+ *	- there is no delayed ack timer in progress and
+ *	- our last ack wasn't a 0-sized window.  We never want to delay
+ *	  the ack that opens up a 0-sized window and
+ *		- delayed acks are enabled or
+ *		- this is a half-synchronized T/TCP connection.
+ */
+#define DELAY_ACK(tp)							\
+	((!tcp_timer_active(tp, TT_DELACK) &&				\
+	    (tp->t_flags & TF_RXWIN0SENT) == 0) &&			\
+	    (V_tcp_delack_enabled || (tp->t_flags & TF_NEEDSYN)))
+
+static void inline
+cc_ecnpkt_handler(struct tcpcb *tp, struct tcphdr *th, uint8_t iptos)
+{
+	INP_WLOCK_ASSERT(tp->t_inpcb);
+
+	if (CC_ALGO(tp)->ecnpkt_handler != NULL) {
+		if (CC_ALGO(tp)->ecnpkt_handler(tp->ccv, iptos,
+		    (th->th_flags & TH_CWR), DELAY_ACK(tp))) {
+			tcp_timer_activate(tp, TT_DELACK, tcp_delacktime);
+		}
+	}
+}
+
 static inline void
 tcp_fields_to_host(struct tcphdr *th)
 {
@@ -502,19 +528,6 @@ do { \
 #endif
 
 /*
- * Indicate whether this ack should be delayed.  We can delay the ack if
- *	- there is no delayed ack timer in progress and
- *	- our last ack wasn't a 0-sized window.  We never want to delay
- *	  the ack that opens up a 0-sized window and
- *		- delayed acks are enabled or
- *		- this is a half-synchronized T/TCP connection.
- */
-#define DELAY_ACK(tp)							\
-	((!tcp_timer_active(tp, TT_DELACK) &&				\
-	    (tp->t_flags & TF_RXWIN0SENT) == 0) &&			\
-	    (V_tcp_delack_enabled || (tp->t_flags & TF_NEEDSYN)))
-
-/*
  * TCP input handling is split into multiple parts:
  *   tcp6_input is a thin wrapper around tcp_input for the extended
  *	ip6_protox[] call format in ip6_input
@@ -1539,6 +1552,10 @@ tcp_do_segment(struct mbuf *m, struct tcphdr *th, struct socket *so,
 			TCPSTAT_INC(tcps_ecn_ect1);
 			break;
 		}
+
+		/* Process a packet differently from RFC3168. */
+		cc_ecnpkt_handler(tp, th, iptos);
+
 		/* Congestion experienced. */
 		if (thflags & TH_ECE) {
 			cc_cong_signal(tp, th, CC_ECN);
diff --git a/sys/netinet/tcp_output.c b/sys/netinet/tcp_output.c
index 00d5415..30e9b19 100644
--- a/sys/netinet/tcp_output.c
+++ b/sys/netinet/tcp_output.c
@@ -162,6 +162,18 @@ cc_after_idle(struct tcpcb *tp)
 		CC_ALGO(tp)->after_idle(tp->ccv);
 }
 
+static int inline
+cc_ect_handler(struct tcpcb *tp)
+{
+	INP_WLOCK_ASSERT(tp->t_inpcb);
+
+	if (CC_ALGO(tp)->ect_handler != NULL) {
+		if (CC_ALGO(tp)->ect_handler(tp->ccv))
+			return (1);
+	}
+	return (0);
+}
+
 /*
  * Tcp output routine: figure out what should be sent and send it.
  */
@@ -966,9 +978,15 @@ send:
 		 * If the peer has ECN, mark data packets with
 		 * ECN capable transmission (ECT).
 		 * Ignore pure ack packets, retransmissions and window probes.
+		 * Mark data packet with ECN capable transmission (ECT)
+		 * when CC_ALGO meets specific condition.
+		 * Or, if the peer has ECN, mark data packets with ECT
+		 * (RFC 3168). Ignore pure ack packets, retransmissions
+		 * and window probes.
 		 */
-		if (len > 0 && SEQ_GEQ(tp->snd_nxt, tp->snd_max) &&
-		    !((tp->t_flags & TF_FORCEDATA) && len == 1)) {
+		int mark_ect = cc_ect_handler(tp);
+		if (mark_ect || (len > 0 && SEQ_GEQ(tp->snd_nxt, tp->snd_max)
+		    && !((tp->t_flags & TF_FORCEDATA) && len == 1))) {
 #ifdef INET6
 			if (isipv6)
 				ip6->ip6_flow |= htonl(IPTOS_ECN_ECT0 << 20);
