Date: Sun, 2 Dec 2001 00:10:53 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
To: Matthew Dillon <dillon@apollo.backplane.com>
Cc: Richard Sharpe <sharpe@ns.aus.com>, freebsd-hackers@FreeBSD.ORG
Subject: Patch #3 (TCP / Linux / Performance)
Message-ID: <200112020810.fB28Arr77757@apollo.backplane.com>
References: <20011128153817.T61580@monorchid.lemis.com> <15364.38174.938500.946169@caddis.yogotech.com> <20011128104629.A43642@walton.maths.tcd.ie> <5.1.0.14.1.20011130181236.00a80160@postamt1.charite.de> <200111302047.fAUKlT811090@apollo.backplane.com> <200111302130.fAULUU324648@apollo.backplane.com> <3C08CF9D.2030109@ns.aus.com> <200112012138.fB1LcG837063@apollo.backplane.com>
I've fixed a couple of additional problems.
* tbench assumes that accept() propagates the NODELAY TCP option to
  the accepted socket. It doesn't in FreeBSD. Er, it didn't in
  FreeBSD... my patch fixes this.
* If the transmitter sees a 0 window it stalls waiting for an ack.
  However, if delayed acks are turned on, the receiver will not
  acknowledge a drain of the buffer immediately; it will delay.
  This causes severe stalls on localhost connections.
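
For anyone who wants to check the first item on their own system, here
is a minimal sketch (not part of the patch). It sets TCP_NODELAY on a
listening socket only, connects to it over loopback, and asks the
accepted socket whether the option is set. The function name is made up
for illustration, and the answer depends on the kernel you run it on.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper, not from the patch.
 * Returns 1 if a socket returned by accept() reports TCP_NODELAY after
 * the option was set only on the listening socket, 0 if it does not,
 * and -1 if any socket call fails.
 */
int
accepted_socket_inherits_nodelay(void)
{
	struct sockaddr_in sin;
	socklen_t slen = sizeof(sin);
	int lfd = -1, cfd = -1, afd = -1;
	int on = 1, val = 0, result = -1;
	socklen_t vlen = sizeof(val);

	lfd = socket(AF_INET, SOCK_STREAM, 0);
	if (lfd < 0)
		goto done;
	/* Set NODELAY on the listener only. */
	if (setsockopt(lfd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) < 0)
		goto done;

	/* Bind to an ephemeral loopback port and find out which one. */
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = 0;
	if (bind(lfd, (struct sockaddr *)&sin, sizeof(sin)) < 0)
		goto done;
	if (listen(lfd, 1) < 0)
		goto done;
	if (getsockname(lfd, (struct sockaddr *)&sin, &slen) < 0)
		goto done;

	/* Connect to ourselves, then accept the queued connection. */
	cfd = socket(AF_INET, SOCK_STREAM, 0);
	if (cfd < 0)
		goto done;
	if (connect(cfd, (struct sockaddr *)&sin, sizeof(sin)) < 0)
		goto done;
	afd = accept(lfd, NULL, NULL);
	if (afd < 0)
		goto done;

	/* Ask the accepted socket whether NODELAY came along. */
	if (getsockopt(afd, IPPROTO_TCP, TCP_NODELAY, &val, &vlen) < 0)
		goto done;
	result = (val != 0);
done:
	if (afd >= 0)
		close(afd);
	if (cfd >= 0)
		close(cfd);
	if (lfd >= 0)
		close(lfd);
	return result;
}
```

Call it from a small main() and print the return value. A program that
wants to be portable should not rely on the result and can simply set
TCP_NODELAY itself on every socket it gets back from accept().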
I've included my patch as it currently stands. This patch is
against -stable. With this patch tbench should work properly with
delayed acks turned on (as well as newreno).
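
To make the zero-window fix easier to follow, here is a small
standalone model of the decision. It is only a sketch: the struct, the
EX_ names and the helper functions are invented for illustration and
stand in for the tcpcb, TF_RXWIN0SENT and the DELAY_ACK() macro in the
patch below.

```c
#include <stdbool.h>

#define EX_RXWIN0SENT 0x1u	/* hypothetical stand-in for TF_RXWIN0SENT */

/* Hypothetical, stripped-down connection state. */
struct ex_conn {
	unsigned flags;			/* EX_RXWIN0SENT or 0 */
	bool delack_timer_pending;	/* a delayed ack is already queued */
};

/* Remember whether the window we just advertised was zero. */
void
ex_note_advertised_window(struct ex_conn *c, long win)
{
	if (win == 0)
		c->flags |= EX_RXWIN0SENT;
	else
		c->flags &= ~EX_RXWIN0SENT;
}

/*
 * An ack may be delayed only if delayed acks are enabled, no delayed
 * ack timer is pending, and the last window we advertised was not
 * zero. The ack that reopens a zero window must go out immediately,
 * otherwise the stalled sender sits idle until the delack timer fires.
 */
bool
ex_may_delay_ack(const struct ex_conn *c, bool delack_enabled)
{
	return delack_enabled && !c->delack_timer_pending &&
	    (c->flags & EX_RXWIN0SENT) == 0;
}
```

After ex_note_advertised_window(&c, 0) the model refuses to delay the
next ack; once a non-zero window has been advertised again, delaying is
allowed as before.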
There are still a couple of unresolved issues. I noticed that when
connecting locally TCP is non-optimal... when sending a 4100 byte
data block it sends two 1460 byte packets (maxseg), then one
1176 byte packet and one 4 byte packet. The 1176 byte packet is
sent in response to a received ack, causing the last bit of info
to be written out using a small packet. This only occurs on localhost
connections, where the ack from the remote end can arrive synchronously
with the output.
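
The arithmetic behind that observation can be sketched as follows. The
helper names are made up for illustration; the code only counts
segments and does not model the stack. With a 1460 byte maxseg, a 4100
byte write needs three segments if only the last one is allowed to be
short (1460 + 1460 + 1180), while the observed sequence uses four.

```c
#include <stddef.h>

/*
 * Hypothetical helper: number of segments needed to carry len bytes
 * when every segment except possibly the last is a full mss. The size
 * of the final segment is stored in *last_len when it is not NULL.
 */
size_t
ex_min_segments(size_t len, size_t mss, size_t *last_len)
{
	size_t full, rem;

	if (mss == 0 || len == 0) {
		if (last_len != NULL)
			*last_len = 0;
		return 0;
	}
	full = len / mss;
	rem = len % mss;
	if (last_len != NULL)
		*last_len = (rem != 0) ? rem : mss;
	return full + (rem != 0 ? 1 : 0);
}

/* Sum of a list of segment sizes, to check they cover the whole write. */
size_t
ex_total_bytes(const size_t *seg, size_t nseg)
{
	size_t i, total = 0;

	for (i = 0; i < nseg; i++)
		total += seg[i];
	return total;
}
```

Feeding in the observed sizes { 1460, 1460, 1176, 4 } confirms they add
up to 4100 bytes, so the extra packet is purely a matter of where the
data was split.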
I will be committing these to -current now and to -stable tomorrow.
tbench results:

    test1 (from test1)     - uses TCP's 16K receive & xmit buffers
    localhost (from test1) - uses localhost's 48K buffers
    test2 (from test1)     - uses TCP's 16K receive & xmit buffers
                             (100BaseTX full duplex switch)

With delayed acks turned on (default) and newreno turned on (default):
./tbench 1 test1
Throughput 23.3951 MB/sec (NB=29.2439 MB/sec 233.951 MBit/sec) 1 procs
./tbench 1 localhost
Throughput 29.6299 MB/sec (NB=37.0374 MB/sec 296.299 MBit/sec) 1 procs
./tbench 2 localhost
Throughput 42.963 MB/sec (NB=53.7038 MB/sec 429.63 MBit/sec) 2 procs
./tbench 3 localhost
Throughput 43.9328 MB/sec (NB=54.9161 MB/sec 439.328 MBit/sec) 3 procs
./tbench 1 test2
Throughput 6.43315 MB/sec (NB=8.04144 MB/sec 64.3315 MBit/sec) 1 procs
./tbench 2 test2
Throughput 8.94636 MB/sec (NB=11.183 MB/sec 89.4636 MBit/sec) 2 procs
./tbench 3 test2
Throughput 9.82137 MB/sec (NB=12.2767 MB/sec 98.2137 MBit/sec) 3 procs
With delayed acks turned off:
./tbench 1 test1
Throughput 19.8444 MB/sec (NB=24.8055 MB/sec 198.444 MBit/sec) 1 procs
./tbench 1 localhost
Throughput 26.1442 MB/sec (NB=32.6802 MB/sec 261.442 MBit/sec) 1 procs
./tbench 2 localhost
Throughput 37.1861 MB/sec (NB=46.4826 MB/sec 371.861 MBit/sec) 2 procs
./tbench 3 localhost
Throughput 37.5582 MB/sec (NB=46.9477 MB/sec 375.582 MBit/sec) 3 procs
./tbench 1 test2
Throughput 6.32798 MB/sec (NB=7.90998 MB/sec 63.2798 MBit/sec) 1 procs
./tbench 2 test2
Throughput 8.4896 MB/sec (NB=10.612 MB/sec 84.896 MBit/sec) 2 procs
./tbench 3 test2
Throughput 9.57453 MB/sec (NB=11.9682 MB/sec 95.7453 MBit/sec) 3 procs
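
To compare the two sets of numbers, a relative difference is enough.
The helper below is a sketch with an invented name; it takes the
throughput with delayed acks on and off and returns how much faster the
"on" run was, in percent.

```c
/*
 * Hypothetical helper: percentage by which the throughput with delayed
 * acks on exceeds the throughput with delayed acks off. Returns 0.0 if
 * the baseline is not positive.
 */
double
ex_percent_gain(double mb_per_sec_on, double mb_per_sec_off)
{
	if (mb_per_sec_off <= 0.0)
		return 0.0;
	return (mb_per_sec_on - mb_per_sec_off) / mb_per_sec_off * 100.0;
}
```

For the single-process localhost run, ex_percent_gain(29.6299, 26.1442)
comes out at roughly 13 percent. For the single-process test2 run over
the 100BaseTX switch the gain is under 2 percent.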
-Matt
Index: netinet/tcp_input.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_input.c,v
retrieving revision 1.107.2.18
diff -u -r1.107.2.18 tcp_input.c
--- netinet/tcp_input.c 2001/11/12 22:11:24 1.107.2.18
+++ netinet/tcp_input.c 2001/12/02 07:47:01
@@ -158,10 +158,15 @@
#endif
/*
- * Indicate whether this ack should be delayed.
+ * Indicate whether this ack should be delayed. We can delay the ack if
+ * - delayed acks are enabled and
+ * - there is no delayed ack timer in progress and
+ * - our last ack wasn't a 0-sized window. We never want to delay
+ * the ack that opens up a 0-sized window.
*/
#define DELAY_ACK(tp) \
- (tcp_delack_enabled && !callout_pending(tp->tt_delack))
+ (tcp_delack_enabled && !callout_pending(tp->tt_delack) && \
+ (tp->t_flags & TF_RXWIN0SENT) == 0)
static int
tcp_reass(tp, th, tlenp, m)
@@ -840,7 +845,7 @@
#endif
tp = intotcpcb(inp);
tp->t_state = TCPS_LISTEN;
- tp->t_flags |= tp0->t_flags & (TF_NOPUSH|TF_NOOPT);
+ tp->t_flags |= tp0->t_flags & (TF_NOPUSH|TF_NOOPT|TF_NODELAY);
/* Compute proper scaling value from buffer space */
while (tp->request_r_scale < TCP_MAX_WINSHIFT &&
Index: netinet/tcp_output.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_output.c,v
retrieving revision 1.39.2.11
diff -u -r1.39.2.11 tcp_output.c
--- netinet/tcp_output.c 2001/11/30 21:34:28 1.39.2.11
+++ netinet/tcp_output.c 2001/12/02 07:37:29
@@ -116,7 +116,9 @@
u_char opt[TCP_MAXOLEN];
unsigned ipoptlen, optlen, hdrlen;
int idle, sendalot;
+#if 0
int maxburst = TCP_MAXBURST;
+#endif
struct rmxp_tao *taop;
struct rmxp_tao tao_noncached;
#ifdef INET6
@@ -268,28 +270,38 @@
win = sbspace(&so->so_rcv);
/*
- * Sender silly window avoidance. If connection is idle
- * and can send all data, a maximum segment,
- * at least a maximum default-size segment do it,
- * or are forced, do it; otherwise don't bother.
- * If peer's buffer is tiny, then send
- * when window is at least half open.
- * If retransmitting (possibly after persist timer forced us
- * to send into a small window), then must resend.
+ * Sender silly window avoidance. We transmit under the following
+ * conditions when len is non-zero:
+ *
+ * - We have a full segment
+ * - This is the last buffer in a write()/send() and we are
+ * either idle or running NODELAY
+ * - we've timed out (e.g. persist timer)
+ * - we have more then 1/2 the maximum send window's worth of
+ * data (receiver may be limited the window size)
+ * - we need to retransmit
*/
if (len) {
if (len == tp->t_maxseg)
goto send;
- if (!(tp->t_flags & TF_MORETOCOME) &&
- (idle || tp->t_flags & TF_NODELAY) &&
- (tp->t_flags & TF_NOPUSH) == 0 &&
- len + off >= so->so_snd.sb_cc)
+ /*
+ * NOTE! on localhost connections an 'ack' from the remote
+ * end may occur synchronously with the output and cause
+ * us to flush a buffer queued with moretocome. XXX
+ *
+ * note: the len + off check is almost certainly unnecessary.
+ */
+ if (!(tp->t_flags & TF_MORETOCOME) && /* normal case */
+ (idle || (tp->t_flags & TF_NODELAY)) &&
+ len + off >= so->so_snd.sb_cc &&
+ (tp->t_flags & TF_NOPUSH) == 0) {
goto send;
- if (tp->t_force)
+ }
+ if (tp->t_force) /* typ. timeout case */
goto send;
if (len >= tp->max_sndwnd / 2 && tp->max_sndwnd > 0)
goto send;
- if (SEQ_LT(tp->snd_nxt, tp->snd_max))
+ if (SEQ_LT(tp->snd_nxt, tp->snd_max)) /* retransmit case */
goto send;
}
@@ -688,6 +700,20 @@
if (win > (long)TCP_MAXWIN << tp->rcv_scale)
win = (long)TCP_MAXWIN << tp->rcv_scale;
th->th_win = htons((u_short) (win>>tp->rcv_scale));
+
+ /*
+ * Adjust the RXWIN0SENT flag - indicate that we have advertised
+ * a 0 window. This may cause the remote transmitter to stall. This
+ * flag tells soreceive() to disable delayed acknowledgements when
+ * draining the buffer. This can occur if the receiver is attempting
+ * to read more data then can be buffered prior to transmitting on
+ * the connection.
+ */
+ if (win == 0)
+ tp->t_flags |= TF_RXWIN0SENT;
+ else
+ tp->t_flags &= ~TF_RXWIN0SENT;
+
if (SEQ_GT(tp->snd_up, tp->snd_nxt)) {
th->th_urp = htons((u_short)(tp->snd_up - tp->snd_nxt));
th->th_flags |= TH_URG;
Index: netinet/tcp_var.h
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_var.h,v
retrieving revision 1.56.2.8
diff -u -r1.56.2.8 tcp_var.h
--- netinet/tcp_var.h 2001/08/22 00:59:13 1.56.2.8
+++ netinet/tcp_var.h 2001/12/01 21:40:46
@@ -95,6 +95,7 @@
#define TF_SENDCCNEW 0x08000 /* send CCnew instead of CC in SYN */
#define TF_MORETOCOME 0x10000 /* More data to be appended to sock */
#define TF_LQ_OVERFLOW 0x20000 /* listen queue overflow */
+#define TF_RXWIN0SENT 0x40000 /* sent a receiver win 0 in response */
int t_force; /* 1 if forcing out a byte */
tcp_seq snd_una; /* send unacknowledged */
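
For readers who want to experiment with the sender silly window
avoidance rules outside the kernel, the checks in the patched
tcp_output.c block can be modelled as a pure function. This is a sketch
under stated assumptions: the struct and its field names are invented,
the retransmit test SEQ_LT(snd_nxt, snd_max) is reduced to a boolean,
and the function covers only this one block, not the other reasons
tcp_output() may have to send.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the values the block looks at. */
struct ex_send_state {
	long len;		/* bytes we could send now */
	long off;		/* offset of that data in the send buffer */
	long sb_cc;		/* bytes queued in the send buffer */
	long maxseg;		/* maximum segment size */
	long max_sndwnd;	/* largest window the peer has offered */
	bool idle;		/* nothing outstanding on the connection */
	bool nodelay;		/* TF_NODELAY */
	bool nopush;		/* TF_NOPUSH */
	bool moretocome;	/* TF_MORETOCOME */
	bool force;		/* t_force, typically the persist timer */
	bool retransmitting;	/* stands in for SEQ_LT(snd_nxt, snd_max) */
};

/* Would the silly window avoidance block alone decide to send? */
bool
ex_swa_should_send(const struct ex_send_state *s)
{
	if (s->len == 0)
		return false;		/* block only applies to data */
	if (s->len == s->maxseg)
		return true;		/* full segment */
	if (!s->moretocome && (s->idle || s->nodelay) &&
	    s->len + s->off >= s->sb_cc && !s->nopush)
		return true;		/* last of a write, idle or NODELAY */
	if (s->force)
		return true;		/* typically the timeout case */
	if (s->max_sndwnd > 0 && s->len >= s->max_sndwnd / 2)
		return true;		/* at least half the peer's window */
	if (s->retransmitting)
		return true;		/* retransmit case */
	return false;
}
```

With this model a 4 byte tail on a busy connection is held back unless
NODELAY is set, which is why the missing NODELAY propagation on
accept() matters so much for tbench.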
