Date: Sun, 2 Dec 2001 00:10:53 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
To: Matthew Dillon <dillon@apollo.backplane.com>
Cc: Richard Sharpe <sharpe@ns.aus.com>, freebsd-hackers@FreeBSD.ORG
Subject: Patch #3 (TCP / Linux / Performance)
Message-ID: <200112020810.fB28Arr77757@apollo.backplane.com>
References: <20011128153817.T61580@monorchid.lemis.com>
    <15364.38174.938500.946169@caddis.yogotech.com>
    <20011128104629.A43642@walton.maths.tcd.ie>
    <5.1.0.14.1.20011130181236.00a80160@postamt1.charite.de>
    <200111302047.fAUKlT811090@apollo.backplane.com>
    <200111302130.fAULUU324648@apollo.backplane.com>
    <3C08CF9D.2030109@ns.aus.com>
    <200112012138.fB1LcG837063@apollo.backplane.com>

    I've fixed a couple of additional problems.

    * tbench assumes that accept() propagates the NODELAY TCP option to
      the accepted socket.  It doesn't in FreeBSD.  Er, it didn't in
      FreeBSD... my patch fixes this.

    * If the transmitter sees a 0 window it stalls waiting for an ack.
      However, if delayed acks are turned on, the receiver will not
      acknowledge a drain of the buffer immediately; it will delay the
      ack.  This causes severe problems on localhost connections.

    I've included my patch as it currently stands.  This patch is against
    -stable.  With this patch tbench should work properly with delayed
    acks turned on (as well as newreno).

    There are still a couple of unresolved issues.  I noticed that when
    connecting locally TCP is non-optimal... when sending a 4100 byte
    data block it sends two 1460 byte packets (maxseg), then one 1176
    byte packet and one 4 byte packet.  The 1176 byte packet is sent in
    response to a received ack, causing the last bit of info to be
    written out using a small packet.  This only occurs on localhost
    connections due to the way the stack works.

    I will be committing these to -current now and to -stable tomorrow.

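    For applications that cannot rely on the accepted socket inheriting
    NODELAY from the listener, a minimal userland sketch of the usual
    workaround is to set the option explicitly after accept().  The helper
    names (set_nodelay, get_nodelay, accept_nodelay) are made up for this
    illustration and are not part of tbench or of the patch; only the
    standard socket calls are real.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: turn on TCP_NODELAY for fd.  Returns 0 or -1. */
int
set_nodelay(int fd)
{
	int one = 1;

	return (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)));
}

/* Hypothetical helper: 1 if TCP_NODELAY is set, 0 if not, -1 on error. */
int
get_nodelay(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) < 0)
		return (-1);
	return (val != 0);
}

/*
 * Hypothetical wrapper: accept a connection and set TCP_NODELAY on the
 * new socket explicitly, rather than assuming the listener's setting
 * was copied to it.  Returns the new descriptor or -1.
 */
int
accept_nodelay(int lfd)
{
	int fd = accept(lfd, NULL, NULL);

	if (fd < 0)
		return (-1);
	if (set_nodelay(fd) < 0) {
		close(fd);
		return (-1);
	}
	return (fd);
}
```

    A server would call accept_nodelay(listen_fd) in place of accept().
    This costs one extra system call per connection but behaves the same
    whether or not the kernel propagates the option.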
    tbench results:

    test1 (from test1)      - uses TCP's 16K receive & xmit buffers
    localhost (from test1)  - uses localhost's 48K buffers
    test2 (from test1)      - uses TCP's 16K receive & xmit buffers
                              (100BaseTX full duplex switch)

    delayed acks turned on (default)
    new reno turned on (default)

    ./tbench 1 test1
    Throughput 23.3951 MB/sec (NB=29.2439 MB/sec  233.951 MBit/sec)  1 procs
    ./tbench 1 localhost
    Throughput 29.6299 MB/sec (NB=37.0374 MB/sec  296.299 MBit/sec)  1 procs
    ./tbench 2 localhost
    Throughput 42.963 MB/sec (NB=53.7038 MB/sec  429.63 MBit/sec)  2 procs
    ./tbench 3 localhost
    Throughput 43.9328 MB/sec (NB=54.9161 MB/sec  439.328 MBit/sec)  3 procs
    ./tbench 1 test2
    Throughput 6.43315 MB/sec (NB=8.04144 MB/sec  64.3315 MBit/sec)  1 procs
    ./tbench 2 test2
    Throughput 8.94636 MB/sec (NB=11.183 MB/sec  89.4636 MBit/sec)  2 procs
    ./tbench 3 test2
    Throughput 9.82137 MB/sec (NB=12.2767 MB/sec  98.2137 MBit/sec)  3 procs

    With delayed acks turned off:

    ./tbench 1 test1
    Throughput 19.8444 MB/sec (NB=24.8055 MB/sec  198.444 MBit/sec)  1 procs
    ./tbench 1 localhost
    Throughput 26.1442 MB/sec (NB=32.6802 MB/sec  261.442 MBit/sec)  1 procs
    ./tbench 2 localhost
    Throughput 37.1861 MB/sec (NB=46.4826 MB/sec  371.861 MBit/sec)  2 procs
    ./tbench 3 localhost
    Throughput 37.5582 MB/sec (NB=46.9477 MB/sec  375.582 MBit/sec)  3 procs
    ./tbench 1 test2
    Throughput 6.32798 MB/sec (NB=7.90998 MB/sec  63.2798 MBit/sec)  1 procs
    ./tbench 2 test2
    Throughput 8.4896 MB/sec (NB=10.612 MB/sec  84.896 MBit/sec)  2 procs
    ./tbench 3 test2
    Throughput 9.57453 MB/sec (NB=11.9682 MB/sec  95.7453 MBit/sec)  3 procs

                                                -Matt

Index: netinet/tcp_input.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_input.c,v
retrieving revision 1.107.2.18
diff -u -r1.107.2.18 tcp_input.c
--- netinet/tcp_input.c    2001/11/12 22:11:24    1.107.2.18
+++ netinet/tcp_input.c    2001/12/02 07:47:01
@@ -158,10 +158,15 @@
 #endif

 /*
- * Indicate whether this ack should be delayed.
+ * Indicate whether this ack should be delayed.  We can delay the ack if
+ *    - delayed acks are enabled and
+ *    - there is no delayed ack timer in progress and
+ *    - our last ack wasn't a 0-sized window.  We never want to delay
+ *      the ack that opens up a 0-sized window.
  */
 #define DELAY_ACK(tp) \
-    (tcp_delack_enabled && !callout_pending(tp->tt_delack))
+    (tcp_delack_enabled && !callout_pending(tp->tt_delack) && \
+    (tp->t_flags & TF_RXWIN0SENT) == 0)

 static int
 tcp_reass(tp, th, tlenp, m)
@@ -840,7 +845,7 @@
 #endif
         tp = intotcpcb(inp);
         tp->t_state = TCPS_LISTEN;
-        tp->t_flags |= tp0->t_flags & (TF_NOPUSH|TF_NOOPT);
+        tp->t_flags |= tp0->t_flags & (TF_NOPUSH|TF_NOOPT|TF_NODELAY);

         /* Compute proper scaling value from buffer space */
         while (tp->request_r_scale < TCP_MAX_WINSHIFT &&
Index: netinet/tcp_output.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_output.c,v
retrieving revision 1.39.2.11
diff -u -r1.39.2.11 tcp_output.c
--- netinet/tcp_output.c    2001/11/30 21:34:28    1.39.2.11
+++ netinet/tcp_output.c    2001/12/02 07:37:29
@@ -116,7 +116,9 @@
     u_char opt[TCP_MAXOLEN];
     unsigned ipoptlen, optlen, hdrlen;
     int idle, sendalot;
+#if 0
     int maxburst = TCP_MAXBURST;
+#endif
     struct rmxp_tao *taop;
     struct rmxp_tao tao_noncached;
 #ifdef INET6
@@ -268,28 +270,38 @@
     win = sbspace(&so->so_rcv);

     /*
-     * Sender silly window avoidance.  If connection is idle
-     * and can send all data, a maximum segment,
-     * at least a maximum default-size segment do it,
-     * or are forced, do it; otherwise don't bother.
-     * If peer's buffer is tiny, then send
-     * when window is at least half open.
-     * If retransmitting (possibly after persist timer forced us
-     * to send into a small window), then must resend.
+     * Sender silly window avoidance.  We transmit under the following
+     * conditions when len is non-zero:
+     *
+     *    - We have a full segment
+     *    - This is the last buffer in a write()/send() and we are
+     *      either idle or running NODELAY
+     *    - we've timed out (e.g. persist timer)
+     *    - we have more then 1/2 the maximum send window's worth of
+     *      data (receiver may be limited the window size)
+     *    - we need to retransmit
      */
     if (len) {
         if (len == tp->t_maxseg)
             goto send;
-        if (!(tp->t_flags & TF_MORETOCOME) &&
-            (idle || tp->t_flags & TF_NODELAY) &&
-            (tp->t_flags & TF_NOPUSH) == 0 &&
-            len + off >= so->so_snd.sb_cc)
+        /*
+         * NOTE! on localhost connections an 'ack' from the remote
+         * end may occur synchronously with the output and cause
+         * us to flush a buffer queued with moretocome.  XXX
+         *
+         * note: the len + off check is almost certainly unnecessary.
+         */
+        if (!(tp->t_flags & TF_MORETOCOME) &&    /* normal case */
+            (idle || (tp->t_flags & TF_NODELAY)) &&
+            len + off >= so->so_snd.sb_cc &&
+            (tp->t_flags & TF_NOPUSH) == 0) {
             goto send;
-        if (tp->t_force)
+        }
+        if (tp->t_force)            /* typ. timeout case */
             goto send;
         if (len >= tp->max_sndwnd / 2 && tp->max_sndwnd > 0)
             goto send;
-        if (SEQ_LT(tp->snd_nxt, tp->snd_max))
+        if (SEQ_LT(tp->snd_nxt, tp->snd_max))    /* retransmit case */
             goto send;
     }

@@ -688,6 +700,20 @@
     if (win > (long)TCP_MAXWIN << tp->rcv_scale)
         win = (long)TCP_MAXWIN << tp->rcv_scale;
     th->th_win = htons((u_short) (win>>tp->rcv_scale));
+
+    /*
+     * Adjust the RXWIN0SENT flag - indicate that we have advertised
+     * a 0 window.  This may cause the remote transmitter to stall.  This
+     * flag tells soreceive() to disable delayed acknowledgements when
+     * draining the buffer.  This can occur if the receiver is attempting
+     * to read more data then can be buffered prior to transmitting on
+     * the connection.
+     */
+    if (win == 0)
+        tp->t_flags |= TF_RXWIN0SENT;
+    else
+        tp->t_flags &= ~TF_RXWIN0SENT;
+
     if (SEQ_GT(tp->snd_up, tp->snd_nxt)) {
         th->th_urp = htons((u_short)(tp->snd_up - tp->snd_nxt));
         th->th_flags |= TH_URG;
Index: netinet/tcp_var.h
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_var.h,v
retrieving revision 1.56.2.8
diff -u -r1.56.2.8 tcp_var.h
--- netinet/tcp_var.h    2001/08/22 00:59:13    1.56.2.8
+++ netinet/tcp_var.h    2001/12/01 21:40:46
@@ -95,6 +95,7 @@
 #define TF_SENDCCNEW    0x08000     /* send CCnew instead of CC in SYN */
 #define TF_MORETOCOME   0x10000     /* More data to be appended to sock */
 #define TF_LQ_OVERFLOW  0x20000     /* listen queue overflow */
+#define TF_RXWIN0SENT   0x40000     /* sent a receiver win 0 in response */
     int     t_force;                /* 1 if forcing out a byte */

     tcp_seq snd_una;                /* send unacknowledged */
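
    To make the interaction between the RXWIN0SENT flag and DELAY_ACK()
    easier to follow, here is a small userland model of the two pieces of
    logic from the patch.  It is only a sketch: struct model_tcpcb, the
    delack_timer_pending field and both function names are invented for
    illustration and stand in for the kernel's struct tcpcb,
    callout_pending(tp->tt_delack) and the code paths in tcp_output.c and
    tcp_input.c.  Only the flag value is taken from the patch.

```c
#include <stdbool.h>

#define TF_RXWIN0SENT	0x40000u	/* same bit value as in the patch */

/* Hypothetical stand-in for the relevant parts of struct tcpcb. */
struct model_tcpcb {
	unsigned int t_flags;
	bool delack_timer_pending;	/* models callout_pending(tt_delack) */
};

/*
 * Models the tcp_output.c hunk: whenever a segment goes out, remember
 * whether the window it advertised was zero.
 */
void
model_advertise_window(struct model_tcpcb *tp, long win)
{
	if (win == 0)
		tp->t_flags |= TF_RXWIN0SENT;
	else
		tp->t_flags &= ~TF_RXWIN0SENT;
}

/*
 * Models the patched DELAY_ACK() macro: an ack may be delayed only if
 * delayed acks are enabled, no delayed-ack timer is already pending,
 * and the last window we advertised was not zero.
 */
bool
model_delay_ack(const struct model_tcpcb *tp, bool delack_enabled)
{
	return (delack_enabled && !tp->delack_timer_pending &&
	    (tp->t_flags & TF_RXWIN0SENT) == 0);
}
```

    In this model, once a 0 window has been advertised,
    model_delay_ack() returns false, so the ack that reopens the window
    goes out immediately.  Sending that ack advertises a non-zero window,
    which clears the flag and lets later acks be delayed again.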