Date:      Thu, 6 Dec 2001 11:47:25 -0600
From:      D J Hawkey Jr <hawkeyd@visi.com>
To:        hackers@freebsd.org
Subject:   FreeBSD performs worse that Linux - Patches #2 & #3
Message-ID:  <20011206114725.A836@sheol.localdomain>

Hello all.

I read with interest (and fair ignorance ;-) ) the thread about delayed
ACKs in the TCP/IP stack.

Looking at the results of tbench, it looked like something I wanted in
my 4.2 kernel. So I patched my kernel accordingly, and ran the tests:

---8<---

Pre-patch:

[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 1 localhost
Throughput 1.15675 MB/sec (NB=1.44593 MB/sec  11.5675 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 2 localhost
Throughput 2.18475 MB/sec (NB=2.73094 MB/sec  21.8475 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 3 localhost
Throughput 3.20828 MB/sec (NB=4.01035 MB/sec  32.0828 MBit/sec)

[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 1 sheol    
Throughput 1.14315 MB/sec (NB=1.42894 MB/sec  11.4315 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 2 sheol
Throughput 2.12477 MB/sec (NB=2.65596 MB/sec  21.2477 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 3 sheol
Throughput 3.16156 MB/sec (NB=3.95195 MB/sec  31.6156 MBit/sec)

Post-patch:

[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 1 localhost
Throughput 13.8458 MB/sec (NB=17.3073 MB/sec  138.458 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 2 localhost
Throughput 12.8562 MB/sec (NB=16.0703 MB/sec  128.562 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 3 localhost
Throughput 12.1043 MB/sec (NB=15.1304 MB/sec  121.043 MBit/sec)

[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 1 sheol    
Throughput 9.62885 MB/sec (NB=12.0361 MB/sec  96.2885 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 2 sheol
Throughput 8.7068 MB/sec (NB=10.8835 MB/sec  87.068 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 3 sheol
Throughput 8.89676 MB/sec (NB=11.1209 MB/sec  88.9676 MBit/sec)

--->8---
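
As a rough way to read these figures, the ratio of post-patch to pre-patch
throughput can be computed directly. The helper below is only an
illustration; the name speedup() is made up for this note and is not part
of tbench or the patches.

```c
#include <assert.h>

/*
 * Ratio of post-patch to pre-patch throughput.  Both arguments are the
 * "Throughput ... MB/sec" figures printed by tbench.  speedup() is a
 * hypothetical helper name used only for this illustration.
 */
static double
speedup(double pre_mb_s, double post_mb_s)
{
	assert(pre_mb_s > 0.0);		/* avoid dividing by zero */
	return (post_mb_s / pre_mb_s);
}
```

With the numbers above, speedup(1.15675, 13.8458) comes out just under 12
for one client on localhost, while speedup(3.16156, 8.89676) is roughly 2.8
for three clients via sheol, so the gain shrinks as the client count grows.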

I didn't bother testing through my 100Mb switch, since the machines on the
other side only have 10Mb NICs. Results were similar going to the "other"
NIC in this box (it's my NAT/FW/GW).

Machine particulars:

  FreeBSD sheol.localdomain 4.2-RELEASE FreeBSD 4.2-RELEASE #33: Thu Dec  6 10:20:08 CST 2001     root@sheol.localdomain:/usr/src/sys/compile/SHEOL  i386


  Copyright (c) 1992-2000 The FreeBSD Project.
  Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
  FreeBSD 4.2-RELEASE #33: Thu Dec  6 10:20:08 CST 2001
    root@sheol.localdomain:/usr/src/sys/compile/SHEOL
  Timecounter "i8254"  frequency 1193182 Hz
  CPU: Pentium III/Pentium III Xeon/Celeron (764.35-MHz 686-class CPU)
    Origin = "GenuineIntel"  Id = 0x686  Stepping = 6
    Features=0x383f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE>
  ...
  dc0: <ADMtek AN985 10/100BaseTX> port 0x3000-0x30ff mem 0xf4100000-0xf41003ff irq 11 at device 13.0 on pci1


  dc0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        inet 192.168.16.2 netmask 0xffffff00 broadcast 192.168.16.255
        inet6 fe80::203:6dff:fe11:63d2%dc0 prefixlen 64 scopeid 0x1 
        ether 00:03:6d:11:63:d2 
        media: autoselect (100baseTX <full-duplex>) status: active
        supported media: autoselect 100baseTX <full-duplex> 100baseTX 10baseT/UTP <full-duplex> 10baseT/UTP none


If Matt or any other qualified hackers can make the time to double-check
my patches, I'd appreciate it. Matt's first patch didn't apply (no NewReno
in 4.2REL), and the third patch (to tcp_input.c) required a little more work
(I changed tests for 'tcp_delack_enabled' to 'DELAY_ACK()'). I'd just like
some assurance I got it right.
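
For anyone checking the substitution, here is a userland sketch of what the
DELAY_ACK() test is meant to decide. It is a model only, not kernel code:
struct mock_tcpcb, delack_pending (standing in for
callout_pending(tp->tt_delack)), delay_ack() and note_advertised_window()
are names invented for this sketch. Only TF_RXWIN0SENT and its value come
from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	TF_RXWIN0SENT	0x40000	/* value from the tcp_var.h hunk */

/* Hypothetical stand-in for struct tcpcb; only what the test reads. */
struct mock_tcpcb {
	int	t_flags;
	bool	delack_pending;	/* models callout_pending(tp->tt_delack) */
};

static bool tcp_delack_enabled = true;	/* models the kernel global */

/* Same three conditions as the DELAY_ACK(tp) macro in the patch. */
static bool
delay_ack(const struct mock_tcpcb *tp)
{
	assert(tp != NULL);
	return (tcp_delack_enabled && !tp->delack_pending &&
	    (tp->t_flags & TF_RXWIN0SENT) == 0);
}

/* Mirrors the tcp_output.c hunk: remember whether we advertised win 0. */
static void
note_advertised_window(struct mock_tcpcb *tp, long win)
{
	assert(tp != NULL);
	if (win == 0)
		tp->t_flags |= TF_RXWIN0SENT;
	else
		tp->t_flags &= ~TF_RXWIN0SENT;
}
```

The point of the model: once a zero window has been advertised, delay_ack()
returns false until a non-zero window goes out, so the ACK that reopens the
window is never held back by the delayed-ACK timer.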

All in all, kudos to Matt for this. In day-to-day use, I can "feel" the
improvement, and everything seems as solid as ever!

Dave

-- 
  ______________________                         ______________________
  \__________________   \    D. J. HAWKEY JR.   /   __________________/
     \________________/\     hawkeyd@visi.com    /\________________/
                      http://www.visi.com/~hawkeyd/

---8<---

--- /usr/src/sys/kern/uipc_socket.c.42REL	Fri Nov 17 13:47:27 2000
+++ /usr/src/sys/kern/uipc_socket.c	Thu Dec  6 07:26:28 2001
@@ -913,6 +913,14 @@
 		    !sosendallatonce(so) && !nextrecord) {
 			if (so->so_error || so->so_state & SS_CANTRCVMORE)
 				break;
+			/*
+			 * The window might have closed to zero, make
+			 * sure we send an ack now that we've drained
+			 * the buffer or we might end up blocking until
+			 * the idle takes over (5 seconds).
+			 */
+			if (pr->pr_flags & PR_WANTRCVD && so->so_pcb)
+				(*pr->pr_usrreqs->pru_rcvd)(so, flags);
 			error = sbwait(&so->so_rcv);
 			if (error) {
 				sbunlock(&so->so_rcv);


--- /usr/src/sys/netinet/tcp_input.c.42REL	Wed Aug 16 01:14:23 2000
+++ /usr/src/sys/netinet/tcp_input.c	Thu Dec  6 10:05:53 2001
@@ -164,6 +164,17 @@
 #endif
 
 /*
+ * Indicate whether this ack should be delayed.  We can delay the ack if
+ *      - delayed acks are enabled and
+ *      - there is no delayed ack timer in progress and
+ *      - our last ack wasn't a 0-sized window.  We never want to delay
+ *        the ack that opens up a 0-sized window.
+ */
+#define DELAY_ACK(tp) \
+	(tcp_delack_enabled && !callout_pending(tp->tt_delack) && \
+	(tp->t_flags & TF_RXWIN0SENT) == 0)
+
+/*
  * Insert segment which inludes th into reassembly queue of tcp with
  * control block tp.  Return TH_FIN if reassembly now includes
  * a segment with FIN.  The macro form does the common case inline
@@ -177,7 +188,7 @@
 	if ((th)->th_seq == (tp)->rcv_nxt && \
 	    LIST_EMPTY(&(tp)->t_segq) && \
 	    (tp)->t_state == TCPS_ESTABLISHED) { \
-		if (tcp_delack_enabled) \
+		if (DELAY_ACK(tp)) \
 			callout_reset(tp->tt_delack, tcp_delacktime, \
 			    tcp_timer_delack, tp); \
 		else \
@@ -817,7 +828,7 @@
 #endif
 			tp = intotcpcb(inp);
 			tp->t_state = TCPS_LISTEN;
-			tp->t_flags |= tp0->t_flags & (TF_NOPUSH|TF_NOOPT);
+			tp->t_flags |= tp0->t_flags & (TF_NOPUSH|TF_NOOPT|TF_NODELAY);
 
 			/* Compute proper scaling value from buffer space */
 			while (tp->request_r_scale < TCP_MAX_WINSHIFT &&
@@ -961,7 +972,7 @@
 			m_adj(m, drop_hdrlen);	/* delayed header drop */
 			sbappend(&so->so_rcv, m);
 			sorwakeup(so);
-			if (tcp_delack_enabled) {
+			if (DELAY_ACK(tp)) {
 	                        callout_reset(tp->tt_delack, tcp_delacktime,
 	                            tcp_timer_delack, tp);
 			} else {
@@ -1144,7 +1155,7 @@
 			 * segment.  Otherwise must send ACK now in case
 			 * the other side is slow starting.
 			 */
-			if (tcp_delack_enabled && ((thflags & TH_FIN) ||
+			if (DELAY_ACK(tp) && ((thflags & TH_FIN) ||
 			    (tlen != 0 &&
 #ifdef INET6
 			      ((isipv6 && in6_localaddr(&inp->in6p_faddr))
@@ -1289,7 +1300,7 @@
 			 * If there's data, delay ACK; if there's also a FIN
 			 * ACKNOW will be turned on later.
 			 */
-			if (tcp_delack_enabled && tlen != 0)
+			if (DELAY_ACK(tp) && tlen != 0)
                                 callout_reset(tp->tt_delack, tcp_delacktime,  
                                     tcp_timer_delack, tp);  
 			else
@@ -2117,7 +2128,7 @@
 			 *  Otherwise, since we received a FIN then no
 			 *  more input can be expected, send ACK now.
 			 */
-			if (tcp_delack_enabled && (tp->t_flags & TF_NEEDSYN))
+			if (DELAY_ACK(tp) && (tp->t_flags & TF_NEEDSYN))
                                 callout_reset(tp->tt_delack, tcp_delacktime,  
                                     tcp_timer_delack, tp);  
 			else


--- /usr/src/sys/netinet/tcp_output.c.42REL	Tue Sep 12 23:27:06 2000
+++ /usr/src/sys/netinet/tcp_output.c	Thu Dec  6 10:05:53 2001
@@ -266,28 +266,38 @@
 	win = sbspace(&so->so_rcv);
 
 	/*
-	 * Sender silly window avoidance.  If connection is idle
-	 * and can send all data, a maximum segment,
-	 * at least a maximum default-size segment do it,
-	 * or are forced, do it; otherwise don't bother.
-	 * If peer's buffer is tiny, then send
-	 * when window is at least half open.
-	 * If retransmitting (possibly after persist timer forced us
-	 * to send into a small window), then must resend.
+	 * Sender silly window avoidance.   We transmit under the following
+	 * conditions when len is non-zero:
+	 *
+	 *      - We have a full segment
+	 *      - This is the last buffer in a write()/send() and we are
+	 *        either idle or running NODELAY
+	 *      - we've timed out (e.g. persist timer)
+	 *      - we have more than 1/2 the maximum send window's worth of
+	 *        data (receiver may be limiting the window size)
+	 *      - we need to retransmit
 	 */
 	if (len) {
 		if (len == tp->t_maxseg)
 			goto send;
-		if (!(tp->t_flags & TF_MORETOCOME) &&
-		    (idle || tp->t_flags & TF_NODELAY) &&
-		    (tp->t_flags & TF_NOPUSH) == 0 &&
-		    len + off >= so->so_snd.sb_cc)
+		/*
+		 * NOTE! on localhost connections an 'ack' from the remote
+		 * end may occur synchronously with the output and cause
+		 * us to flush a buffer queued with moretocome.  XXX
+		 *
+		 * note: the len + off check is almost certainly unnecessary.
+		 */
+		if (!(tp->t_flags & TF_MORETOCOME) &&   /* normal case */
+		    (idle || (tp->t_flags & TF_NODELAY)) &&
+		    len + off >= so->so_snd.sb_cc &&
+		    (tp->t_flags & TF_NOPUSH) == 0) {
 			goto send;
-		if (tp->t_force)
+		}
+		if (tp->t_force)			/* typ. timeout case */
 			goto send;
 		if (len >= tp->max_sndwnd / 2 && tp->max_sndwnd > 0)
 			goto send;
-		if (SEQ_LT(tp->snd_nxt, tp->snd_max))
+		if (SEQ_LT(tp->snd_nxt, tp->snd_max))	/* retransmit case */
 			goto send;
 	}
 
@@ -694,6 +704,20 @@
 	if (win > (long)TCP_MAXWIN << tp->rcv_scale)
 		win = (long)TCP_MAXWIN << tp->rcv_scale;
 	th->th_win = htons((u_short) (win>>tp->rcv_scale));
+
+	/*
+	 * Adjust the RXWIN0SENT flag - indicate that we have advertised
+	 * a 0 window.  This may cause the remote transmitter to stall.  This
+	 * flag tells soreceive() to disable delayed acknowledgements when
+	 * draining the buffer.  This can occur if the receiver is attempting
+	 * to read more data than can be buffered prior to transmitting on
+	 * the connection.
+	 */
+	if (win == 0)
+		tp->t_flags |= TF_RXWIN0SENT;
+	else
+		tp->t_flags &= ~TF_RXWIN0SENT;
+
 	if (SEQ_GT(tp->snd_up, tp->snd_nxt)) {
 		th->th_urp = htons((u_short)(tp->snd_up - tp->snd_nxt));
 		th->th_flags |= TH_URG;


--- /usr/src/sys/netinet/tcp_var.h.42REL	Wed Aug 16 01:14:23 2000
+++ /usr/src/sys/netinet/tcp_var.h	Thu Dec  6 10:05:53 2001
@@ -95,6 +95,7 @@
 #define	TF_SENDCCNEW	0x08000		/* send CCnew instead of CC in SYN */
 #define	TF_MORETOCOME	0x10000		/* More data to be appended to sock */
 #define	TF_LQ_OVERFLOW	0x20000		/* listen queue overflow */
+#define	TF_RXWIN0SENT	0x40000		/* sent a receiver win 0 in response */
 	int	t_force;		/* 1 if forcing out a byte */
 
 	tcp_seq	snd_una;		/* send unacknowledged */

--->8---
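
The reordered "if (len)" block in the tcp_output.c hunk is easier to check
when pulled out as a plain function. The following is a userland sketch
under stated assumptions, not the kernel code: struct send_state and
should_send() are invented names, and the boolean fields stand in for the
TF_* flag tests, the idle variable, tp->t_force and the SEQ_LT() retransmit
test. It covers only the non-zero len case; the real tcp_output() goes on
to other checks afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the values the "if (len)" block looks at. */
struct send_state {
	long	len;		/* bytes that could be sent now */
	long	off;		/* offset of that data in the send buffer */
	long	sb_cc;		/* so->so_snd.sb_cc, bytes queued */
	long	maxseg;		/* tp->t_maxseg */
	long	max_sndwnd;	/* tp->max_sndwnd */
	bool	idle;		/* connection idle */
	bool	nodelay;	/* TF_NODELAY set */
	bool	nopush;		/* TF_NOPUSH set */
	bool	moretocome;	/* TF_MORETOCOME set */
	bool	force;		/* tp->t_force, e.g. persist timeout */
	bool	retransmit;	/* SEQ_LT(snd_nxt, snd_max) */
};

/* Returns true where the patched code would "goto send". */
static bool
should_send(const struct send_state *s)
{
	assert(s != NULL);
	if (s->len == 0)
		return (false);		/* outside the scope of this sketch */
	if (s->len == s->maxseg)	/* full segment */
		return (true);
	if (!s->moretocome &&		/* normal case: last buffer of a write */
	    (s->idle || s->nodelay) &&
	    s->len + s->off >= s->sb_cc &&
	    !s->nopush)
		return (true);
	if (s->force)			/* typically the timeout case */
		return (true);
	if (s->len >= s->max_sndwnd / 2 && s->max_sndwnd > 0)
		return (true);
	if (s->retransmit)		/* retransmit case */
		return (true);
	return (false);
}
```

Writing it this way makes the precedence explicit: a small write is held
back when the connection is neither idle nor running NODELAY, or when more
data is known to be coming, unless one of the later conditions applies.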

