Date: Thu, 6 Dec 2001 11:47:25 -0600
From: D J Hawkey Jr <hawkeyd@visi.com>
To: hackers@freebsd.org
Subject: FreeBSD performs worse that Linux - Patches #2 & #3
Message-ID: <20011206114725.A836@sheol.localdomain>
Hello all.
I read with interest (and fair ignorance ;-) ) the thread about delayed
ACKs in the TCP/IP stack.
Looking at the results of tbench, it looked like something I wanted in
my 4.2 kernel. So I patched my kernel accordingly, and ran the tests:
---8<---
Pre-patch:
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 1 localhost
Throughput 1.15675 MB/sec (NB=1.44593 MB/sec 11.5675 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 2 localhost
Throughput 2.18475 MB/sec (NB=2.73094 MB/sec 21.8475 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 3 localhost
Throughput 3.20828 MB/sec (NB=4.01035 MB/sec 32.0828 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 1 sheol
Throughput 1.14315 MB/sec (NB=1.42894 MB/sec 11.4315 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 2 sheol
Throughput 2.12477 MB/sec (NB=2.65596 MB/sec 21.2477 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 3 sheol
Throughput 3.16156 MB/sec (NB=3.95195 MB/sec 31.6156 MBit/sec)
Post-patch:
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 1 localhost
Throughput 13.8458 MB/sec (NB=17.3073 MB/sec 138.458 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 2 localhost
Throughput 12.8562 MB/sec (NB=16.0703 MB/sec 128.562 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 3 localhost
Throughput 12.1043 MB/sec (NB=15.1304 MB/sec 121.043 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 1 sheol
Throughput 9.62885 MB/sec (NB=12.0361 MB/sec 96.2885 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 2 sheol
Throughput 8.7068 MB/sec (NB=10.8835 MB/sec 87.068 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 3 sheol
Throughput 8.89676 MB/sec (NB=11.1209 MB/sec 88.9676 MBit/sec)
--->8---
I didn't bother testing through my 100Mb switch, since there are only
10Mb NICs on the other side. Results were similar going to the "other"
NIC in this box (it's my NAT/FW/GW).
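For a quick sense of scale, the ratio of post-patch to pre-patch throughput can be worked out from the tbench figures quoted above. The following is only a sketch: the `run` struct, the `runs` table, and the `speedup()` and `print_speedups()` helpers are hypothetical names introduced for illustration, and the numbers are copied from the runs in this message.

```c
#include <assert.h>
#include <stdio.h>

/* One tbench run: target host, client count, MB/sec before and after. */
struct run {
    const char *target;
    int clients;
    double pre;   /* pre-patch throughput, MB/sec */
    double post;  /* post-patch throughput, MB/sec */
};

/* Figures copied from the tbench output quoted in this message. */
const struct run runs[] = {
    { "localhost", 1, 1.15675, 13.8458 },
    { "localhost", 2, 2.18475, 12.8562 },
    { "localhost", 3, 3.20828, 12.1043 },
    { "sheol",     1, 1.14315, 9.62885 },
    { "sheol",     2, 2.12477, 8.7068  },
    { "sheol",     3, 3.16156, 8.89676 },
};

/* Ratio of post-patch to pre-patch throughput. */
double speedup(double pre_mb_s, double post_mb_s)
{
    assert(pre_mb_s > 0.0);
    return post_mb_s / pre_mb_s;
}

/* Print one line per run with its speedup factor. */
void print_speedups(void)
{
    size_t i;

    for (i = 0; i < sizeof(runs) / sizeof(runs[0]); i++)
        printf("%-9s %d client(s): %.2fx\n", runs[i].target,
            runs[i].clients, speedup(runs[i].pre, runs[i].post));
}
```

Calling `print_speedups()` from a `main()` lists the six ratios. The single-client localhost case comes out at roughly a factor of twelve, and the gain shrinks as clients are added because the pre-patch numbers scaled with client count while the post-patch numbers did not.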
Machine particulars:
FreeBSD sheol.localdomain 4.2-RELEASE FreeBSD 4.2-RELEASE #33: Thu Dec 6 10:20:08 CST 2001 root@sheol.localdomain:/usr/src/sys/compile/SHEOL i386
Copyright (c) 1992-2000 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 4.2-RELEASE #33: Thu Dec 6 10:20:08 CST 2001
root@sheol.localdomain:/usr/src/sys/compile/SHEOL
Timecounter "i8254" frequency 1193182 Hz
CPU: Pentium III/Pentium III Xeon/Celeron (764.35-MHz 686-class CPU)
Origin = "GenuineIntel" Id = 0x686 Stepping = 6
Features=0x383f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE>
...
dc0: <ADMtek AN985 10/100BaseTX> port 0x3000-0x30ff mem 0xf4100000-0xf41003ff irq 11 at device 13.0 on pci1
dc0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
inet 192.168.16.2 netmask 0xffffff00 broadcast 192.168.16.255
inet6 fe80::203:6dff:fe11:63d2%dc0 prefixlen 64 scopeid 0x1
ether 00:03:6d:11:63:d2
media: autoselect (100baseTX <full-duplex>) status: active
supported media: autoselect 100baseTX <full-duplex> 100baseTX 10baseT/UTP <full-duplex> 10baseT/UTP none
If Matt or any other qualified hackers can make the time to double-check
my patches, I'd appreciate it. Matt's first patch didn't apply (no NewReno
in 4.2REL), and the third patch (to tcp_input.c) required a little more work
(I changed tests for 'tcp_delack_enabled' to 'DELAY_ACK()'). I'd just like
some assurance I got it right.
All in all, kudos to Matt for this. In day-to-day use, I can "feel" the
improvement, and everything seems as solid as ever!
Dave
--
______________________ ______________________
\__________________ \ D. J. HAWKEY JR. / __________________/
\________________/\ hawkeyd@visi.com /\________________/
http://www.visi.com/~hawkeyd/
---8<---
--- /usr/src/sys/kern/uipc_socket.c.42REL Fri Nov 17 13:47:27 2000
+++ /usr/src/sys/kern/uipc_socket.c Thu Dec 6 07:26:28 2001
@@ -913,6 +913,14 @@
!sosendallatonce(so) && !nextrecord) {
if (so->so_error || so->so_state & SS_CANTRCVMORE)
break;
+ /*
+ * The window might have closed to zero, make
+ * sure we send an ack now that we've drained
+ * the buffer or we might end up blocking until
+ * the idle takes over (5 seconds).
+ */
+ if (pr->pr_flags & PR_WANTRCVD && so->so_pcb)
+ (*pr->pr_usrreqs->pru_rcvd)(so, flags);
error = sbwait(&so->so_rcv);
if (error) {
sbunlock(&so->so_rcv);
--- /usr/src/sys/netinet/tcp_input.c.42REL Wed Aug 16 01:14:23 2000
+++ /usr/src/sys/netinet/tcp_input.c Thu Dec 6 10:05:53 2001
@@ -164,6 +164,17 @@
#endif
/*
+ * Indicate whether this ack should be delayed. We can delay the ack if
+ * - delayed acks are enabled and
+ * - there is no delayed ack timer in progress and
+ * - our last ack wasn't a 0-sized window. We never want to delay
+ * the ack that opens up a 0-sized window.
+ */
+#define DELAY_ACK(tp) \
+ (tcp_delack_enabled && !callout_pending(tp->tt_delack) && \
+ (tp->t_flags & TF_RXWIN0SENT) == 0)
+
+/*
* Insert segment which inludes th into reassembly queue of tcp with
* control block tp. Return TH_FIN if reassembly now includes
* a segment with FIN. The macro form does the common case inline
@@ -177,7 +188,7 @@
if ((th)->th_seq == (tp)->rcv_nxt && \
LIST_EMPTY(&(tp)->t_segq) && \
(tp)->t_state == TCPS_ESTABLISHED) { \
- if (tcp_delack_enabled) \
+ if (DELAY_ACK(tp)) \
callout_reset(tp->tt_delack, tcp_delacktime, \
tcp_timer_delack, tp); \
else \
@@ -817,7 +828,7 @@
#endif
tp = intotcpcb(inp);
tp->t_state = TCPS_LISTEN;
- tp->t_flags |= tp0->t_flags & (TF_NOPUSH|TF_NOOPT);
+ tp->t_flags |= tp0->t_flags & (TF_NOPUSH|TF_NOOPT|TF_NODELAY);
/* Compute proper scaling value from buffer space */
while (tp->request_r_scale < TCP_MAX_WINSHIFT &&
@@ -961,7 +972,7 @@
m_adj(m, drop_hdrlen); /* delayed header drop */
sbappend(&so->so_rcv, m);
sorwakeup(so);
- if (tcp_delack_enabled) {
+ if (DELAY_ACK(tp)) {
callout_reset(tp->tt_delack, tcp_delacktime,
tcp_timer_delack, tp);
} else {
@@ -1144,7 +1155,7 @@
* segment. Otherwise must send ACK now in case
* the other side is slow starting.
*/
- if (tcp_delack_enabled && ((thflags & TH_FIN) ||
+ if (DELAY_ACK(tp) && ((thflags & TH_FIN) ||
(tlen != 0 &&
#ifdef INET6
((isipv6 && in6_localaddr(&inp->in6p_faddr))
@@ -1289,7 +1300,7 @@
* If there's data, delay ACK; if there's also a FIN
* ACKNOW will be turned on later.
*/
- if (tcp_delack_enabled && tlen != 0)
+ if (DELAY_ACK(tp) && tlen != 0)
callout_reset(tp->tt_delack, tcp_delacktime,
tcp_timer_delack, tp);
else
@@ -2117,7 +2128,7 @@
* Otherwise, since we received a FIN then no
* more input can be expected, send ACK now.
*/
- if (tcp_delack_enabled && (tp->t_flags & TF_NEEDSYN))
+ if (DELAY_ACK(tp) && (tp->t_flags & TF_NEEDSYN))
callout_reset(tp->tt_delack, tcp_delacktime,
tcp_timer_delack, tp);
else
--- /usr/src/sys/netinet/tcp_output.c.42REL Tue Sep 12 23:27:06 2000
+++ /usr/src/sys/netinet/tcp_output.c Thu Dec 6 10:05:53 2001
@@ -266,28 +266,38 @@
win = sbspace(&so->so_rcv);
/*
- * Sender silly window avoidance. If connection is idle
- * and can send all data, a maximum segment,
- * at least a maximum default-size segment do it,
- * or are forced, do it; otherwise don't bother.
- * If peer's buffer is tiny, then send
- * when window is at least half open.
- * If retransmitting (possibly after persist timer forced us
- * to send into a small window), then must resend.
+ * Sender silly window avoidance. We transmit under the following
+ * conditions when len is non-zero:
+ *
+ * - We have a full segment
+ * - This is the last buffer in a write()/send() and we are
+ * either idle or running NODELAY
+ * - we've timed out (e.g. persist timer)
+ * - we have more then 1/2 the maximum send window's worth of
+ * data (receiver may be limited the window size)
+ * - we need to retransmit
*/
if (len) {
if (len == tp->t_maxseg)
goto send;
- if (!(tp->t_flags & TF_MORETOCOME) &&
- (idle || tp->t_flags & TF_NODELAY) &&
- (tp->t_flags & TF_NOPUSH) == 0 &&
- len + off >= so->so_snd.sb_cc)
+ /*
+ * NOTE! on localhost connections an 'ack' from the remote
+ * end may occur synchronously with the output and cause
+ * us to flush a buffer queued with moretocome. XXX
+ *
+ * note: the len + off check is almost certainly unnecessary.
+ */
+ if (!(tp->t_flags & TF_MORETOCOME) && /* normal case */
+ (idle || (tp->t_flags & TF_NODELAY)) &&
+ len + off >= so->so_snd.sb_cc &&
+ (tp->t_flags & TF_NOPUSH) == 0) {
goto send;
- if (tp->t_force)
+ }
+ if (tp->t_force) /* typ. timeout case */
goto send;
if (len >= tp->max_sndwnd / 2 && tp->max_sndwnd > 0)
goto send;
- if (SEQ_LT(tp->snd_nxt, tp->snd_max))
+ if (SEQ_LT(tp->snd_nxt, tp->snd_max)) /* retransmit case */
goto send;
}
@@ -694,6 +704,20 @@
if (win > (long)TCP_MAXWIN << tp->rcv_scale)
win = (long)TCP_MAXWIN << tp->rcv_scale;
th->th_win = htons((u_short) (win>>tp->rcv_scale));
+
+ /*
+ * Adjust the RXWIN0SENT flag - indicate that we have advertised
+ * a 0 window. This may cause the remote transmitter to stall. This
+ * flag tells soreceive() to disable delayed acknowledgements when
+ * draining the buffer. This can occur if the receiver is attempting
+ * to read more data then can be buffered prior to transmitting on
+ * the connection.
+ */
+ if (win == 0)
+ tp->t_flags |= TF_RXWIN0SENT;
+ else
+ tp->t_flags &= ~TF_RXWIN0SENT;
+
if (SEQ_GT(tp->snd_up, tp->snd_nxt)) {
th->th_urp = htons((u_short)(tp->snd_up - tp->snd_nxt));
th->th_flags |= TH_URG;
--- /usr/src/sys/netinet/tcp_var.h.42REL Wed Aug 16 01:14:23 2000
+++ /usr/src/sys/netinet/tcp_var.h Thu Dec 6 10:05:53 2001
@@ -95,6 +95,7 @@
#define TF_SENDCCNEW 0x08000 /* send CCnew instead of CC in SYN */
#define TF_MORETOCOME 0x10000 /* More data to be appended to sock */
#define TF_LQ_OVERFLOW 0x20000 /* listen queue overflow */
+#define TF_RXWIN0SENT 0x40000 /* sent a receiver win 0 in response */
int t_force; /* 1 if forcing out a byte */
tcp_seq snd_una; /* send unacknowledged */
--->8---
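The interaction between the DELAY_ACK() test and the TF_RXWIN0SENT flag in the patch above can be modelled outside the kernel. This is a user-space sketch only, not kernel code: `struct conn_model`, `should_delay_ack()` and `note_advertised_window()` are hypothetical names, and `delack_pending` merely stands in for `callout_pending(tp->tt_delack)`. The flag value is the one from the tcp_var.h hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TF_RXWIN0SENT 0x40000 /* same value as in the tcp_var.h hunk */

/* Hypothetical stand-in for the few tcpcb fields the patch consults. */
struct conn_model {
    unsigned int t_flags;
    bool delack_pending; /* models callout_pending(tp->tt_delack) */
};

/*
 * Mirrors the DELAY_ACK() test: delay only if delayed ACKs are enabled,
 * no delayed-ACK timer is already pending, and the last window we
 * advertised was not zero.
 */
bool should_delay_ack(const struct conn_model *c, bool delack_enabled)
{
    assert(c != NULL);
    return delack_enabled && !c->delack_pending &&
        (c->t_flags & TF_RXWIN0SENT) == 0;
}

/*
 * Mirrors the tcp_output.c hunk: remember whether the window that was
 * just advertised was zero.
 */
void note_advertised_window(struct conn_model *c, long win)
{
    assert(c != NULL);
    if (win == 0)
        c->t_flags |= TF_RXWIN0SENT;
    else
        c->t_flags &= ~(unsigned int)TF_RXWIN0SENT;
}
```

In this model, once a zero window has been advertised, `should_delay_ack()` returns false until a non-zero window is advertised again, so the ACK that reopens the window goes out immediately rather than waiting on the delayed-ACK timer.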
