Date: Sat, 16 May 1998 19:40:02 -0700 (PDT)
From: Bill Fenner <fenner@parc.xerox.com>
To: freebsd-bugs@FreeBSD.ORG
Subject: Re: bin/6646: dump(8) using remote tape drive is too slow
Message-ID: <199805170240.TAA18475@freefall.freebsd.org>

The following reply was made to PR bin/6646; it has been noted by GNATS.

From: Bill Fenner <fenner@parc.xerox.com>
To: wataru-s@mfeed.ad.jp
Cc: freebsd-gnats-submit@freebsd.org
Subject: Re: bin/6646: dump(8) using remote tape drive is too slow
Date: Sat, 16 May 1998 19:36:20 PDT

This is an interaction between ACK-every-other, delayed ACKs, and
Silly Window Syndrome avoidance. A fix which is better than turning
down the MSS is to uncomment the TCP_NODELAY code; this turns off the
sender side of SWS avoidance.

The gory details:

First, a bit about the rdump protocol. It's a request/result protocol,
with requests that look like "W10240\n" (meaning "write 10240 bytes"),
followed by the 10240 bytes to write. The remote tape daemon then
responds with the result, normally something like "A10240\n" (i.e. I
did your write and it completed successfully by returning 10240).

This isn't TCP's "normal" bulk transfer mode, in which the sender just
sends and keeps sending; there are round-trips to perform the
request/result. The client sends a request including the data, and the
server waits for all of the data to arrive before sending the reply.

At an MSS of 1440, 10240 is 7 and 1/9 packets: 7 full-sized packets
and one 160-byte packet. The 7-byte request and the 7 full-sized data
packets all go out fine, since SWS avoidance always allows full-sized
packets:

15:27:37.799556 mango.830 > bigburnt.cmd: P 30780:30787(7) ack 26 win 17255 <nop,nop,timestamp 168322 1198654,nop,nop,cc 496> (DF) [tos 0x8]
15:27:37.799608 mango.830 > bigburnt.cmd: . 30787:32227(1440) ack 26 win 17255 <nop,nop,timestamp 168322 1198654,nop,nop,cc 496> (DF) [tos 0x8]
15:27:37.799650 mango.830 > bigburnt.cmd: . 32227:33667(1440) ack 26 win 17255 <nop,nop,timestamp 168322 1198654,nop,nop,cc 496> (DF) [tos 0x8]
15:27:37.799697 mango.830 > bigburnt.cmd: . 33667:35107(1440) ack 26 win 17255 <nop,nop,timestamp 168322 1198654,nop,nop,cc 496> (DF) [tos 0x8]
15:27:37.799709 mango.830 > bigburnt.cmd: . 35107:36547(1440) ack 26 win 17255 <nop,nop,timestamp 168322 1198654,nop,nop,cc 496> (DF) [tos 0x8]
15:27:37.799739 mango.830 > bigburnt.cmd: . 36547:37987(1440) ack 26 win 17255 <nop,nop,timestamp 168322 1198654,nop,nop,cc 496> (DF) [tos 0x8]
15:27:37.799768 mango.830 > bigburnt.cmd: . 37987:39427(1440) ack 26 win 17255 <nop,nop,timestamp 168322 1198654,nop,nop,cc 496> (DF) [tos 0x8]
15:27:37.799779 mango.830 > bigburnt.cmd: . 39427:40867(1440) ack 26 win 17255 <nop,nop,timestamp 168322 1198654,nop,nop,cc 496> (DF) [tos 0x8]

Now we get back 3 ACKs, since bigburnt is ACK'ing every other packet:

15:27:37.803520 bigburnt.cmd > mango.830: . ack 33667 win 10240 <nop,nop,timestamp 1198654 168322,nop,nop,cc 2548> (DF)
15:27:37.808577 bigburnt.cmd > mango.830: . ack 36547 win 10240 <nop,nop,timestamp 1198654 168322,nop,nop,cc 2548> (DF)
15:27:37.808657 bigburnt.cmd > mango.830: . ack 39427 win 10240 <nop,nop,timestamp 1198654 168322,nop,nop,cc 2548> (DF)

Now, SWS avoidance tells us that we still may not send our 160-byte
packet; it doesn't fulfill any of the 3 rules:

1. It's an MSS-sized segment (not true)
2. The segment is larger than half the receiver's buffer (not true)
3. All outstanding data is ACK'd (not true) and this is all the data
   in our send buffer (true)

We have to wait for one of these conditions to become true before we
may send the 160-byte packet.

After 200ms, bigburnt's delayed-ACK timer fires and it transmits the
ACK for the 7th packet (this is part of ACK-every-other). At this
time, condition (3) becomes true (all outstanding data is now ACK'd),
so we may send our 160-byte packet.

15:27:37.998895 bigburnt.cmd > mango.830: . ack 40867 win 10240 <nop,nop,timestamp 1198654 168322,nop,nop,cc 2548> (DF)
15:27:37.998931 mango.830 > bigburnt.cmd: P 40867:41027(160) ack 26 win 17255 <nop,nop,timestamp 168323 1198654,nop,nop,cc 496> (DF) [tos 0x8]

Finally, the server sends us our result:

15:27:37.999335 bigburnt.cmd > mango.830: P 26:33(7) ack 41027 win 10240 <nop,nop,timestamp 1198654 168323,nop,nop,cc 2548> (DF)

The 200ms delay on every 10240-byte write (7 full-sized packets) is
what's killing performance. This problem was much less severe before
ACK-every-other was introduced; if every packet is ACK'd, the delay is
reduced from a delayed-ACK timeout to the RTT of the connection.

This doesn't happen when the MSS is 1024, because the 10240-byte
buffer size is evenly divisible by the MSS, so rule (1) is always
true. Turning the MSS down isn't a particularly great workaround,
since there's no way to guarantee that the MSS will be as large as
1024, and setting the MSS to 1024 when it can be larger hurts
performance.

Setting TCP_NODELAY on the socket removes the first half of rule (3)
of SWS avoidance, meaning that the sender will always send a segment
when it is all of the data in the socket buffer.

I'm not sure what I think the TCP solution is; applications shouldn't
have to set TCP_NODELAY just because (blocksize div MSS) % 2 != 0.
(Substitute N for 2 for "ACK-every-N" peers.) I think the answer is
that condition (3) of SWS avoidance has to be modified to take into
account the other end's ACK frequency.

  Bill
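
A rough C sketch of the reasoning above, for readers who want to check the arithmetic or try the workaround. The function names (tail_segment_len, sws_may_send, enable_nodelay) are hypothetical and are not taken from dump(8) or from the kernel. The SWS check is a simplified model of the three rules as stated in this message, not the real tcp_output() logic. Only setsockopt() with TCP_NODELAY is a real API call here.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Length of the short final segment when `block` bytes are cut into
 * MSS-sized pieces; 0 means the block divides evenly (no short tail). */
size_t tail_segment_len(size_t block, size_t mss)
{
    return block % mss;
}

/* Simplified model of the three sender-side SWS rules from the message.
 * Assumption: `nodelay` drops only the "all outstanding data is ACK'd"
 * half of rule 3, as the message describes. Returns 1 if the segment
 * may be sent now, 0 if the sender must wait. */
int sws_may_send(size_t seg_len, size_t mss, size_t peer_buf,
                 size_t unacked, int empties_send_buf, int nodelay)
{
    if (seg_len >= mss)                 /* rule 1: full-sized segment */
        return 1;
    if (seg_len > peer_buf / 2)         /* rule 2: more than half the peer's buffer */
        return 1;
    if ((nodelay || unacked == 0) && empties_send_buf)
        return 1;                       /* rule 3 */
    return 0;
}

/* Turn on TCP_NODELAY for a TCP socket.
 * Returns 0 on success, -1 on failure with errno set by setsockopt(). */
int enable_nodelay(int fd)
{
    int on = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}
```

With the numbers from the trace, tail_segment_len(10240, 1440) is 160 and tail_segment_len(10240, 1024) is 0. Just before bigburnt's delayed ACK arrives, 1440 bytes are still unacknowledged, so sws_may_send(160, 1440, 10240, 1440, 1, 0) returns 0; the same call with nodelay set to 1 returns 1.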