Date:      Sat, 16 May 1998 19:40:02 -0700 (PDT)
From:      Bill Fenner <fenner@parc.xerox.com>
To:        freebsd-bugs@FreeBSD.ORG
Subject:   Re: bin/6646: dump(8) using remote tape drive is too slow 
Message-ID:  <199805170240.TAA18475@freefall.freebsd.org>

The following reply was made to PR bin/6646; it has been noted by GNATS.

From: Bill Fenner <fenner@parc.xerox.com>
To: wataru-s@mfeed.ad.jp
Cc: freebsd-gnats-submit@freebsd.org
Subject: Re: bin/6646: dump(8) using remote tape drive is too slow 
Date: Sat, 16 May 1998 19:36:20 PDT

 This is an interaction between ACK-every-other, delayed ACKs, and Silly
 Window Syndrome avoidance.  A fix which is better than turning down the
 MSS is to uncomment the TCP_NODELAY code; this turns off the sender
 side of SWS avoidance.
 
 The gory details:
 
 First, a bit about the rdump protocol.  It's a request/result protocol,
 with requests that look like "W10240\n" (meaning "write 10240 bytes"),
 followed by the 10240 bytes to write.  The remote tape daemon then
 responds with the result, normally something like "A10240\n" (i.e.
 "I did your write, and it completed successfully by returning 10240").
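 
 As an illustration of the framing just described, here is a minimal
 Python sketch.  The helper names (build_write_request, parse_reply) are
 hypothetical; only the "W<count>\n" and "A<count>\n" shapes come from
 the description above, and error replies and other commands are not
 modelled.

```python
def build_write_request(data: bytes) -> bytes:
    """Frame a write request: 'W<count>\\n' followed by the payload."""
    header = f"W{len(data)}\n".encode("ascii")
    return header + data


def parse_reply(line: bytes) -> int:
    """Parse a success reply of the form 'A<count>\\n' and return the count."""
    text = line.decode("ascii")
    if not (text.startswith("A") and text.endswith("\n")):
        raise ValueError(f"unexpected reply: {text!r}")
    return int(text[1:-1])


# One 10240-byte block: a 7-byte header plus the data itself.
request = build_write_request(b"\0" * 10240)
print(len(request))
print(parse_reply(b"A10240\n"))
```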
 
 This isn't TCP's "normal" bulk transfer mode, in which the sender
 just sends and keeps sending; there are round-trips to perform the
 request/result.  The client sends a request including the data, and
 the server waits for all of the data to arrive before sending the
 reply.
 
 At an MSS of 1440, 10240 bytes is 7 1/9 packets: 7 full-sized packets
 (10080 bytes) and one 160-byte packet.  The 7-byte request and the 7
 full-sized data packets all go out fine, since SWS avoidance always
 allows full-sized packets:
 
 15:27:37.799556 mango.830 > bigburnt.cmd: P 30780:30787(7) ack 26 win 17255 <nop,nop,timestamp 168322 1198654,nop,nop,cc 496> (DF) [tos 0x8]
 15:27:37.799608 mango.830 > bigburnt.cmd: . 30787:32227(1440) ack 26 win 17255 <nop,nop,timestamp 168322 1198654,nop,nop,cc 496> (DF) [tos 0x8]
 15:27:37.799650 mango.830 > bigburnt.cmd: . 32227:33667(1440) ack 26 win 17255 <nop,nop,timestamp 168322 1198654,nop,nop,cc 496> (DF) [tos 0x8]
 15:27:37.799697 mango.830 > bigburnt.cmd: . 33667:35107(1440) ack 26 win 17255 <nop,nop,timestamp 168322 1198654,nop,nop,cc 496> (DF) [tos 0x8]
 15:27:37.799709 mango.830 > bigburnt.cmd: . 35107:36547(1440) ack 26 win 17255 <nop,nop,timestamp 168322 1198654,nop,nop,cc 496> (DF) [tos 0x8]
 15:27:37.799739 mango.830 > bigburnt.cmd: . 36547:37987(1440) ack 26 win 17255 <nop,nop,timestamp 168322 1198654,nop,nop,cc 496> (DF) [tos 0x8]
 15:27:37.799768 mango.830 > bigburnt.cmd: . 37987:39427(1440) ack 26 win 17255 <nop,nop,timestamp 168322 1198654,nop,nop,cc 496> (DF) [tos 0x8]
 15:27:37.799779 mango.830 > bigburnt.cmd: . 39427:40867(1440) ack 26 win 17255 <nop,nop,timestamp 168322 1198654,nop,nop,cc 496> (DF) [tos 0x8]
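 
 The split seen in this trace is plain integer division.  A small
 Python sketch (segment_sizes is a hypothetical helper, not anything in
 dump or the kernel) that cuts one block into MSS-sized segments plus
 any short tail:

```python
def segment_sizes(block_size: int, mss: int) -> list[int]:
    """Return the payload sizes of the segments needed for one block."""
    full, tail = divmod(block_size, mss)
    sizes = [mss] * full
    if tail:
        # A leftover smaller than the MSS becomes one short segment.
        sizes.append(tail)
    return sizes


print(segment_sizes(10240, 1440))  # seven full segments and a 160-byte tail
print(segment_sizes(10240, 1024))  # ten full segments, no tail
```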
 
 Now we get back 3 ACK's, since bigburnt is ACK'ing every other packet:
 
 15:27:37.803520 bigburnt.cmd > mango.830: . ack 33667 win 10240 <nop,nop,timestamp 1198654 168322,nop,nop,cc 2548> (DF)
 15:27:37.808577 bigburnt.cmd > mango.830: . ack 36547 win 10240 <nop,nop,timestamp 1198654 168322,nop,nop,cc 2548> (DF)
 15:27:37.808657 bigburnt.cmd > mango.830: . ack 39427 win 10240 <nop,nop,timestamp 1198654 168322,nop,nop,cc 2548> (DF)
 
 Now SWS avoidance says that we still may not send our 160-byte
 packet, because it satisfies none of the 3 rules:
 1. It's an MSS-sized segment (not true)
 2. The segment is larger than half the receiver's buffer (not true)
 3. All outstanding data is ACK'd (not true: the 7th full-sized packet
    is still unacknowledged) and this is all the data in our send
    buffer (true)
 
 We have to wait for one of these conditions to become true before we
 may send the 160-byte packet.
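 
 The three rules can be written down as a predicate.  This is only a
 sketch of the rules as listed above, not the kernel's actual code: the
 function and parameter names are made up for illustration, and taking
 the advertised window of 10240 as "the receiver's buffer" is an
 assumption.

```python
def may_send(segment_len: int, mss: int, peer_buffer: int,
             unacked_bytes: int, empties_send_buffer: bool,
             nodelay: bool = False) -> bool:
    """Sender-side SWS avoidance check, following the three listed rules."""
    # Rule 1: full-sized segments may always be sent.
    if segment_len >= mss:
        return True
    # Rule 2: the segment is larger than half the receiver's buffer.
    if segment_len > peer_buffer // 2:
        return True
    # Rule 3: nothing is outstanding AND this drains the send buffer.
    # TCP_NODELAY drops the "nothing is outstanding" half of this rule.
    if (nodelay or unacked_bytes == 0) and empties_send_buffer:
        return True
    return False


# The 160-byte tail while the 7th full-sized packet is still unACK'd:
print(may_send(160, 1440, 10240, 1440, True))
# The same tail once the delayed ACK has arrived:
print(may_send(160, 1440, 10240, 0, True))
```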
 
 After 200ms, bigburnt's delayed-ACK timer fires and it transmits the
 ACK for the 7th packet (this is part of ACK-every-other).  At this
 time, condition (3) becomes true (all outstanding data is now ACK'd) so
 we may send our 160-byte packet.
 
 15:27:37.998895 bigburnt.cmd > mango.830: . ack 40867 win 10240 <nop,nop,timestamp 1198654 168322,nop,nop,cc 2548> (DF)
 15:27:37.998931 mango.830 > bigburnt.cmd: P 40867:41027(160) ack 26 win 17255 <nop,nop,timestamp 168323 1198654,nop,nop,cc 496> (DF) [tos 0x8]
 
 Finally, the server sends us our result:
 
 15:27:37.999335 bigburnt.cmd > mango.830: P 26:33(7) ack 41027 win 10240 <nop,nop,timestamp 1198654 168323,nop,nop,cc 2548> (DF)
 
 The 200ms delay every 7 packets is what's killing performance.  This
 problem was much less severe before ACK-every-other was introduced;
 if every packet is ACK'd, the delay is reduced from a delayed-ACK
 timeout to the RTT of the connection.
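 
 To put a rough number on this, the timestamps in the trace give the
 cost of one complete request/result cycle.  The sketch below assumes
 every block costs about the same as the one traced here, which is an
 extrapolation from a single cycle; the variable names are illustrative.

```python
BLOCK_SIZE = 10240           # bytes written per request

# Seconds within the minute, taken from the first and last trace lines.
request_sent = 37.799556     # 7-byte "W10240\n" header leaves mango
reply_seen = 37.999335       # 7-byte result arrives from bigburnt


def bytes_per_second(block_size: int, start: float, end: float) -> float:
    """Throughput if every block takes as long as this one cycle."""
    return block_size / (end - start)


rate = bytes_per_second(BLOCK_SIZE, request_sent, reply_seen)
print(f"one cycle: {reply_seen - request_sent:.6f} s")
print(f"estimated throughput: {rate / 1000:.1f} kB/s")
```

 Almost all of the cycle is the wait for the delayed ACK, so the
 estimate comes out on the order of 50 kB/s no matter how fast the link
 is.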
 
 This doesn't happen when the MSS is 1024, because the 10240 byte buffer
 size is evenly divisible by the MSS, so rule (1) is always true.  This
 isn't a particularly great workaround, since there's no way to guarantee
 that the MSS will be as large as 1024, and setting the MSS to 1024 when
 it can be larger hurts performance.  Setting TCP_NODELAY on the socket
 removes the first half of rule 3 of SWS avoidance (the requirement that
 all outstanding data be ACK'd), so a short segment is always sent when
 it is all of the data in the socket buffer.
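 
 For reference, setting the option is a single setsockopt() call.  dump
 itself is written in C, where the call takes the same IPPROTO_TCP level
 and TCP_NODELAY option name; the Python sketch below only shows the
 shape of it, and make_nodelay_socket is a hypothetical helper.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with TCP_NODELAY switched on."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A non-zero value enables the option.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_nodelay_socket()
# getsockopt returns a non-zero integer when the option is enabled.
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
sock.close()
```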
 
 I'm not sure what the right fix in TCP is; applications shouldn't
 have to set TCP_NODELAY just because (blocksize div MSS) % 2 != 0.
 (Substitute N for 2 for "ACK-every-N" peers.)  I think the answer is
 that condition (3) of SWS avoidance has to be modified to take into
 account the other end's ACK frequency.
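 
 Read literally, the (blocksize div MSS) % N test can be sketched as
 follows.  tail_stalls is a hypothetical name, and the extra check for
 a zero remainder is an addition here, reflecting the MSS 1024 case
 where there is no short segment to hold back.

```python
def tail_stalls(block_size: int, mss: int, ack_every: int = 2) -> bool:
    """True if the short tail is expected to wait for a delayed ACK."""
    full, tail = divmod(block_size, mss)
    if tail == 0:
        # Every segment is MSS-sized, so rule (1) always lets it out.
        return False
    # The peer ACKs every ack_every-th packet; a leftover full-sized
    # packet stays unACK'd until the peer's delayed-ACK timer fires.
    return full % ack_every != 0


print(tail_stalls(10240, 1440))  # the case in the trace
print(tail_stalls(10240, 1024))  # the MSS 1024 workaround
```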
 
   Bill
