Date: Tue, 12 Aug 2014 05:03:15 -0700
From: Adrian Chadd <adrian@freebsd.org>
To: Vlad Zolotarov <vladz@cloudius-systems.com>
Cc: FreeBSD Net <freebsd-net@freebsd.org>, Osv Dev <osv-dev@googlegroups.com>
Subject: Re: TCP Rx window auto sizing relies on TCP timestamp option?
Message-ID: <CAJ-VmokJQiSeH=tj2GTD=wwMR0jSMYMnz3Xs7UW8yVD5ShK_Lw@mail.gmail.com>
In-Reply-To: <53E9FF32.3010802@cloudius-systems.com>
References: <53E8B424.2000904@cloudius-systems.com> <20140811170606.GV83475@funkthat.com> <53E9FF32.3010802@cloudius-systems.com>
The TL;DR is - yes, I bet it'd be nice to have. :)

-a

On 12 August 2014 04:49, Vlad Zolotarov <vladz@cloudius-systems.com> wrote:
>
> On Aug 11, 2014 8:06 PM, "John-Mark Gurney" <jmg@funkthat.com
> <mailto:jmg@funkthat.com>> wrote:
>>
>> Vlad Zolotarov wrote this message on Mon, Aug 11, 2014 at 15:16 +0300:
>> > Hi, I have the most strange question about the TCP Rx window auto sizing
>> > implementation in a FreeBSD networking stack.
>> > When I looked at the FreeBSD code (hash
>> > 9abce0e567c9a5a0520cdd94d5c633c7baf9a184) I noticed that
>> > the mentioned above feature will not be "enabled" if there isn't a TCP
>> > timestamp option present in the current TCP session:
>> >
>> > See sys/netinet/tcp_input.c: line 1813 in tcp_do_segment() function:
>> >
>> > if (V_tcp_do_autorcvbuf &&
>> >                   *to.to_tsecr* && <-------- this is what I'm
>> > talking about
>> >                   (so->so_rcv.sb_flags & SB_AUTOSIZE))
>> >
>> > So, if i read the code correctly, if there isn't a TS option (negotiated
>> > and thus present in every received packet) the receive socket buffer
>> > won't grow thus preventing the growth of the Rx window.
>> > If that's the case this is very strange since TS option is not promised
>> > and even more - in many cases it won't be present.
>> > For example in Linux this feature is disabled by default (controlled by
>> > /proc/sys/net/ipv4/tcp_timestamps).
>> > This is how I actually noticed the problem the first place: I ran iperf
>> > test where Linux was an initiator and a transmitter (iperf -c) FreeBSD
>> > box was a receiver (iperf -s) and I noticed that the Rx window wasn't
>> > opening up because Linux box hasn't negotiated the TS option in the SYN.
>> > As a result, the throughput numbers were significantly lower compared to
>> > Linux-to-Linux setup (Linux uses a Dynamic Right-Sizing (DRS) algorithm
>> > http://public.lanl.gov/radiant/pubs.html#DRS, which doesn't rely on TS).
>> >
>> > Could anybody comment on this, pls.?
>> > Did I miss anything?
>> > Is it true that FreeBSD assumes that TS option is always present and if
>> > not how can I cause an Rx Window to open up when TS option hasn't been
>> > negotiated?
>>
>> This means the receive buffer won't grow beyond the default of 64k...
>> But, as the comment says:
>>                  * On the receive side the socket buffer memory is only
>> rarely
>>                  * used to any significant extent.  This allows us to be
>> much
>>
>> The receive buffer will only get used if the application takes too long
>> to read it's buffer, or it isn't currently waiting...  If that's the
>> case, then the application should be fixed to be able to process the
>> data as quickly as it comes in...
>
> U r right about the Rx buffer and as a result the Rx window will not grow
> beyond this value too.
>
> See the following lines:
>
> tcp_output.c: tcp_output():
>
> line 509:
>
> recwin = sbspace(&so->so_rcv);
>
>
> line 1034:
>
> /*
>  * According to RFC1323 the window field in a SYN (i.e., a <SYN>
>  * or <SYN,ACK>) segment itself is never scaled.  The <SYN,ACK>
>  * case is handled in syncache.
>  */
> if (flags & TH_SYN)
>         th->th_win = htons((u_short)
>                         (min(sbspace(&so->so_rcv), TCP_MAXWIN)));
> else
>         th->th_win = htons((u_short)(recwin >> tp->rcv_scale));
>
>
> As a result the Tx window of a transmitter will not grow beyond 64K as well
> and this is a single full LSO/LRO frame.
> So this will limit a transmitter by a single LSO frame (64K) frame per RTT
> since the receiver will only "see" the new bytes only after they are
> delivered by a HW and this will be after all 64KB (full LRO aggregation) are
> received and only then it will send an ACK.
>
> Now let's consider u have a 0.2ms RTT like I have on my setup with 40Gbps
> ConnectX 3 NICs connected back to back.
> So, in this case the best throughput u'll ever get with the 64K window will
> be 8*64K/0.2ms ~ 2.5Gbps which is 1/16 of a line rate and u need at least
> 64K*16 ~ 1MB window to reach the line rate. And the higher RTT the larger
> Window we'll need. And this is in case the application frees the socket
> buffer immediately once it arrives which may never be the case of course.
>
> I suppose use cases like above were exactly the motivation for Window
> Scaling option in RFC 1323.
>
>
>>
>> So, I don't see much of an issue w/ the code you pointed out, yes,
>> the receive buffer won't grow,
>
>> but there are options that you can set
>> (sysctl net.inet.tcp.recvspace) and SO_RCVBUF in the application that
>> will address it otherwise...
>
> Exactly! If there is no TS - it won't and FreeBSD will not be able to
> utilize the network link.
> Frankly, I don't understand your advice - u suggest for each and every
> application to go and manually configure a receive socket buffer size? Or
> increase the initial socket buffer globally, which is even worse?! And which
> value should we choose? As u may see above the proper value depends on the
> RTT and RTT may change while application runs due to routing change. I doubt
> your suggestion is feasible.
>
> So, my first question stands - doesn't FreeBSD community think that it would
> be beneficial for FreeBSD to use a DRS (or similar?) algorithm when there
> are no TS negotiated?
>
> thanks,
> vlad
>
>
>>
>> Obviously setting the default too large will just waste memory...
>>
>> --
>> John-Mark Gurney Voice: +1 415 225 5579
>>
>> "All that I will do, has been done, All that I have, has not."
>
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
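
The window arithmetic quoted in the thread (at most one receive window in flight per RTT) can be checked with a small sketch. This is not code from FreeBSD or from any of the posters. The function names, the constant names, and the choice of integer microseconds for the RTT are assumptions made for illustration only. The 16-bit window field and the maximum shift of 14 come from RFC 1323.

```c
#include <stdint.h>

/* Illustrative limits (hypothetical names): the TCP window field is 16 bits
 * wide and RFC 1323 caps the window-scale shift at 14. */
#define WIN_FIELD_MAX 65535u
#define WIN_SHIFT_MAX 14u

/* Upper bound on throughput in bits/s when at most one window of
 * window_bytes can be in flight per round trip of rtt_us microseconds. */
uint64_t
max_throughput_bps(uint64_t window_bytes, uint64_t rtt_us)
{
	return (window_bytes * 8u * 1000000u / rtt_us);
}

/* Bandwidth-delay product: bytes that must be in flight to keep a link of
 * link_bps bits/s busy over a round trip of rtt_us microseconds. */
uint64_t
bdp_bytes(uint64_t link_bps, uint64_t rtt_us)
{
	return (link_bps * rtt_us / 8u / 1000000u);
}

/* Smallest window-scale shift for which window_bytes >> shift fits in the
 * 16-bit window field, capped at WIN_SHIFT_MAX. */
unsigned
min_window_shift(uint64_t window_bytes)
{
	unsigned shift = 0;

	while (shift < WIN_SHIFT_MAX && (window_bytes >> shift) > WIN_FIELD_MAX)
		shift++;
	return (shift);
}
```

With the numbers from the thread, max_throughput_bps(65536, 200) evaluates to 2621440000 (about 2.6 Gbps, in line with the ~2.5Gbps estimate), bdp_bytes(40000000000, 200) evaluates to 1000000 (about 1MB), and min_window_shift(1000000) evaluates to 4. A receiver would therefore need both a buffer of about 1MB and a negotiated window scale of at least 4 to fill that link.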