Date: Wed, 12 Jan 2022 10:48:59 -0800
From: Gleb Smirnoff <glebius@freebsd.org>
To: net@freebsd.org
Cc: current@freebsd.org
Subject: compressed TIME-WAIT to be decomissioned
Message-ID: <Yd8im/VkTU1zdvOi@FreeBSD.org>

Hi!

[crossposted to current@, but let's keep discussion at net@]

I have already touched on the topic with rrs@, jtl@, tuexen@, rscheff@ and
Igor Sysoev (author of nginx). Now posting for wider discussion.

TLDR: struct tcptw shall be decommissioned

The longer version covers three topics:

* why does tcptw exist?
* why is it no longer necessary?
* what would we get by removing it?

Why does struct tcptw exist?

When a TCP connection goes to the TIME-WAIT state, it can only retransmit
the very last ACK, and thus doesn't need all of the control data in the
kernel. However, we are required to keep it in memory for a certain amount
of time (2*MSL). So, let's save memory: free the socket, free the tcpcb and
leave only the inpcb, which will point at a small tcptw (much smaller than
tcpcb) that holds enough info to retransmit the last ACK. This was done in
early 2003, see 340c35de6a2.

What was different in 2003 compared to 2022?

* First of all, internet servers were running i386 with only 2 GB of KVA
  space. Unlike today, they were memory constrained in the first place, not
  CPU bound.

* Many HTTP connections were made by older browsers, which were not able
  to use persistent HTTP connections. Those browsers that could would
  recycle connections more often than today. Default timeouts in Apache for
  persistent connections were short. So, the ratio of connections in
  TIME-WAIT compared to live connections was much bigger than today. Here is
  sample data from 2008 provided to me by Igor Sysoev:

  ITEM     SIZE    LIMIT    USED    FREE   REQUESTS  FAILURES
  tcpcb:    728,  163840,  22938,  72722,  13029632,  0
  tcptw:     88,  163842,  10253,  72949,   2447928,  0

  We see that TIME-WAITs are ~ 50% of live connections. Today I see that
  TIME-WAITs are ~ 1% of connections. My data is biased here, since I'm
  looking at servers that do mostly video streaming. I'd be grateful if
  anybody replies to this email with some other modern data on the ratio
  between tcpcb and tcptw allocations.

* Internet bandwidth was lower and thus the average size of an HTTP object
  was much smaller. That made the average send socket buffer size much
  smaller than today. Note that TCP socket buffer autosizing came only in
  2009. This means that today the most significant portion of kernel memory
  consumed by an average TCP connection is the send socket buffer, and
  socket+inpcb+tcpcb is just a fraction of that. Thus, by swapping tcpcb for
  tcptw we are saving a fraction of a fraction of the memory consumed by an
  average connection.

* Who said that 2*MSL (60 seconds) is an adequate time to keep TIME-WAIT?
  In 71d2d5adfe1 I added some stats on usage of tcptw and experimented a bit
  with lowering net.inet.tcp.msl. It appeared that lowering it by a factor
  of three doesn't have a statistically significant effect on TIME-WAIT use
  stats. This means that the already minuscule number of TIME-WAIT
  connections on a modern HTTP server can be lowered 3 times more. Feel free
  to lower net.inet.tcp.msl and do your own measurements with
  'netstat -sp tcp | grep TIME-WAIT'. I'd be glad to see your results.

Ok, now what would removal give us?

* One less alloc/free during socket lifetime (immediately).

* Reduced code complexity. inp->inp_ppcb can always be dereferenced as
  tcpcb. Lots of checks for inp->inp_flags & INP_TIMEWAIT go away
  (eventually).

* Shrinking of struct inpcb. Today inpcb has some TCP-only data, e.g. HPTS.
  The reason for that is obvious: compressed TIME-WAIT. An HPTS-driven
  connection may transition to TIME-WAIT, so we can't use tcpcb for it. Now
  we would be able to. So, for non-TCP connections the memory footprint
  shrinks (with following changes).

* Embedding inpcb into the protocol's cb. An inpcb becomes one piece of
  memory with tcpcb. One more alloc/free saved during socket lifetime.
  Reduced code complexity, since now inpcb == tcpcb (following changes).

How much memory are we going to lose?

(kgdb) p tcpcb_zone->uz_keg->uk_rsize
$5 = 1064
(kgdb) p tcptw_zone->uz_keg->uk_rsize
$6 = 72
(kgdb) p tcpcbstor->ips_zone->uz_keg->uk_rsize
$8 = 424

After the change a connection in TIME-WAIT would consume 424+1064 bytes
instead of 424+72. Multiply that by the expected number of connections in
TIME-WAIT on your machine.

Comments welcome.

-- 
Gleb Smirnoff
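
The memory cost arithmetic in the message can be sketched as follows. This is only an illustration using the three zone item sizes printed by kgdb above; the function names and the sample connection count are hypothetical and do not come from the FreeBSD source, and the sizes will likely differ on other kernel versions and architectures.

```python
# Zone item sizes in bytes, copied from the kgdb output in the message.
TCPCB_SIZE = 1064  # tcpcb_zone->uz_keg->uk_rsize
TCPTW_SIZE = 72    # tcptw_zone->uz_keg->uk_rsize
INPCB_SIZE = 424   # tcpcbstor->ips_zone->uz_keg->uk_rsize


def time_wait_bytes(connections, compressed):
    """Memory held by `connections` sockets in TIME-WAIT.

    compressed=True  -> inpcb + tcptw (compressed TIME-WAIT, today)
    compressed=False -> inpcb + tcpcb (after the proposed removal)
    """
    per_connection = INPCB_SIZE + (TCPTW_SIZE if compressed else TCPCB_SIZE)
    return connections * per_connection


def extra_bytes(connections):
    """Additional memory needed once tcptw is removed."""
    return time_wait_bytes(connections, False) - time_wait_bytes(connections, True)


if __name__ == "__main__":
    n = 10_000  # hypothetical number of TIME-WAIT connections on one machine
    print("with tcptw:   ", time_wait_bytes(n, True), "bytes")
    print("without tcptw:", time_wait_bytes(n, False), "bytes")
    print("difference:   ", extra_bytes(n), "bytes")
```

To estimate the cost on a real machine, replace n with the observed number of connections in TIME-WAIT and the three constants with the sizes reported by that machine's kernel.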