Date: Wed, 16 Jan 2002 15:50:47 -0800
From: Terry Lambert <tlambert2@mindspring.com>
To: Chad David <davidc@acns.ab.ca>
Cc: current@freebsd.org
Subject: Re: socket shutdown delay?
Message-ID: <3C4611D7.F99A5147@mindspring.com>
References: <20020116070908.A803@colnta.acns.ab.ca> <3C45F32A.5B517F7E@mindspring.com> <20020116152908.A1476@colnta.acns.ab.ca>
Chad David wrote:
> > A connection goes into FIN_WAIT_2 when it has received the ACK
> > of the FIN, but not received a FIN (or sent an ACK) itself, thus
> > permitting it to enter TIME_WAIT state for 2MSL before proceeding
> > to the CLOSED state, as a result of a server initiated close.
> >
> > A connection goes into LAST_ACK when it has sent a FIN and not
> > received the ACK of the FIN before proceeding to the CLOSED
> > state, as a result of a client initiated close.
>
> I've got TCP/IP Illustrated V1 right beside me, so I basically
> knew what was happening.  Just not why.
>
> Like I said in the original email, connections from another machine
> end up in TIME_WAIT right away; it is only local connections.

Maybe there is a bug in the interrupt thread code, or in the
scheduler for NETISR processing.  Like I said before, I think this
is unlikely.

The other possibility is a bug in simultaneous client and server
closes, but without information about your client and server
program's operation (e.g. if it's an HTTP session, and the client
closes without waiting for a response, or the server responds and
closes), that's as close as I can give you.  I *really* doubt that,
since I think it would have shown up before.

The other possibility might be the sequence numbers on a re-used
connection going backwards.  If that were to happen, you might see
the state machine pushed back into LAST_ACK when it shouldn't be.
Be sure that you use the sysctl to set the sequence number algorithm
to the one specified in the RFC, instead of the broken OpenBSD
version that supposedly prevents predictive session hijacking (which
should be an application level thing about verification of the peer,
anyway).

Also make sure that the keepalive sysctl is set on (1).

> > Since it's showing IP addresses, you appear to be using real
> > network connections, rather than loopback connections.
>
> In this case yes.  Connections to 127.0.0.1 result in the same thing.

OK, so it's not lost packets because of the use of the network
driver.  This makes me lean toward the sequence number problem, or
the RST with no mbufs available problem.

[ ... test net intentionally lossy ... ]

> Nothing like that on the box.

OK.  It was low hanging fruit: unlikely, but it had to be asked.

> > 2)  You have intentionally disabled KEEPALIVE, so that
> >     a close results in an RST instead of a normal
> >     shutdown of the TCP connection (I can't tell if
> >     you are doing a real call to "shutdown(2)", or if
> >     you are just relying on the OS resource tracking
> >     behaviour that is implicit to "close(2)" (but only
> >     if you don't set KEEPALIVE, and have disabled the
> >     sysctl default of always doing KEEPALIVE on every
> >     connection).  In this case, it's possible that the
> >     RST was lost on the wire, and since RSTs are not
> >     retransmitted, you have shot yourself in the foot.
> >
> >     Note:  You often see this type of foolish foot
> >            shooting when running MAST, WAST, or
> >            webbench, which try to factor out response
> >            speed and measure connection speed, so that
> >            they benchmark the server, not the FS or
> >            other OS latencies in the document delivery
> >            path (which is why these tools suck as real
> >            world benchmarks go).  You could also cause
> >            this (unlikely) with a bad firewall rule.
>
> I haven't changed any sysctls, and other than SO_REUSEADDR,
> the default sockopts are being used.

This doesn't tell me the setting of the keepalive sysctl.
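For what it's worth, a minimal sketch of checking (and forcing) that
keepalive setting from C, assuming sysctlbyname(3) and assuming the
name of the knob is "net.inet.tcp.always_keepalive" (verify the name
with "sysctl -a" on your build; the RFC-style ISN knob should live in
the same net.inet.tcp tree):

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
    int val, on = 1;
    size_t len = sizeof(val);

    /* Read the current setting. */
    if (sysctlbyname("net.inet.tcp.always_keepalive",
        &val, &len, NULL, 0) == -1) {
        perror("sysctlbyname");
        return (1);
    }
    printf("net.inet.tcp.always_keepalive = %d\n", val);

    /* Force it on (needs root); same effect as setting it to 1
     * with the sysctl command. */
    if (val == 0 &&
        sysctlbyname("net.inet.tcp.always_keepalive",
        NULL, NULL, &on, sizeof(on)) == -1)
        perror("sysctlbyname(set)");

    return (0);
}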
By default, it won't be on unless the sysctl forces it on, which it
does by default, unless it's been changed, or the default has been
changed in -current (don't know).  So check this one.

> I also do not call
> shutdown() on either end, and both the client and server
> processes have exited and the connections still do not clear
> up (in time they do, around 10 minutes).

You should probably call shutdown(2), if you want your code to be
mostly correct.

You also didn't say whether they in fact drain after that period of
time.  I suspect that you are just doing a large number of
connections.  I frequently ended up with 50,000+ connections in
TIME_WAIT state (I rarely use the same machine for both the client
and the server, since that is not representative of real world use),
and, of course, it takes 2MSL for TIME_WAIT to drain connections out.

My guess is that you have run out of mbufs (your usage stats tell me
nothing about the available number of real mbufs; even the "0
requests for memory denied" is not really as useful as it would
appear in the stats), or you just have an incredibly large number of
files open.

The FreeBSD file allocation table entry allocation for a large
number of simultaneously open files is bad.  Similarly, the FreeBSD
allocation of the port space is a linear lookup that has exponential
time increase as the number of connections goes up.  The same is
true of the lookup of the INPCB and TCPCB on incoming packets.

It would be useful to log state transitions for a connection case
known to be bad -- that is, log the states starting after the
problem has started, with a new connection pair or ten, in order to
see what's getting lost where.

> > 3)  You've exhausted your mbufs before you've exhausted
> >     the number of simultaneous connections you are
> >     permitted, because you have incorrectly tuned your
> >     kernel, and therefore all your connections are sitting
> >     in a starvation deadlock, waiting for packets that can
> >     never be sent because there are no mbufs available.
>
> The client eventually fails with EADDRNOTAVAIL.

Yes, this is the outbound connection limitation because of the
ports.  There are three bugs there in FreeBSD as well, but they
generally limit the outbound connections, rather than causing
problems.

One tuning variable you probably want on the machine making the
connections is to up the TCP port range to 65535; you will have to
do two sysctls in order to do this.  This will delay your client
failure by about a factor of 8-10 times as many connections
(outbound connections count against the total, but inbound
connections do not, since they do not use up socket/port pairs by
source).

> Allocated mbuf types:
>   102 mbufs allocated to data

These are probably TCP options on otherwise idle connections.

> 0% of mbuf map consumed
> mbuf cluster usage:
>   GEN list:    0/0 (in use/in pool)
>   CPU #0 list: 58/86 (in use/in pool)
>   CPU #1 list: 43/88 (in use/in pool)
>   Total:       101/174 (in use/in pool)
>   Maximum number allowed on each CPU list: 128
>   Maximum possible: 33792
> 0% of cluster map consumed
> 420 KBytes of wired memory reserved (54% in use)

I'm not sure if the 54% is of the available or of the max wired.  If
the max, this could be your problem.

> colnta->netstat -an | grep FIN_WAIT_2 | wc
>     2814   16884  219492
>
> and a few minutes later:
> colnta->netstat -an | grep FIN_WAIT_2 | wc
>     1434    8604  111852

This indicates a 2MSL draining.  The resource track close could also
be slow.
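By calling shutdown(2) I mean an explicit half-close and drain before
the close(2), rather than leaning on the resource track close.  A
minimal sketch of what that looks like on a connected TCP socket
(hypothetical helper name, error handling trimmed):

#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

static void
orderly_close(int fd)
{
    char buf[512];

    /* Send our FIN now; we are done writing (SHUT_WR == 1). */
    (void)shutdown(fd, SHUT_WR);

    /* Drain until the peer's FIN arrives (read(2) returns 0). */
    while (read(fd, buf, sizeof(buf)) > 0)
        continue;

    (void)close(fd);
}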
You could probably get an incredible speedup by doing explicit
closes in the client program, starting with the highest used fd, and
working down, instead of going the other way (it's probably a good
idea to modify the FreeBSD resource track close to do the same
thing).

There are some other inefficiencies in the fd code that can be
addressed... nominally, the allocation is a linear search from the
last valid one going higher.  For most servers, this could be
significantly improved by linking free fd's in a sparse list onto a
"freelist", and maintaining a pointer to that, instead of the index
to the first free one, but that should only impact you on allocation
(like the inpcb hash, which fails pretty badly, even when you tune
up the hash size to some unreasonable amount, and the port
allocation for outbound connections, which is, frankly, broken.
Both could benefit from a nice btree overhaul).

The timer code is also pretty sucky, even with a very large callout
wheel.  It would be incredibly valuable to have fixed interval
timers ordered by entry on interval specific lists (e.g. MSL and
2MSL lists, as well as other common ones), so that the scan of the
timer entries could be stopped at the first one whose expiration
time was after the current time for the given interval callout.
This would save you almost all of your callout list traversals,
which, with the wheel, have to be ordered (see the Rice University
paper on opportunistic timers for a glancing approach at solving the
real problem here).

These aren't biting you, though, because the quick draining is
happening, indicating that it's not really the timer code or the
other code that's your immediate problem (though you might speed
draining by a factor of 3 just by fixing the timers to use ordered
lists per interval, rather than the callout wheel).

> The box currently has 630MB free memory, and is 98.8% idle.

OK, this means that you aren't getting anywhere near the KVA limits,
and that you aren't eating as much of core as you might be
otherwise.  In practice, you can reserve as much as 50% of physical
memory for use in mbufs, if you are tuned correctly.

The limits that implies, assuming you are sending a lot of data, are
315MB/32k ~= 10,000 client and server connections, or 20,000 server
only connections (if the client is on another machine).  After that,
your transmit windows on the server and receive windows on the
client are full.  OOPS.  Halve that for -current, since the default
window size was doubled.

OK... you very well could be hitting the limits here, with the
number of sockets available and the amount of memory you have to
burn in mbufs.

Try setting your max window size down to 16k, or even 8k (you really
want to set the transmit windows large and the receive windows small
for a server, but that's not an option in the current code, I think;
and anyway, since you are running both on the same machine, that
makes it impossible for you to tune a single machine for optimal
performance as only a client or only a server).

> I'm not sure what other information would be useful?

See above.

> > 4)  You've got local hacks that you aren't telling us
> >     about (shame on you!).
>
> Nope.  Stock -current, none of my patches applied.

Heh... "not useful information without a date of cvsup, and then
possibly not even then".  Moving target problems...

Can you repeat this on 4.5RC?  If so, try 4.4-RELEASE.  It may be
related to the SYN cache code.
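On the 16k/8k window experiment above: the simplest way to try it
without touching the global defaults is to clamp the socket buffers
from the application before connect()/listen(), since the offered
window is bounded by the receive buffer.  A minimal sketch
(hypothetical helper name; the system-wide defaults live in
net.inet.tcp.sendspace and net.inet.tcp.recvspace, if memory serves):

#include <sys/types.h>
#include <sys/socket.h>

/* Clamp both socket buffers (and so the offered windows) to 16k. */
static int
clamp_windows(int s)
{
    int sz = 16 * 1024;

    if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &sz, sizeof(sz)) == -1)
        return (-1);
    if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &sz, sizeof(sz)) == -1)
        return (-1);
    return (0);
}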
The SYN-cookie code is vulnerable to the "ACK gun" attack, and since
the SYN cache code falls back into SYN cookie (it assumes that the
reason it didn't find the corresponding SYN in the SYN cache is that
it overflowed and was discarded, turning naked ACK attempts into
SYN-cookie attempts completely automatically), you might be hitting
it that way.

If that's the case, then I suggest leaving the SYN cache enabled,
and disabling the SYN cookie.  If that doesn't fix it, then you may
also want to try disabling the SYN cache.

Other than that, once you've tried this, I will need to know what
the failure modes are, and more about the client and server code
(kqueue based? standard sockets based?), and then I can suggest more
to narrow it down.

Another thing you may want to try is delaying the close of the
server side of the connection for 1-2 seconds after the last write.
This is the canonical way of forcing a client to do the close first
in all cases, which totally avoids the server-side-close-first case,
which in turn avoids the FIN_WAIT_2.  For real code, you would have
to add a "close cache" and timer.

Hope this helps...

-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message