Date: Tue, 11 Jan 2005 03:23:42 +0100 (CET)
From: Lars Erik Gullerud <lerik@nolink.net>
To: Len Conrad <LConrad@Go2France.com>
Cc: freebsd-net@freebsd.org
Subject: Re: buildup of Windows time_wait talking to fbsd 4.10
Message-ID: <20050111025252.L88996@electra.nolink.net>
In-Reply-To: <6.1.1.1.2.20050110103857.045a9a68@81.255.84.73>
References: <6.1.1.1.2.20050110103857.045a9a68@81.255.84.73>

On Mon, 10 Jan 2005, Len Conrad wrote:

> We have a windows mailserver that relays its outbound to a fbsd gateway. We
> changed to a different fbsd gateway running 4.10. Windows then began having
> trouble sending to 4.10. Windows "netstat -an" shows dozens of lines like
> this:
>
>        source IP             destination IP
> ======================================================================
>   TCP    10.1.16.3:1403         192.168.200.59:25      TIME_WAIT

[snip]

> Eventually, the windows SMTP logs line like "cannot connect to remote IP" or
> "address already in use" because no local tcp/ip sockets are available, we
> think.
>
> The new gateway/fbsd 4.10 "sockstat -4" shows no corresponding tcp
> connections when the Windows server is showing as above. On the fbsd 4.10
> machines, smtp logs, syslog, and dmesg show no errors.
>
> We switch the windows box to smtp gateway towards the old box/fbsd 4.7, all
> is cool.

OK, let me play a wild hunch here: if you look at netstat -na output on the
4.7 machine (the one that works) while you are using it, do you see a large
number of connections in the TIME_WAIT state on that side, and none on the
Windows server?

I had a similar situation with an application we use that also opens a large
number of TCP sessions from a Windows server to a FreeBSD server. It suddenly
stopped working when the application was upgraded on the server it connected
to. In our case, it turned out to be a timing change in the new version of
the application.

When a TCP connection is closing, one side of the connection typically
initiates the close and sends a FIN,ACK packet to the other side. After going
through the steps of closing down the socket, the side that initiated the
close leaves the socket in TIME-WAIT state for 2 MSL (Maximum Segment
Lifetime, which defaults to 2 minutes, so a 4-minute wait), while the other
end transitions to CLOSED state (and tears down the socket) immediately,
without this wait period.
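
The close sequence described above can be sketched as a small model. This is
only an illustration, not code from either system: the state names follow the
standard TCP state diagram, close_sequence() and time_wait_seconds() are
hypothetical helper names, and the 120-second MSL is the figure quoted in
this message (actual defaults vary by operating system).

```python
from enum import Enum


class State(Enum):
    ESTABLISHED = "ESTABLISHED"
    FIN_WAIT_1 = "FIN-WAIT-1"
    FIN_WAIT_2 = "FIN-WAIT-2"
    CLOSE_WAIT = "CLOSE-WAIT"
    LAST_ACK = "LAST-ACK"
    TIME_WAIT = "TIME-WAIT"
    CLOSED = "CLOSED"


def close_sequence():
    """Walk a normal (one side closes first) teardown.

    Returns (initiator_states, responder_states) in the order visited.
    """
    initiator = [State.ESTABLISHED]
    responder = [State.ESTABLISHED]

    # 1. Initiator closes: sends its FIN and enters FIN-WAIT-1.
    initiator.append(State.FIN_WAIT_1)
    # 2. Responder receives the FIN, ACKs it, and enters CLOSE-WAIT.
    responder.append(State.CLOSE_WAIT)
    # 3. Initiator receives that ACK and enters FIN-WAIT-2.
    initiator.append(State.FIN_WAIT_2)
    # 4. Responder closes too: sends its own FIN and enters LAST-ACK.
    responder.append(State.LAST_ACK)
    # 5. Initiator receives the FIN, ACKs it, and waits in TIME-WAIT.
    initiator.append(State.TIME_WAIT)
    # 6. Responder receives the final ACK and goes straight to CLOSED.
    responder.append(State.CLOSED)
    return initiator, responder


def time_wait_seconds(msl_seconds):
    """TIME-WAIT lasts twice the Maximum Segment Lifetime."""
    return 2 * msl_seconds


if __name__ == "__main__":
    initiator, responder = close_sequence()
    print("initiator:", " -> ".join(s.value for s in initiator))
    print("responder:", " -> ".join(s.value for s in responder))
    # Assumed MSL of 120 seconds, as quoted in the message.
    print("TIME-WAIT duration (s):", time_wait_seconds(120))
```

Only the initiator's path ends in TIME-WAIT, so whichever host closes first
is the one left holding the socket for the full wait.
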
(The exception is when both ends send FIN,ACK at the same time, in which case
they both go to TIME-WAIT.)

What happened in our case, on the old version of the application, was that as
soon as the client started to log off the session, the server-side
application (on the FreeBSD server) would initiate closing of the TCP
session, thereby being the originator and collecting a large number of
sessions in TIME-WAIT, which was not a problem for the BSD box. Meanwhile the
Windows machine closed its socket immediately and was happy all the time.

However, after we upgraded the application, when the client logged off at the
application level, the server-side app would first take 2-3 seconds to
process various shutdown-related activities, and the client end (on the
Windows machine) got "impatient" and initiated the TCP session close from its
side. That left all the TIME-WAIT sockets hanging on the Windows side rather
than the FreeBSD side.

Now, newer versions of Windows have a ridiculously low number of max
simultaneous connections configured, and we started seeing exactly the same
kinds of errors you are describing, due to a large number of TIME-WAIT
sockets. We had to adjust the server-side application to tear down the TCP
socket first, THEN do its internal shutdown processing, in order to not leave
the Windows client in a jam. The alternative was to increase the number of
simultaneous connections on the Windows machine, which involves some registry
black magic, and we found changing the application to be the easier way out
(at the time; we will probably hack the Windows regkeys if we start seeing
the issue again).
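
A rough back-of-the-envelope sketch of why a pile of TIME-WAIT sockets jams
the client. The numbers are assumptions and do not come from this thread:
about 3976 usable local ports (1025-5000, the range commonly documented as
the default on Windows of that era) and a 240-second TIME-WAIT. The function
names are hypothetical, and the model ignores the time a connection spends
open before it is closed.

```python
def max_sustained_rate(local_ports, time_wait_seconds):
    """Connections per second the client can open and actively close
    before every local port is tied up in TIME-WAIT."""
    if local_ports <= 0 or time_wait_seconds <= 0:
        raise ValueError("both arguments must be positive")
    return local_ports / time_wait_seconds


def seconds_until_exhaustion(local_ports, time_wait_seconds, rate_per_second):
    """Time until the ports run out at a constant connection rate.

    Returns None if the rate is low enough that ports are released from
    TIME-WAIT at least as fast as they are consumed.
    """
    if rate_per_second <= max_sustained_rate(local_ports, time_wait_seconds):
        return None
    # Above the sustainable rate the pool drains before the first port
    # leaves TIME-WAIT, so exhaustion happens after local_ports / rate.
    return local_ports / rate_per_second


if __name__ == "__main__":
    ASSUMED_PORTS = 3976      # assumption: ports 1025..5000 inclusive
    ASSUMED_TIME_WAIT = 240   # assumption: 2 * MSL with a 2-minute MSL

    print(max_sustained_rate(ASSUMED_PORTS, ASSUMED_TIME_WAIT))
    print(seconds_until_exhaustion(ASSUMED_PORTS, ASSUMED_TIME_WAIT, 20))
    # Either knob raises the ceiling: a shorter wait or a larger port pool.
    print(max_sustained_rate(ASSUMED_PORTS, 30))
    print(max_sustained_rate(60000, ASSUMED_TIME_WAIT))
```

Under these assumed figures the client tops out at roughly 16 to 17 new
connections per second, which a busy mail relay can plausibly exceed.
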
You didn't mention what MTA you are using, so I don't know if this is a
similar (application-level) issue, or if it's FreeBSD 4.10 that causes some
additional delay before initiating a TCP CLOSE. Either way, this might be the
behaviour you are observing, in which case you will need to figure out how to
get the FreeBSD side to tear down the connection first, or preferably look at
tuning some registry settings on your Windows server, such as setting the MSL
time (default 2 minutes) to a much lower value and perhaps raising the
maximum number of simultaneous connections.

HTH,
/leg