Date:      Tue, 11 Jan 2005 03:23:42 +0100 (CET)
From:      Lars Erik Gullerud <lerik@nolink.net>
To:        Len Conrad <LConrad@Go2France.com>
Cc:        freebsd-net@freebsd.org
Subject:   Re: buildup of Windows time_wait talking to fbsd 4.10
Message-ID:  <20050111025252.L88996@electra.nolink.net>
In-Reply-To: <6.1.1.1.2.20050110103857.045a9a68@81.255.84.73>
References:  <6.1.1.1.2.20050110103857.045a9a68@81.255.84.73>

On Mon, 10 Jan 2005, Len Conrad wrote:

> We have a windows mailserver that relays its outbound to a fbsd gateway.  We 
> changed to a different fbsd gateway running 4.10. Windows then began having 
> trouble sending to 4.10.  Windows "netstat -an" shows  dozens of lines like 
> this:
>
>         source IP              destination IP
> ======================================================================
>  TCP    10.1.16.3:1403         192.168.200.59:25      TIME_WAIT
[snip]

> Eventually, the Windows SMTP log shows lines like "cannot connect to remote IP" 
> or "address already in use" because no local tcp/ip sockets are available, we 
> think.
>
> The new gateway/fbsd 4.10 "sockstat -4" shows no corresponding tcp 
> connections when the Windows server is showing as above.  On the fbsd 4.10 
> machines, smtp logs, syslog, and dmesg show no errors.
>
> We switch the windows box to smtp gateway towards the old box/fbsd 4.7, all 
> is cool.

OK, let me play a wild hunch here: if you look at netstat -na output on 
the 4.7 machine (the one that works) while you are using it, do you see 
a large number of connections in the TIME_WAIT state on that side, and 
none on the Windows server?
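A quick way to check that hunch is to count sockets per TCP state on each machine. The sketch below is a minimal helper under one assumption: that the state is the last field of every line starting with `TCP`/`tcp4`/`tcp6`, which holds for the Windows layout quoted above and for FreeBSD's `netstat -an`. Only the first Windows sample line comes from this thread; the other sample lines are hypothetical.

```python
from collections import Counter

def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP sockets per state in `netstat -an` text.

    Assumes each TCP line starts with TCP/tcp4/tcp6 and ends with the state,
    as in the Windows and FreeBSD layouts; header and UDP lines are skipped.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0].lower().startswith("tcp"):
            counts[fields[-1]] += 1
    return counts

# First line quoted from the thread; the other two are hypothetical examples.
windows_sample = """\
  TCP    10.1.16.3:1403         192.168.200.59:25      TIME_WAIT
  TCP    10.1.16.3:1404         192.168.200.59:25      TIME_WAIT
  TCP    10.1.16.3:1405         192.168.200.59:25      ESTABLISHED
"""

# Hypothetical FreeBSD-style output (addresses written as host.port).
freebsd_sample = """\
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
tcp4       0      0  192.168.200.59.25      10.1.16.3.1403         TIME_WAIT
tcp4       0      0  *.25                   *.*                    LISTEN
"""

print(count_tcp_states(windows_sample))
print(count_tcp_states(freebsd_sample))
```

Run the real `netstat -an` output from each box through it; a large TIME_WAIT count on one side and almost none on the other tells you which side is initiating the close.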

I had a similar situation with an application we use that also opens a 
large number of TCP sessions from a Windows server to a FreeBSD server, 
and that suddenly stopped working when the application in question was 
upgraded on the server it connected to. In our case, it turned out to be a 
timing issue that changed in the new version of the application.

When a TCP connection is closing, one side of the connection typically 
initiates the close and sends a FIN,ACK packet to the other side. After 
going through the steps of closing down the socket, the side that 
initiated the close leaves the socket in TIME-WAIT state for 2 MSL 
(Maximum Segment Lifetime; RFC 793 defines it as 2 minutes, so a 4-minute 
wait). The other end stays in CLOSE-WAIT until its own application closes 
the socket, sends its FIN from LAST-ACK, and transitions to CLOSED (tearing 
down the socket) as soon as that FIN is acknowledged, without this wait 
period. (The exception is when both ends send FIN,ACK at the same time, in 
which case they both go to TIME-WAIT.)
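The close sequence above can be modelled as a small table of TCP state paths, which makes it easy to see that only the side sending the first FIN ends up holding TIME-WAIT. This is a simplified sketch, not a TCP implementation: it shows one common path per side, ignores the shortcut from FIN_WAIT_1 straight to TIME_WAIT when the peer's ACK and FIN arrive together, and assumes RFC 793's 2-minute MSL (real stacks choose their own value).

```python
# Simplified model of TCP connection teardown (state names as shown by netstat).
# Assumption: RFC 793's 2-minute MSL; real stacks pick their own MSL.
MSL_SECONDS = 120
TIME_WAIT_SECONDS = 2 * MSL_SECONDS  # the 4-minute wait

# One common path per role; FIN_WAIT_1 -> TIME_WAIT shortcut is ignored.
ACTIVE_CLOSE = ["ESTABLISHED", "FIN_WAIT_1", "FIN_WAIT_2", "TIME_WAIT"]
PASSIVE_CLOSE = ["ESTABLISHED", "CLOSE_WAIT", "LAST_ACK", "CLOSED"]
SIMULTANEOUS_CLOSE = ["ESTABLISHED", "FIN_WAIT_1", "CLOSING", "TIME_WAIT"]

def close_paths(initiator: str) -> dict:
    """Return the state path each side follows, given who sends the first FIN.

    initiator is "client", "server", or "both" (simultaneous close).
    """
    if initiator == "both":
        return {"client": SIMULTANEOUS_CLOSE, "server": SIMULTANEOUS_CLOSE}
    if initiator not in ("client", "server"):
        raise ValueError(f"unknown initiator: {initiator!r}")
    other = "server" if initiator == "client" else "client"
    return {initiator: ACTIVE_CLOSE, other: PASSIVE_CLOSE}

def time_wait_holders(initiator: str) -> list:
    """Which side(s) end the close sequence in TIME_WAIT."""
    paths = close_paths(initiator)
    return sorted(side for side, path in paths.items() if path[-1] == "TIME_WAIT")

print(time_wait_holders("server"))  # ['server']
print(time_wait_holders("client"))  # ['client']
print(time_wait_holders("both"))    # ['client', 'server']
```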

What happened in our case, on the old version of the application, 
was that as soon as the client started to log off the session, the 
server-side application (on the FreeBSD server) would initiate closing of 
the TCP session, thereby becoming the initiator (and accumulating a large 
number of sessions in TIME-WAIT, which was not a problem for the BSD 
box). The Windows machine, meanwhile, closed its socket immediately and was 
happy all the time.

However, after we upgraded the application, when the client logged off 
at the application level, the server-side app would first take 2-3 seconds 
to process various shutdown-related activities, and the client end (on 
the Windows machine) got "impatient" and initiated the TCP session close 
from its side. That left all the TIME-WAIT sockets hanging on the Windows 
side rather than on the FreeBSD side.

Now, newer versions of Windows have a ridiculously low limit on the number 
of simultaneous connections configured by default, and we started seeing 
exactly the same kinds of errors you are describing, due to the large 
number of TIME-WAIT sockets. We had to adjust the server-side application 
to tear down the TCP socket first, THEN do its internal shutdown 
processing, so as not to leave the Windows client in a jam. The 
alternative was to increase the number of simultaneous connections on the 
Windows machine, which involves some registry black magic, and at the time 
we found the application change to be the easier way out (we will probably 
hack the Windows registry keys if we start seeing the issue again).
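The server-side fix (close the TCP socket first, then do the slow internal shutdown) can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not our actual application: the `QUIT`/`221` exchange, the `server_side`/`client_side` names and the 0.2-second cleanup stand-in are all hypothetical. It uses `socket.socketpair()` to stay self-contained; on Unix that gives Unix-domain sockets, which have no TIME-WAIT at all, so what the sketch demonstrates is the ordering: the server closes first, and the client only closes after it sees EOF. Over real TCP, that ordering is what puts the TIME-WAIT socket on the server.

```python
import socket
import threading
import time

events = []  # order in which things happen on each side
events_lock = threading.Lock()

def log(event):
    with events_lock:
        events.append(event)

def server_side(conn, cleanup_seconds=0.2):
    """Hypothetical server handler: tear down the connection FIRST, then clean up."""
    request = conn.recv(1024)
    if request.strip() == b"QUIT":
        conn.sendall(b"221 bye\r\n")
        log("server closing")
        conn.close()                 # server sends the first FIN (over TCP: server holds TIME_WAIT)
        time.sleep(cleanup_seconds)  # stand-in for the slow shutdown processing
        log("server cleanup done")

def client_side(conn):
    """Hypothetical client: ask to log off, then wait for the peer to close first."""
    conn.sendall(b"QUIT\r\n")
    reply = conn.recv(1024)
    while conn.recv(1024):           # b"" means the peer has closed its end (EOF)
        pass
    log("client saw EOF")
    conn.close()                     # passive close: no TIME_WAIT on this side over TCP
    log("client closed")
    return reply

server_end, client_end = socket.socketpair()
worker = threading.Thread(target=server_side, args=(server_end,))
worker.start()
reply = client_side(client_end)
worker.join()
print(reply, events)
```

In the printed list, "server closing" always comes before "client saw EOF", since the client cannot see EOF until the server has closed; where "server cleanup done" lands depends on thread scheduling.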

You didn't mention what MTA you are using, so I don't know if this is a 
similar (application-level) issue, or if it's FreeBSD 4.10 that causes 
some additional delay before initiating the TCP close. Either way, this 
might be the behaviour you are observing, in which case you will need to 
figure out how to get the FreeBSD side to tear down the connection first, 
or, preferably, look at tuning some registry settings on your Windows 
server: for example, lowering the TIME-WAIT delay (the TcpTimedWaitDelay 
value, 240 seconds by default, i.e. twice a 2-minute MSL) to a much lower 
value, and perhaps raising the maximum number of simultaneous connections.
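To see why the TIME-WAIT duration matters so much, here is a back-of-the-envelope sketch based on Little's law: the number of client ports parked in TIME-WAIT is roughly the connection rate times the hold time, so the sustainable rate is about the number of ephemeral ports divided by the hold time. The figures are assumptions, not measurements from this thread: the 1025-5000 ephemeral range that Windows of that era used by default (governed by the `MaxUserPort` registry value), a 240-second hold, and 30 seconds as an example of a lowered value. It also simplifies by treating every port in TIME-WAIT as unusable for new outgoing connections.

```python
# Back-of-the-envelope sketch (Little's law): ports held in TIME_WAIT ~= rate * hold time.
# All numbers below are illustrative assumptions, not measurements.

def max_sustainable_rate(ephemeral_ports: int, time_wait_seconds: float) -> float:
    """New connections per second the client can open without running out of ports,
    assuming every port in TIME_WAIT stays unusable until the wait expires."""
    return ephemeral_ports / time_wait_seconds

# Assumed default Windows ephemeral range of that era: ports 1025..5000 inclusive.
EPHEMERAL_PORTS = 5000 - 1025 + 1  # 3976

for hold in (240, 30):  # default TIME_WAIT delay vs. an example lowered value
    rate = max_sustainable_rate(EPHEMERAL_PORTS, hold)
    print(f"TIME_WAIT {hold:>3} s -> about {rate:.1f} new connections/s")
```

This prints about 16.6 new connections per second for a 240-second hold and about 132.5 for 30 seconds. TIME-WAIT sockets on the FreeBSD side sit on port 25 and do not consume the Windows box's ephemeral ports, which is why moving the close to the server side also makes the problem go away.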

HTH,

/leg


