Date:      Wed, 25 Jun 2008 23:47:33 +0200
From:      Ali Niknam <freebsd-net@transip.nl>
To:        Robert Watson <rwatson@FreeBSD.org>
Cc:        net@freebsd.org
Subject:   Re: FreeBSD 7.0: sockets stuck in CLOSED state...
Message-ID:  <4862BCF5.4070900@transip.nl>
In-Reply-To: <20080625195523.N29013@fledge.watson.org>
References:  <486283B0.3060805@transip.nl> <20080625195523.N29013@fledge.watson.org>

Hi Robert,

> Sounds like there's a bug somewhere.  Before we start trying to track it 
[...]
> So, with that introduction, we're interested in resolving:
> 

Quite comprehensive indeed; thank you for all that information. I was 
not aware that there was a decoupling between the various parts of the 
abstractions, but now that I think of it, it's more or less logical I guess.

> The first is the easiest to resolve, as all we need to do is see whether 
[...]
> the file descriptor numbers being returned to see whether, perhaps, that 
> number only goes up over time, and gets really big.
> 

My personal feeling is that it's a race condition; no idea why, but it 
feels that way. Maybe because the number of stuck sockets is so small 
compared to the large number of connections being handled.

I do not leak file descriptors as far as I can see. I can send you the 
information you asked for (netstat, sockstat, fstat, etc.) off-list if 
you like, or, if you prefer, I can give you access to the machine. 
Please let me know which you prefer.
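
A minimal sketch of one way to check whether descriptor numbers keep 
climbing, as suggested in the quoted text above. The helper name 
note_fd() and the idea of calling it on every descriptor returned by 
accept() are assumptions for illustration, not code from either 
application. Since the kernel normally hands out the lowest free 
descriptor number, a high-water mark that keeps growing under steady 
load would hint at descriptors that are never closed.

```c
/* Hypothetical helper: remembers the highest descriptor number seen so far.
 * Intended to be called on every fd returned by accept() or socket(). */
static int max_fd_seen = -1;

/* Records fd and returns the current high-water mark. */
int note_fd(int fd)
{
    if (fd > max_fd_seen)
        max_fd_seen = fd;
    return max_fd_seen;
}
```

Logging the returned value every few minutes would be enough to see 
whether it levels off or only goes up.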

I'd like to reiterate that at this moment I'm not at all sure whether 
it's my code or kernel code. However, I have seen what feels like 
sufficient information to reasonably suspect that it _might_ be 
something outside my code :).

> wedged-up state. It would be most helpful if you could actually shut 
> down to single-user mode, killing all user processes, then waiting ten 
> minutes, and capturing the output of those above commands to files that 
> you can then e-mail to me.
> 

Because it's a live machine, that would be very difficult. Maybe, if you 
really, really need it that way and we can't find another way, I can 
announce maintenance and do it in the middle of the night :).

> Without accusing you of having buggy code, I should say that I think 
> there's a reasonable chance that what you're seeing is an interaction 
> between an existing leak of resources in the application and the way the 
> kernel state management has changed.  The output from netstat pretty 

Yes, that was the first thing I thought of as well. However, one of the 
two applications in particular is so simple that I would be ashamed to 
death if I still had a bug in there :). If it turns out that way: 
sssstttt ;).

> precisely matches that what you'd expect: lots of TCP connections in the 
> CLOSED state reflecting a series of connections built by an application 
> but then not properly discarded. Likewise, when the application is 
> killed, all of the connections go away -- most likely because the file 
> descriptors are all closed, allowing them to be garbage collected and 
> connection state freed.  If it is this sort of bug, then most likely 
> you're missing a call to close() in a work loop somewhere, and in some 
> exceptional case, you fall out of the loop without calling close().
> 

I will double-check this once more, but honestly, I strongly doubt it...
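
For reference, a minimal sketch of the pattern the quoted paragraph 
describes: a work loop in which every exit, including the exceptional 
ones, reaches the same close(). The function name serve_client() and 
the echo behaviour are assumptions made up for this illustration; 
neither application's actual code is shown in this thread.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical per-connection handler: echoes data back until EOF or error.
 * Every exit path funnels through the single close() at the end. */
int serve_client(int fd)
{
    char buf[512];
    int rc = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n == 0)
            break;                  /* peer closed the connection */
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal: retry */
            rc = -1;
            break;                  /* exceptional case: break, never return */
        }
        if (write(fd, buf, (size_t)n) != n) {
            rc = -1;                /* short or failed write: give up */
            break;
        }
    }

    /* The only way out of the function; the descriptor is always released. */
    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

The bug being described would be a "return -1;" in place of one of the 
"break" statements, which skips the close() and leaves the socket 
allocated until the process exits.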

One other thing I've noticed is that it's always the input buffer that 
has bytes left, never the output buffer...

Moreover, I've seen close() report EBADF, but due to the insane number 
of connections I cannot say for certain that that's when the connection 
goes into the CLOSED state. The IPs do match, but it's also very common 
for the same IPs to make numerous connections.
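
A minimal sketch of how the EBADF reports could be tied to a specific 
descriptor and call site, so they can be compared against the netstat 
output. The wrapper name checked_close() and its "where" label are 
assumptions for illustration. close() failing with EBADF means the 
number passed in was not an open descriptor in the process at that 
moment, which most commonly points to the same descriptor being closed 
twice or to a stale or overwritten fd variable.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical wrapper: closes fd and reports failures together with the
 * descriptor number and a caller-supplied label for the call site.
 * Returns 0 on success, otherwise the errno value set by close(). */
int checked_close(int fd, const char *where)
{
    if (close(fd) == 0)
        return 0;

    int err = errno;    /* save errno before fprintf() can change it */
    fprintf(stderr, "close(%d) failed at %s: %s\n", fd, where, strerror(err));
    return err;
}
```

If the application is multi-threaded, a double close is worth ruling 
out first: the second close() may hit a descriptor number that another 
thread has just been given for a new connection.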

Kind Regards,

Ali


-- 
   Transip BV | http://www.transip.nl/
   We never let you down.


