From owner-freebsd-net@FreeBSD.ORG  Wed Jun 25 21:47:53 2008
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 902E0106567B
	for <net@freebsd.org>; Wed, 25 Jun 2008 21:47:53 +0000 (UTC)
	(envelope-from freebsd-net@transip.nl)
Received: from relay0.transip.nl (relay0.transip.nl [80.69.67.21])
	by mx1.freebsd.org (Postfix) with ESMTP id 42C988FC1E
	for <net@freebsd.org>; Wed, 25 Jun 2008 21:47:53 +0000 (UTC)
	(envelope-from freebsd-net@transip.nl)
Received: from [192.168.0.3] (ip86-50-212-87.adsl2.static.versatel.nl
	[87.212.50.86])
	by relay0.transip.nl (Postfix) with ESMTP id 54BDE1036BA;
	Wed, 25 Jun 2008 23:47:49 +0200 (CEST)
Message-ID: <4862BCF5.4070900@transip.nl>
Date: Wed, 25 Jun 2008 23:47:33 +0200
From: Ali Niknam <freebsd-net@transip.nl>
Organization: Transip BV
User-Agent: Thunderbird 2.0.0.14 (Windows/20080421)
MIME-Version: 1.0
To: Robert Watson <rwatson@FreeBSD.org>
References: <486283B0.3060805@transip.nl>
	<20080625195523.N29013@fledge.watson.org>
In-Reply-To: <20080625195523.N29013@fledge.watson.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: net@freebsd.org
Subject: Re: FreeBSD 7.0: sockets stuck in CLOSED state...
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
X-List-Received-Date: Wed, 25 Jun 2008 21:47:53 -0000

Hi Robert,

> Sounds like there's a bug somewhere.  Before we start trying to track it 
[...]
> So, with that introduction, we're interested in resolving:
> 

Quite comprehensive indeed; thank you for all that information. I was 
not aware that the various layers of abstraction were decoupled, but 
now that I think of it, it's more or less logical, I guess.

> The first is the easiest to resolve, as all we need to do is see whether 
[...]
> the file descriptor numbers being returned to see whether, perhaps, that 
> number only goes up over time, and gets really big.
> 

My personal feeling is that it's a race condition; no idea why, but it 
feels that way. Maybe because the number of stuck sockets is so small 
compared to the large number of connections that take place.

As far as I can see, I do not leak file descriptors. I can send you the 
information you ask for (netstat, sockstat, fstat, etc.) off-list if you 
like, or, if you prefer, I can give you access to the machine. Please 
let me know which you prefer.

I'd like to reiterate that at this moment I'm not sure at all whether 
it's my code or kernel code. However, I've seen what feels to me like 
sufficient evidence to reasonably suspect that it _might_ be something 
outside my code :).

> wedged-up state. It would be most helpful if you could actually shut 
> down to single-user mode, killing all user processes, then waiting ten 
> minutes, and capturing the output of those above commands to files that 
> you can then e-mail to me.
> 

Because it's a live machine, that would be very difficult. Maybe, if you 
really need it that way and we can't find another way, I can announce 
maintenance and do it in the middle of the night :).

> Without accusing you of having buggy code, I should say that I think 
> there's a reasonable chance that what you're seeing is an interaction 
> between an existing leak of resources in the application and the way the 
> kernel state management has changed.  The output from netstat pretty 

Yes, that was the first thing I thought of as well. However, one of the 
two applications in particular is so simple that I would be ashamed to 
death if I still had a bug in there :). If it turns out that way: 
sssstttt ;).

> precisely matches that what you'd expect: lots of TCP connections in the 
> CLOSED state reflecting a series of connections built by an application 
> but then not properly discarded. Likewise, when the application is 
> killed, all of the connections go away -- most likely because the file 
> descriptors are all closed, allowing them to be garbage collected and 
> connection state freed.  If it is this sort of bug, then most likely 
> you're missing a call to close() in a work loop somewhere, and in some 
> exceptional case, you fall out of the loop without calling close().
> 

I will double-check this once more, but honestly, I strongly doubt it...
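
For reference, the pattern being checked for can be sketched as below. 
This is only a minimal illustration, not the actual application code: 
handle_connection() is a hypothetical name, and the processing step is 
left out. The point is that every way out of the loop, including the 
error case, falls through to a single close().

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical per-connection handler: reads until EOF or error.
 * Every exit from the loop falls through to the single close() at
 * the end, so an exceptional case cannot leave the descriptor open.
 * Returns 0 on clean EOF, -1 on a read or close error.
 */
int
handle_connection(int fd)
{
	char buf[4096];
	ssize_t n;
	int rc = 0;

	for (;;) {
		n = read(fd, buf, sizeof(buf));
		if (n == 0)
			break;			/* peer closed: clean EOF */
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted: retry the read */
			rc = -1;		/* real error: leave the loop */
			break;
		}
		/* ... process the n bytes in buf here ... */
	}

	/* Single exit path: the descriptor is always released. */
	if (close(fd) == -1) {
		fprintf(stderr, "close(%d): %s\n", fd, strerror(errno));
		rc = -1;
	}
	return (rc);
}
```

A leak of the kind described would correspond to an early return 
inside the loop that skips the close() at the bottom.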

One other thing I've noticed is that it's always the input buffer that 
has bytes left, never the output buffer...

Moreover, I've seen close() report EBADF, but due to the insane number 
of connections I cannot say for certain that that's when the connection 
goes into the CLOSED state. The IPs do match, but it's also very common 
for the same IPs to make numerous connections.

Kind Regards,

Ali


-- 
   Transip BV | http://www.transip.nl/
   We never let you down.