Message-ID: <4862BCF5.4070900@transip.nl>
Date: Wed, 25 Jun 2008 23:47:33 +0200
From: Ali Niknam
Organization: Transip BV
To: Robert Watson
Cc: net@freebsd.org
Subject: Re: FreeBSD 7.0: sockets stuck in CLOSED state...
References: <486283B0.3060805@transip.nl> <20080625195523.N29013@fledge.watson.org>
In-Reply-To: <20080625195523.N29013@fledge.watson.org>

Hi Robert,

> Sounds like there's a bug somewhere.  Before we start trying to track it
[...]
> So, with that introduction, we're interested in resolving:
>

Quite comprehensive indeed; thank you for all that information. I was
not aware that there was a decoupling between the various parts of the
abstractions, but now that I think of it, it's more or less logical I
guess.

> The first is the easiest to resolve, as all we need to do is see whether
[...]
> the file descriptor numbers being returned to see whether, perhaps, that
> number only goes up over time, and gets really big.
>

My personal feeling is that it's a race condition; I have no idea why,
but it feels that way. Maybe because it's such a small number compared
to the large number of connections that take place.

As far as I can see, I do not leak file descriptors. I can send you the
information you ask for (netstat, sockstat, fstat, etc.) off-list if
you like or, if you prefer, I can give you access to the machine;
please let me know which you prefer.

I'd like to reiterate that at this moment I'm not at all sure whether
it's my code or kernel code. However, I've seen what feels to me like
sufficient information to reasonably suspect that it _might_ be
something outside my code :).

> wedged-up state.  It would be most helpful if you could actually shut
> down to single-user mode, killing all user processes, then waiting ten
> minutes, and capturing the output of those above commands to files that
> you can then e-mail to me.
>

Because it's a live machine, that would be very difficult. Maybe, if
you really, really need it that way and we can't find another way, I
can announce maintenance and do it in the middle of the night :).

> Without accusing you of having buggy code, I should say that I think
> there's a reasonable chance that what you're seeing is an interaction
> between an existing leak of resources in the application and the way the
> kernel state management has changed.  The output from netstat pretty

Yes, that was the first thing I thought of as well. However, one of the
two applications in particular is so simple that I would be ashamed to
death if I still had a bug in there :). If it turns out that way:
sssstttt ;).

> precisely matches that what you'd expect: lots of TCP connections in the
> CLOSED state reflecting a series of connections built by an application
> but then not properly discarded.
> Likewise, when the application is
> killed, all of the connections go away -- most likely because the file
> descriptors are all closed, allowing them to be garbage collected and
> connection state freed.  If it is this sort of bug, then most likely
> you're missing a call to close() in a work loop somewhere, and in some
> exceptional case, you fall out of the loop without calling close().
>

I will double-check this once more, but honestly, I strongly doubt
it...

One other thing I've noticed is that it's always the input buffer that
has bytes left, never the output buffer...

Moreover, I've seen that close() reports EBADF, but due to the insane
number of connections I cannot say for certain that that's when the
connection goes into the CLOSED state. The IPs do match, but it's very
common for the same IPs to make numerous connections too.

Kind Regards,

Ali

-- 
Transip BV | http://www.transip.nl/
We never let you down.
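
A minimal sketch of the two checks discussed in this thread: closing the descriptor on every exit path of the work loop, and logging the descriptor number and peer whenever close() reports EBADF, so that the failure can later be matched against a specific socket in the netstat output. It is written in Python purely for illustration; the thread does not say which language the applications use, and the names `checked_close` and `handle_connection` are hypothetical.

```python
import errno
import os
import sys


def checked_close(fd, peer="?"):
    """Close fd and report EBADF instead of silently ignoring it.

    Returns True on success, False if the descriptor was not open.
    """
    try:
        os.close(fd)
        return True
    except OSError as exc:
        if exc.errno == errno.EBADF:
            # EBADF means this number was not an open descriptor here:
            # typically a double close or a stale/incorrect fd value.
            print(f"close({fd}) for peer {peer}: EBADF", file=sys.stderr)
            return False
        raise


def handle_connection(fd, peer, handler):
    """Run handler(fd) and close fd on every exit path (hypothetical helper)."""
    try:
        handler(fd)
    finally:
        # Runs on normal return and on exceptions alike, so no path
        # leaves the work loop without calling close().
        checked_close(fd, peer)
```

Logging the descriptor number together with the peer address is a design choice for this sketch: with many connections from the same IPs, the address alone would not identify which close() failed.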