Date: Thu, 3 May 2007 18:26:30 -0700 (PDT)
From: Matthew Dillon
Message-Id: <200705040126.l441QUZh078197@apollo.backplane.com>
To: "Marc G. Fournier"
Cc: freebsd-stable@freebsd.org, Robert Watson
Subject: Re: Socket leak (Was: Re: What triggers "No Buffer Space) Available"?

:I'm trying to probe this as well as I can, but network stacks and sockets have
:never been my strong suit ...
:
:Robert had mentioned in one of his emails about a "Sockets can also exist
:without any referencing process (if the application closes, but there is still
:data draining on an open socket)."
:
:Now, that makes sense to me, I can understand that ... but, how would that look
:as far as netstat -nA shows? Or, would it? For example, I have:
:
:...

Netstat should show any sockets, whether they are attached to processes or not.
Usually you can match up the address from netstat -nA with the addresses from sockets shown by fstat to figure out what processes the sockets are attached to.

There are three situations that you have to watch out for:

(1) The socket was close()'d and is still draining. The socket will time out and terminate within ~1-5 minutes. It will not be referenced by any descriptor or process.

(2) The socket descriptor itself has been sent over a unix domain socket from one process to another and is currently in transit. The file pointer representing the descriptor is what is actually in transit, and will not be referenced by any processes while it is in transit. There is a garbage collector that figures out unreferenceable loops. I think it's called unp_gc or something like that.

(3) The socket is not closed, but is idle (like having a remote shell open and never typing in it). Service processes can get stuck waiting for data on such sockets. The socket WILL be referenced by some process. Cleanup of such idle connections is controlled by the net.inet.tcp.keep* sysctls and net.inet.tcp.always_keepalive. I almost universally turn on net.inet.tcp.always_keepalive to ensure that dead idle connections get cleaned out.

Note that keepalive only applies to idle connections. A socket that has been closed and needs to drain (either data or the FIN state) will time out and clean itself up whether keepalive is turned on or off.

netstat -nA will give you the status of all your sockets. You can observe the state of any TCP sockets.

Unix domain sockets have no state, and closure is governed simply by them being dereferenced, just like a pipe. In this case there are really only two situations:

(1) One end of the unix domain socket is still referenced by a process, or

(2) The socket has been sent over another unix domain socket and is 'in transit'.
The socket will remain intact until it is either no longer in transit (read out from the other unix domain socket), or the garbage collector determines that the socket the descriptor is transiting over is not externally referenceable, at which point it destroys that socket and any in-transit sockets contained within.

Any sockets that don't fall into these categories are in trouble... either a timer has failed somewhere or (if unix domain) the garbage collector has failed to detect that it is in an unreferenceable loop.

One thing you can do is drop into single user mode... kill all the processes on the system, and see if the sockets are recovered. That will give you a good idea as to whether it is a real leak or whether some process is directly or indirectly (by not draining a unix domain socket on which other sockets are being transferred) holding onto the socket.

-Matt
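
To make the 'in transit' case above concrete, here is a minimal Python sketch. It is not part of the original message: it assumes Python 3.9 or later (for socket.send_fds and socket.recv_fds), and the function and variable names are made up for illustration. It sends one end of a unix domain socket pair over a second pair, closes the local copy, and shows that the socket stays usable while the only thing holding it is the queued message.

```python
import socket


def pass_descriptor_demo():
    """Send a socket over a unix domain socket and use it after transit."""
    # Carrier pair: the unix domain socket the descriptor travels over.
    carrier_tx, carrier_rx = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    # Payload pair: one end of this pair is what gets sent.
    payload_a, payload_b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        # Queue payload_b's descriptor on the carrier (an SCM_RIGHTS message).
        socket.send_fds(carrier_tx, [b"x"], [payload_b.fileno()])

        # Drop our own reference. In this process nothing points at that end
        # of the payload pair now; it is held only by the in-transit message.
        payload_b.close()

        # The socket is still intact, so its peer can write to it.
        payload_a.sendall(b"still alive")

        # Reading the carrier takes the descriptor out of transit and gives
        # it a descriptor number in the receiver again.
        _msg, fds, _flags, _addr = socket.recv_fds(carrier_rx, 16, 1)
        received = socket.socket(fileno=fds[0])
        try:
            return received.recv(1024)
        finally:
            received.close()
    finally:
        payload_a.close()
        carrier_tx.close()
        carrier_rx.close()


if __name__ == "__main__":
    print(pass_descriptor_demo())
```

If the receiving side never read the carrier, the passed socket would stay allocated for as long as the carrier itself remained referenced, which is the indirect "holding onto the socket" case described above.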