From: Matthew Dillon
Date: Thu, 29 May 2008 14:32:43 -0700 (PDT)
To: Robert Blayzor
Cc: freebsd-stable@freebsd.org
Subject: Re: Sockets stuck in FIN_WAIT_1
Message-Id: <200805292132.m4TLWhCv026720@apollo.backplane.com>

:I think we're onto something here, but for some reason it doesn't make
:any sense.
:I have keepalives turned OFF in Apache:
:
:When I tcpdump this, I see something sending ack's back and forth
:every 60 seconds, but what? Apache? I'm not sure why. I don't see
:any timeouts in Apache for ~60 seconds. As you can see, sometimes we
:send an ack, but never see a reply. I'm gathering the OS level
:keepalives don't come into play because this session is not considered
:idle?
:
:
:0:13:07.640426 IP 1.1.1.1.80 > 2.2.2.2.33379: . 4208136508:4208136509(1) ack 1471446041 win 520
:20:13:07.736505 IP 2.2.2.2.33379 > 1.1.1.1.80: . ack 0 win 0
:
:20:14:07.702647 IP 1.1.1.1.80 > 2.2.2.2.33379: . 0:1(1) ack 1 win 520
:
:20:15:07.764920 IP 1.1.1.1.80 > 2.2.2.2.33379: . 0:1(1) ack 1 win 520
:
:20:15:07.860988 IP 2.2.2.2.33379 > 1.1.1.1.80: . ack 0 win 0
:
:20:16:07.827262 IP 1.1.1.1.80 > 2.2.2.2.33379: . 0:1(1) ack 1 win 520
:...

    Yah, the connection is valid so keepalives do not come into play.

    What is happening is that 1.1.1.1 wants to send something to 2.2.2.2,
    but 2.2.2.2 is telling 1.1.1.1 that it has no buffer space (win 0).
    This forces the TCP stack on 1.1.1.1 (the kernel, not the Apache
    server) to 'probe' the connection, which it appears to be doing once
    a minute. It is probing the connection, waiting for 2.2.2.2 to tell
    it that buffer space is available (win != 0). The connection remains
    valid because 2.2.2.2 continues to respond to the probes.

    Now, the connection is also in a half-closed state, which means that
    one direction is closed. I can't tell which direction that is, but my
    guess is that 1.1.1.1 (the Apache server) closed the 1.1.1.1->2.2.2.2
    direction and the 2.2.2.2 box has a broken TCP implementation and
    can't deal with it.

:I'm finding several of these sessions doing the same exact thing....
:
:--
:Robert Blayzor, BOFH
:INOC, LLC

    I can suggest two things. First, the TCP connection is good, but you
    may still be able to tell Apache, in the Apache configuration file,
    to time out after a certain period of time and clear the connection.

    Secondly, it may be beneficial to identify exactly what the client
    and server were talking about which caused the client to hang with a
    live TCP connection. The only way to do that is to tcpdump EVERYTHING
    going on related to the Apache server, save it to a big-ass disk
    partition (like 500G), and then when you see a stuck connection go
    back through the tcpdump log file and locate it, grep it out, and
    review what exactly it was talking about. You'd have to tcpdump with
    options to tell it to dump the TCP data payloads.

    It seems likely that the client is running an applet or javascript
    that receives a stream over the connection, and that applet or
    javascript program has locked up, causing the data sent from the
    server to build up, the client's buffer space to run out, and the
    client to start advertising the 0 window.

					-Matt
					Matthew Dillon
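
    As a rough aid for the log review described above, here is a minimal
    Python sketch that scans tcpdump text output for endpoints that keep
    advertising a zero window. It assumes the line layout quoted in this
    thread (timestamp, "IP", "src > dst:", then "win N"). The function
    name zero_window_senders and the min_count threshold are hypothetical
    and exist only for illustration; other tcpdump versions may print a
    different format.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the capture quoted in this thread, e.g.
#   20:13:07.736505 IP 2.2.2.2.33379 > 1.1.1.1.80: . ack 0 win 0
LINE_RE = re.compile(
    r"^(?P<ts>\S+) IP (?P<src>\S+) > (?P<dst>\S+?): .*\bwin (?P<win>\d+)"
)


def zero_window_senders(lines, min_count=2):
    """Return {(src, dst): [timestamps]} for senders that advertised
    'win 0' at least min_count times.

    min_count is an arbitrary illustrative threshold, not a TCP constant.
    """
    hits = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        # Only packets whose advertised window is exactly 0 are recorded.
        if m and int(m.group("win")) == 0:
            hits[(m.group("src"), m.group("dst"))].append(m.group("ts"))
    return {pair: ts for pair, ts in hits.items() if len(ts) >= min_count}


if __name__ == "__main__":
    import sys

    # Reads tcpdump text output from standard input.
    for (src, dst), stamps in zero_window_senders(sys.stdin).items():
        print(f"{src} -> {dst}: win 0 seen {len(stamps)} times, "
              f"first at {stamps[0]}, last at {stamps[-1]}")
```

    Run against the capture above, this would single out 2.2.2.2.33379
    as the side advertising win 0. It only narrows down which connections
    to pull out of the full dump; it says nothing about why the client
    stopped reading.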