From: Andre Oppermann <andre@freebsd.org>
Date: Fri, 12 Nov 2010 07:26:23 +0100
To: Lawrence Stewart
Cc: freebsd-net@freebsd.org, Christopher Penney
Subject: Re: NFS + FreeBSD TCP Behavior with Linux NAT
Message-ID: <4CDCDE0F.8010501@freebsd.org>
In-Reply-To: <4CDCA679.7020401@freebsd.org>

On 12.11.2010 03:29, Lawrence Stewart wrote:
> On 11/12/10 07:39, Julian Elischer wrote:
>> On 11/11/10 6:36 AM, Christopher Penney wrote:
>>> Hi,
>>>
>>> I have a curious problem I'm hoping someone can help with or at
>>> least educate me on.
>>>
>>> I have several large Linux clusters, and for each one we hide the
>>> compute nodes behind a head node using NAT. Historically this has
>>> worked very well for us: any time a NAT gateway (the head node)
>>> reboots, everything recovers within a minute or two of it coming
>>> back up. This includes NFS mounts from Linux and Solaris NFS
>>> servers, license server connections, etc.
>>>
>>> Recently we added a FreeBSD-based NFS server to our cluster
>>> resources and have had significant issues with NFS mounts hanging
>>> if the head node reboots. It doesn't happen often, but it does
>>> happen occasionally. I've explored this, and it seems the behavior
>>> of FreeBSD differs a bit from at least Linux and Solaris with
>>> respect to TCP recovery. I'm curious whether someone can explain
>>> this or offer any workarounds.
>>>
>>> Here are some specifics from a test I ran:
>>>
>>> Before the reboot, two Linux clients were mounting the FreeBSD
>>> server, both using port 903 locally. On the head node, clientA:903
>>> was remapped to headnode:903 and clientB:903 was remapped to
>>> headnode:601. There was no activity when the reboot occurred, and
>>> the head node took a few minutes to come back up (we kept it down
>>> for several minutes).
>>>
>>> When it comes back up, clientA and clientB try to reconnect to the
>>> FreeBSD NFS server. They both use the same source ports as before,
>>> but since the head node's conntrack table was cleared it's a race
>>> to see who gets which port, and this time clientA:903 appears as
>>> headnode:601 and clientB:903 appears as headnode:903 (they
>>> essentially switch places as far as the FreeBSD server can see).
>>>
>>> The FreeBSD NFS server, since there were no outstanding ACKs it
>>> was waiting on, thinks things are OK, so when it gets a SYN from
>>> each of the two clients it responds only with an ACK. The ACK it
>>> replies with is bogus (invalid sequence number) in both cases,
>>> because it belongs to the connection the other client was using
>>> before the reboot. The client therefore sends back a RST, but the
>>> RST never reaches the FreeBSD system, because the head node's NAT
>>> hasn't yet seen the full handshake that would allow return
>>> packets. The end result is a "permanent" hang (at least until the
>>> server would otherwise clean up idle TCP connections).
>>>
>>> This is in stark contrast to the behavior of the other systems we
>>> have. Other systems respond to the SYN used to reconnect with a
>>> SYN/ACK. They appear to implicitly tear down the return path based
>>> on getting a SYN from a seemingly already established connection.
>>>
>>> I'm assuming this is one of the grey areas where no specific
>>> behavior is outlined in an RFC? Is there any way to make the
>>> FreeBSD system more reliable in this situation (like making it
>>> implicitly tear down the return path)? Or is there a way to adjust
>>> the NAT setup to allow the RST to return to the FreeBSD system?
>>> Currently, NAT is set up with simply:
>>>
>>> iptables -t nat -A POSTROUTING -s 10.1.0.0/16 -o bond0 -j SNAT --to 1.2.3.4
>>>
>>> where 1.2.3.4 is the intranet address and 10.1.0.0 is the cluster
>>> network.
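Regarding the question above about adjusting the NAT setup so the RST
can get back to the server: one experiment worth trying on the Linux
head node (an untested sketch; the sysctls below are standard
nf_conntrack knobs, but whether they actually let the RST through in
this exact scenario is an assumption that needs verifying with
tcpdump on both sides) is to relax conntrack's TCP state and window
tracking, so segments belonging to a pre-reboot connection are less
likely to be classified INVALID and silently dropped:

  # Untested sketch for the Linux head node.
  sysctl -w net.netfilter.nf_conntrack_tcp_loose=1       # pick up existing flows mid-stream
  sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1  # tolerate out-of-window segments

  # To confirm where the RST is being lost, log what conntrack
  # classifies as INVALID on the forwarding path (the log prefix is
  # just an illustrative label):
  iptables -A FORWARD -m state --state INVALID -j LOG --log-prefix "ct-invalid: "

Even if the sysctls don't change the outcome, the LOG rule should at
least show whether the client's RST is the packet being discarded.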
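On the FreeBSD side, a blunt mitigation (again only a sketch; these
are standard sysctls, the values are illustrative, and keepidle and
keepintvl are in milliseconds) is to probe idle connections
aggressively. This doesn't make the stack tear down the old session
when the new SYN arrives, but keepalive probes for the stale,
NAT-orphaned session should go unanswered (or draw a RST), so the
server should drop it after a few failed probes, and the next SYN
would then get a normal SYN/ACK:

  # Sketch for the FreeBSD NFS server; values are illustrative.
  sysctl net.inet.tcp.always_keepalive=1  # probe even without SO_KEEPALIVE
  sysctl net.inet.tcp.keepidle=120000     # first probe after 2 min idle (default 2 h)
  sysctl net.inet.tcp.keepintvl=15000     # retry probes every 15 s

The trade-off is more background traffic and earlier teardown of
legitimately idle connections, so the idle threshold needs to be
chosen with the NFS mounts' usage pattern in mind.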
>>> >>> The FreeBSD NFS server, since there was no outstanding acks it was >>> waiting >>> on, thinks things are ok so when it gets a SYN from the two clients it >>> only >>> responds with an ACK. The ACK for each that it replies with is bogus >>> (invalid seq number) because it's using the return path the other >>> client was >>> using before the reboot so the client sends a RST back, but it never >>> gets to >>> the FreeBSD system since the head node's NAT hasn't yet seen the full >>> handshake (that would allow return packets). The end result is a >>> "permanent" hang (at least until it would otherwise cleanup idle TCP >>> connections). >>> >>> This is in stark contrast to the behavior of the other systems we have. >>> Other systems respond to the SYN used to reconnect with a SYN/ACK. >>> They >>> appear to implicitly tear down the return path based on getting a SYN >>> from a >>> seemingly already established connection. >>> >>> I'm assuming this is one of the grey areas where there is no specific >>> behavior outlined in an RFC? Is there any way to make the FreeBSD system >>> more reliable in this situation (like making it implicitly tear down the >>> return)? Or is there a way to adjust the NAT setup to allow the RST to >>> return to the FreeBSD system? Currently, NAT is setup with simply: >>> >>> iptables -t nat -A POSTROUTING -s 10.1.0.0/16 -o bond0 -j SNAT --to >>> 1.2.3.4 >>> >>> Where 1.2.3.4 is the intranet address and 10.1.0.0 is the cluster >>> network. >> >> I just added NFS to the subject because the NFS people are thise you >> need to >> connect with. > > Skimming Chris' problem description, I don't think I agree that this is > an NFS issue and agree with Chris that it's netstack related behaviour > as opposed to application related. > > Chris, I have minimal cycles at the moment and your scenario is bending > my brain a little bit too much to give a quick response. A tcpdump > excerpt showing such an exchange would be very useful. I'll try come > back to it when I I have a sec. Andre, do you have a few cycles to > digest this in more detail? I had very few cycles since EuroBSDCon as well but this weekend my little son has a sleep over at my mother in law and my wife is at work. So I'm going to reduce my FreeBSD backlog. There are a few things that have queued up. I should get enough time to take care of this one. -- Andre