From owner-freebsd-net@FreeBSD.ORG Fri Nov 12 02:29:16 2010
Message-ID: <4CDCA679.7020401@freebsd.org>
Date: Fri, 12 Nov 2010 13:29:13 +1100
From: Lawrence Stewart <lstewart@freebsd.org>
To: Julian Elischer
Cc: freebsd-net@freebsd.org, Christopher Penney, Andre Oppermann
In-Reply-To: <4CDC5490.7030109@freebsd.org>
Subject: Re: NFS + FreeBSD TCP Behavior with Linux NAT

On 11/12/10 07:39, Julian Elischer wrote:
> On 11/11/10 6:36 AM, Christopher Penney wrote:
>> Hi,
>>
>> I have a curious problem I'm hoping someone can help with or at least
>> educate me on.
>>
>> I have several large Linux clusters, and for each one we hide the
>> compute nodes behind a head node using NAT.
>> Historically, this has worked very well for us, and any time a NAT
>> gateway (the head node) reboots, everything recovers within a minute
>> or two of it coming back up. This includes NFS mounts from Linux and
>> Solaris NFS servers, license server connections, etc.
>>
>> Recently, we added a FreeBSD-based NFS server to our cluster resources
>> and have had significant issues with NFS mounts hanging if the head
>> node reboots. We don't have this happen much, but it does occasionally
>> happen. I've explored this, and it seems the behavior of FreeBSD
>> differs a bit from at least Linux and Solaris with respect to TCP
>> recovery. I'm curious if someone can explain this or offer any
>> workarounds.
>>
>> Here are some specifics from a test I ran:
>>
>> Before the reboot, two Linux clients were mounting the FreeBSD server.
>> They were both using port 903 locally. On the head node, clientA:903
>> was remapped to headnode:903 and clientB:903 was remapped to
>> headnode:601. There is no activity when the reboot occurs. The head
>> node takes a few minutes to come back up (we kept it down for several
>> minutes).
>>
>> When it comes back up, clientA and clientB try to reconnect to the
>> FreeBSD NFS server. They both use the same source port, but since the
>> head node's conntrack table is cleared, it's a race to see who gets
>> what port, and this time clientA:903 appears as headnode:601 and
>> clientB:903 appears as headnode:903 (>>> they essentially switch
>> places as far as the FreeBSD server would see <<<).
>>
>> The FreeBSD NFS server, since it had no outstanding ACKs it was
>> waiting on, thinks things are OK, so when it gets a SYN from the two
>> clients it only responds with an ACK.
>> The ACK it replies with to each client is bogus (invalid sequence
>> number) because it's using the return path the other client was using
>> before the reboot, so the client sends a RST back. But the RST never
>> gets to the FreeBSD system, since the head node's NAT hasn't yet seen
>> the full handshake (which would allow return packets through). The end
>> result is a "permanent" hang (at least until it would otherwise clean
>> up idle TCP connections).
>>
>> This is in stark contrast to the behavior of the other systems we
>> have. Other systems respond to the SYN used to reconnect with a
>> SYN/ACK. They appear to implicitly tear down the return path based on
>> getting a SYN from a seemingly already established connection.
>>
>> I'm assuming this is one of the grey areas where there is no specific
>> behavior outlined in an RFC? Is there any way to make the FreeBSD
>> system more reliable in this situation (like making it implicitly tear
>> down the return path)? Or is there a way to adjust the NAT setup to
>> allow the RST to return to the FreeBSD system? Currently, NAT is set
>> up with simply:
>>
>> iptables -t nat -A POSTROUTING -s 10.1.0.0/16 -o bond0 -j SNAT --to
>> 1.2.3.4
>>
>> where 1.2.3.4 is the intranet address and 10.1.0.0 is the cluster
>> network.
>
> I just added NFS to the subject because the NFS people are those you
> need to connect with.

Skimming Chris' problem description, I don't think this is an NFS issue,
and I agree with Chris that it's netstack-related behaviour as opposed
to application-related.

Chris, I have minimal cycles at the moment, and your scenario is bending
my brain a little bit too much to give a quick response. A tcpdump
excerpt showing such an exchange would be very useful. I'll try to come
back to it when I have a sec.

Andre, do you have a few cycles to digest this in more detail?

Cheers,
Lawrence
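
[For anyone following the thread: the exchange Chris describes matches
the RFC 793 rules for a SYN arriving on a connection the server still
believes is ESTABLISHED. Below is a toy model of that three-step failure,
not FreeBSD's or Linux's actual code; all function names are invented for
illustration.]

```python
def server_response_to_syn(rcv_nxt):
    # RFC 793: a SYN arriving on an ESTABLISHED connection is answered
    # with a plain ACK carrying the sequence number the server expects
    # next -- never a SYN/ACK. This is the bare ACK Chris observed.
    return ("ACK", rcv_nxt)

def client_reaction(iss, seg):
    # The reconnecting client is in SYN-SENT with initial sequence
    # number `iss`. Per RFC 793, an ACK that does not acknowledge
    # iss + 1 is unacceptable, and the client answers it with a RST.
    kind, ack = seg
    if kind == "ACK" and ack != iss + 1:
        return "RST"
    return None

def nat_forwards(handshake_complete, pkt):
    # The stateful NAT only passes return-path packets for flows whose
    # handshake it has seen complete, so the client's RST is dropped
    # and never reaches the FreeBSD server.
    return pkt if handshake_complete else None

# Example: clientA reconnects but lands on clientB's old NAT mapping,
# so the server ACKs a sequence number from the other conversation.
seg = server_response_to_syn(rcv_nxt=5_000_000)  # stale expectation
rst = client_reaction(iss=1_000, seg=seg)        # client answers RST
assert nat_forwards(False, rst) is None          # NAT drops the RST
```

If that RST did reach the server, the stale connection would be torn
down and the client's next SYN would get a normal SYN/ACK, which is why
Chris's question about letting the RST back through the NAT is the crux
of the problem.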