From owner-freebsd-net@FreeBSD.ORG Fri Nov 12 02:29:16 2010
Message-ID: <4CDCA679.7020401@freebsd.org>
Date: Fri, 12 Nov 2010 13:29:13 +1100
From: Lawrence Stewart <lstewart@freebsd.org>
To: Julian Elischer
Cc: freebsd-net@freebsd.org, Christopher Penney, Andre Oppermann
In-Reply-To: <4CDC5490.7030109@freebsd.org>
Subject: Re: NFS + FreeBSD TCP Behavior with Linux NAT

On 11/12/10 07:39, Julian Elischer wrote:
> On 11/11/10 6:36 AM, Christopher Penney wrote:
>> Hi,
>>
>> I have a curious problem I'm hoping someone can help with or at least
>> educate me on.
>>
>> I have several large Linux clusters, and for each one we hide the
>> compute nodes behind a head node using NAT.
>> Historically, this has worked very well for us, and any time a NAT
>> gateway (the head node) reboots, everything recovers within a minute
>> or two of it coming back up. This includes NFS mounts from Linux and
>> Solaris NFS servers, license server connections, etc.
>>
>> Recently, we added a FreeBSD-based NFS server to our cluster resources
>> and have had significant issues with NFS mounts hanging if the head
>> node reboots. We don't have this happen much, but it does occasionally
>> happen. I've explored this, and it seems the behavior of FreeBSD
>> differs a bit from at least Linux and Solaris with respect to TCP
>> recovery. I'm curious if someone can explain this or offer any
>> workarounds.
>>
>> Here are some specifics from a test I ran:
>>
>> Before the reboot, two Linux clients were mounting the FreeBSD server.
>> They were both using port 903 locally. On the head node, clientA:903
>> was remapped to headnode:903 and clientB:903 was remapped to
>> headnode:601. There is no activity when the reboot occurs. The head
>> node takes a few minutes to come back up (we kept it down for several
>> minutes).
>>
>> When it comes back up, clientA and clientB try to reconnect to the
>> FreeBSD NFS server. They both use the same source port, but since the
>> head node's conntrack table is cleared, it's a race to see who gets
>> what port, and this time clientA:903 appears as headnode:601 and
>> clientB:903 appears as headnode:903 (>>> they essentially switch
>> places as far as the FreeBSD server would see <<<).
>>
>> The FreeBSD NFS server, since it had no outstanding ACKs it was
>> waiting on, thinks things are OK, so when it gets a SYN from the two
>> clients it only responds with an ACK.
>> The ACK it replies with to each client is bogus (invalid sequence
>> number) because it's using the return path the other client was using
>> before the reboot, so the client sends a RST back. But the RST never
>> gets to the FreeBSD system, since the head node's NAT hasn't yet seen
>> the full handshake (which would allow return packets through). The end
>> result is a "permanent" hang (at least until it would otherwise clean
>> up idle TCP connections).
>>
>> This is in stark contrast to the behavior of the other systems we
>> have. Other systems respond to the SYN used to reconnect with a
>> SYN/ACK. They appear to implicitly tear down the return path based on
>> getting a SYN from a seemingly already established connection.
>>
>> I'm assuming this is one of the grey areas where there is no specific
>> behavior outlined in an RFC? Is there any way to make the FreeBSD
>> system more reliable in this situation (like making it implicitly tear
>> down the return path)? Or is there a way to adjust the NAT setup to
>> allow the RST to return to the FreeBSD system? Currently, NAT is set
>> up with simply:
>>
>> iptables -t nat -A POSTROUTING -s 10.1.0.0/16 -o bond0 -j SNAT --to
>> 1.2.3.4
>>
>> where 1.2.3.4 is the intranet address and 10.1.0.0 is the cluster
>> network.
>
> I just added NFS to the subject because the NFS people are those you
> need to connect with.

Skimming Chris' problem description, I don't think this is an NFS issue,
and I agree with Chris that it's netstack-related behaviour as opposed
to application-related.

Chris, I have minimal cycles at the moment, and your scenario is bending
my brain a little bit too much to give a quick response. A tcpdump
excerpt showing such an exchange would be very useful. I'll try to come
back to it when I have a sec.

Andre, do you have a few cycles to digest this in more detail?

Cheers,
Lawrence
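
[For anyone following the thread: the exchange Chris describes matches
the RFC 793 rules for a SYN arriving on a connection the server still
believes is ESTABLISHED. Below is a toy model of that three-step failure,
not FreeBSD's or Linux's actual code; all function names are invented for
illustration.]

```python
def server_response_to_syn(rcv_nxt):
    # RFC 793: a SYN arriving on an ESTABLISHED connection is answered
    # with a plain ACK carrying the sequence number the server expects
    # next -- never a SYN/ACK. This is the bare ACK Chris observed.
    return ("ACK", rcv_nxt)

def client_reaction(iss, seg):
    # The reconnecting client is in SYN-SENT with initial sequence
    # number `iss`. Per RFC 793, an ACK that does not acknowledge
    # iss + 1 is unacceptable, and the client answers it with a RST.
    kind, ack = seg
    if kind == "ACK" and ack != iss + 1:
        return "RST"
    return None

def nat_forwards(handshake_complete, pkt):
    # The stateful NAT only passes return-path packets for flows whose
    # handshake it has seen complete, so the client's RST is dropped
    # and never reaches the FreeBSD server.
    return pkt if handshake_complete else None

# Example: clientA reconnects but lands on clientB's old NAT mapping,
# so the server ACKs a sequence number from the other conversation.
seg = server_response_to_syn(rcv_nxt=5_000_000)  # stale expectation
rst = client_reaction(iss=1_000, seg=seg)        # client answers RST
assert nat_forwards(False, rst) is None          # NAT drops the RST
```

If that RST did reach the server, the stale connection would be torn
down and the client's next SYN would get a normal SYN/ACK, which is why
Chris's question about letting the RST back through the NAT is the crux
of the problem.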