From: Julian Elischer <julian@freebsd.org>
To: Christopher Penney
Cc: freebsd-net@freebsd.org
Subject: Re: NFS + FreeBSD TCP Behavior with Linux NAT
Date: Thu, 11 Nov 2010 12:39:44 -0800
Message-ID: <4CDC5490.7030109@freebsd.org>
List-Id: Networking and TCP/IP with FreeBSD

On 11/11/10 6:36 AM, Christopher Penney wrote:
> Hi,
>
> I have a curious problem I'm hoping someone can help with, or at least
> educate me on.
>
> I have several large Linux clusters, and for each one we hide the
> compute nodes behind a head node using NAT.
> Historically this has worked very well for us: any time a NAT gateway
> (the head node) reboots, everything recovers within a minute or two of
> it coming back up. This includes NFS mounts from Linux and Solaris NFS
> servers, license server connections, etc.
>
> Recently we added a FreeBSD-based NFS server to our cluster resources
> and have had significant issues with NFS mounts hanging if the head node
> reboots. We don't have this happen much, but it does occasionally
> happen. I've explored this, and it seems the behavior of FreeBSD differs
> a bit from at least Linux and Solaris with respect to TCP recovery. I'm
> curious if someone can explain this or offer any workarounds.
>
> Here are some specifics from a test I ran:
>
> Before the reboot, two Linux clients were mounting the FreeBSD server.
> They were both using port 903 locally. On the head node, clientA:903 was
> remapped to headnode:903 and clientB:903 was remapped to headnode:601.
> There is no activity when the reboot occurs. The head node takes a few
> minutes to come back up (we kept it down for several minutes).
>
> When it comes back up, clientA and clientB try to reconnect to the
> FreeBSD NFS server. They both use the same source port, but since the
> head node's conntrack table is cleared it's a race to see who gets what
> port, and this time clientA:903 appears as headnode:601 and clientB:903
> appears as headnode:903 (>>> they essentially switch places as far as
> the FreeBSD server would see <<<).
>
> The FreeBSD NFS server, since there were no outstanding ACKs it was
> waiting on, thinks things are OK, so when it gets a SYN from the two
> clients it responds with only an ACK.
> The ACK it replies with to each client is bogus (invalid sequence
> number), because it's using the return path the other client was using
> before the reboot, so the client sends a RST back. But the RST never
> reaches the FreeBSD system, since the head node's NAT hasn't yet seen
> the full handshake (which would allow return packets through). The end
> result is a "permanent" hang (at least until the server would otherwise
> clean up idle TCP connections).
>
> This is in stark contrast to the behavior of the other systems we have.
> Other systems respond to the SYN used to reconnect with a SYN/ACK. They
> appear to implicitly tear down the return path on getting a SYN for a
> seemingly already-established connection.
>
> I'm assuming this is one of the grey areas where no specific behavior is
> outlined in an RFC? Is there any way to make the FreeBSD system more
> reliable in this situation (like making it implicitly tear down the
> return path)? Or is there a way to adjust the NAT setup to allow the RST
> to return to the FreeBSD system? Currently, NAT is set up simply with:
>
> iptables -t nat -A POSTROUTING -s 10.1.0.0/16 -o bond0 -j SNAT --to 1.2.3.4
>
> where 1.2.3.4 is the intranet address and 10.1.0.0/16 is the cluster
> network.

I just added NFS to the subject because the NFS people are those you
need to connect with.

> Thanks!
>
> Chris
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
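[Editor's note] On the NAT-adjustment question, one possible avenue on a Linux head node is to relax netfilter's TCP connection tracking so that segments failing the sequence-number checks (such as the client's RST sent in reply to the stale ACK) are less likely to be dropped as INVALID before SNAT can translate them. This is an untested sketch, not a confirmed fix: the sysctl names are the standard nf_conntrack knobs, but whether they actually let this particular RST through depends on the kernel's conntrack TCP state machine and version in use.

```shell
# Untested sketch for the Linux NAT head node (assumes nf_conntrack).
# Be liberal about out-of-window TCP segments instead of marking them
# INVALID, so a RST answering a stale ACK has a chance of being SNATed
# back out to the FreeBSD server.
sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1

# Allow conntrack to pick up already-established flows mid-stream
# (relevant right after a reboot clears the conntrack table).
sysctl -w net.netfilter.nf_conntrack_tcp_loose=1

# The existing SNAT rule from the original message, unchanged
# (1.2.3.4 = intranet address, 10.1.0.0/16 = cluster network):
iptables -t nat -A POSTROUTING -s 10.1.0.0/16 -o bond0 -j SNAT --to 1.2.3.4
```

Even if the RST does get through, note that it only converts the "permanent" hang into a faster reset; it does not change FreeBSD's choice to answer the reconnect SYN with a bare ACK.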