From: Christopher Penney
To: freebsd-net@freebsd.org
Date: Thu, 11 Nov 2010 09:36:14 -0500
Subject: FreeBSD TCP Behavior with Linux NAT
List-Id: Networking and TCP/IP with FreeBSD

Hi,

I have a curious problem that I'm hoping someone can help with, or at least educate me on. I have several large Linux clusters, and for each one we hide the compute nodes behind a head node using NAT. Historically this has worked very well for us: any time a NAT gateway (the head node) reboots, everything recovers within a minute or two of it coming back up. This includes NFS mounts from Linux and Solaris NFS servers, license server connections, etc.

Recently we added a FreeBSD-based NFS server to our cluster resources and have had significant issues with NFS mounts hanging if the head node reboots. This doesn't happen often, but it does occasionally happen. I've explored this, and FreeBSD's behavior seems to differ a bit from at least Linux and Solaris with respect to TCP recovery. I'm curious whether someone can explain this or offer any workarounds.

Here are some specifics from a test I ran:

Before the reboot, two Linux clients had mounts from the FreeBSD server. Both were using port 903 locally. On the head node, clientA:903 was remapped to headnode:903 and clientB:903 was remapped to headnode:601. There was no activity when the reboot occurred. The head node takes a few minutes to come back up (we kept it down for several minutes).

When it comes back up, clientA and clientB try to reconnect to the FreeBSD NFS server. They both use the same source port as before, but since the head node's conntrack table has been cleared, it's a race to see who gets which port. This time clientA:903 appears as headnode:601 and clientB:903 appears as headnode:903 (>>> they essentially switch places as far as the FreeBSD server can see <<<).

Since the FreeBSD NFS server had no outstanding ACKs it was waiting on, it thinks the old connections are fine, so when it gets a SYN from each of the two clients it responds with only an ACK.
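To make the port swap concrete, here is a toy model of it in Python. It is only a sketch under stated assumptions: the `ToyNat` class, its "keep the client's port if it is free, otherwise take a fallback" policy, and the client addresses 10.1.0.11 and 10.1.0.12 are all made up for illustration and are not how netfilter actually allocates ports. Only the ports 903 and 601 and the address 1.2.3.4 come from the test described above.

```python
# Toy model of the NAT port swap. ToyNat and its allocation policy are
# illustrative assumptions, not netfilter's real algorithm.

class ToyNat:
    """Source NAT that keeps the client's port if free, else uses a fallback."""

    def __init__(self, public_ip, fallback_ports):
        self.public_ip = public_ip
        self.fallback_ports = list(fallback_ports)
        self.table = {}  # (client_ip, client_port) -> public port

    def translate(self, client_ip, client_port):
        key = (client_ip, client_port)
        if key not in self.table:
            in_use = set(self.table.values())
            if client_port not in in_use:
                port = client_port
            else:
                port = next(p for p in self.fallback_ports if p not in in_use)
            self.table[key] = port
        return (self.public_ip, self.table[key])

    def reboot(self):
        # All connection-tracking state is lost on reboot.
        self.table.clear()


CLIENT_A = "10.1.0.11"  # hypothetical compute node addresses
CLIENT_B = "10.1.0.12"

nat = ToyNat("1.2.3.4", fallback_ports=[601])

# Before the reboot: clientA happens to connect first and keeps port 903.
before = {
    "clientA": nat.translate(CLIENT_A, 903),
    "clientB": nat.translate(CLIENT_B, 903),
}

# What the server believes: one established connection per public (ip, port).
established = {addr: name for name, addr in before.items()}

nat.reboot()

# After the reboot: clientB wins the race this time.
after = {
    "clientB": nat.translate(CLIENT_B, 903),
    "clientA": nat.translate(CLIENT_A, 903),
}

for name in sorted(after):
    addr = after[name]
    print(f"{name} now arrives as {addr}; server still ties that to {established[addr]}")
```

In this model each reconnecting SYN arrives on an address/port pair that the server still associates with the other client's old, idle connection, which is the situation the server is responding to.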
The ACK it sends in reply to each client is bogus (invalid sequence number), because it's using the return path the other client was using before the reboot. The client therefore sends a RST back, but the RST never gets to the FreeBSD system, since the head node's NAT hasn't yet seen the full handshake (which would allow return packets). The end result is a "permanent" hang (at least until the idle TCP connections would otherwise be cleaned up).

This is in stark contrast to the behavior of the other systems we have, which respond to the reconnecting SYN with a SYN/ACK. They appear to implicitly tear down the return path upon getting a SYN on a seemingly already-established connection. I'm assuming this is one of the grey areas where no specific behavior is outlined in an RFC?

Is there any way to make the FreeBSD system more reliable in this situation (such as making it implicitly tear down the return path)? Or is there a way to adjust the NAT setup to allow the RST to reach the FreeBSD system?

Currently, NAT is set up with simply:

    iptables -t nat -A POSTROUTING -s 10.1.0.0/16 -o bond0 -j SNAT --to 1.2.3.4

where 1.2.3.4 is the intranet address and 10.1.0.0/16 is the cluster network.

Thanks!

Chris