From: David Cornejo
To: Bill Moran, freebsd-net@freebsd.org
Date: Mon, 25 Jun 2007 11:13:43 -1000
Subject: Re: Weird "ignoring syn" problem
Message-ID: <46803010.1ed6720a.6149.3082@mx.google.com>
In-Reply-To: <20070625142740.3b6964c0.wmoran@collaborativefusion.com>
References: <20070612101949.646dcaa5.wmoran@collaborativefusion.com>
 <20070612180349.GN23144@egr.msu.edu>
 <20070613082443.80d54fd1.wmoran@collaborativefusion.com>
 <20070625142740.3b6964c0.wmoran@collaborativefusion.com>
List-Id: Networking and TCP/IP with FreeBSD

At 08:27 AM 6/25/2007, Bill Moran wrote:
> In response to Bill Moran:
>
> > In response to Adam McDougall:
> >
> > > On Tue, Jun 12, 2007 at 10:19:49AM -0400, Bill
> > > Moran wrote:
> > >
> > > This one has got me pretty befuddled.
> > >
> > > We're seeing some really odd behaviour with FreeBSD ignoring SYN packets.
> > > I've been trying to diagnose this for a couple of weeks now, and my current
> > > guess is that there's something wrong with the em driver.  Here's a narrowed
> > > down list of what I've ruled out:
> > > *) I've done my best to eliminate other network components as the problem.
> > >    My theory at this point is that it can't possibly be any other network
> > >    hardware, based on the tcpdump shown below.
> > > *) The problem occurred on both FreeBSD 6.1 and FreeBSD 6.2-p3.
> > > *) The problem does not appear to be tied to CPU usage -- the CPU is nearly
> > >    idle when the problem occurs.
> > > *) I can now reproduce it pretty easily, so I'll know when it's fixed.
> > > *) The system exhibiting the problem is running 15 jails, but they are
> > >    idle 95% of the time.  The problem initially occurred inside one of
> > >    the jails, but I just recreated it outside the jail (on the host) and
> > >    it's _easier_ to reproduce outside the jail.
> > > *) The problem occurred with both GENERIC, and the SMP kernel (this is a
> > >    dual-CPU, hyperthreaded system)
> > > *) I've tested and the behavior occurs both with a dynamically generated
> > >    file (from PHP) or from a static file.
> > >
> > > The nature of the beast is that we've got a SOAP application running under
> > > Apache and PHP.  This application is subject to many requests in rapid
> > > succession, such that load can be simulated by the following loop:
> > >
> > > while true; do fetch http://192.168.121.250/test.php; done
> > >
> > > The problem is that occasionally, the Apache server machine just ignores
> > > SYN packets.
> > > Take the following tcpdump output for example:
> > >
> > > 13:34:17.312296 IP web04-v100.cust00.pitbpa1.priv.collaborativefusion.com.54808 >
> > >     anchor-is00.is.pitbpa1.priv.collaborativefusion.com.http: S 2645061726:2645061726(0)
> > >     win 65535 1,nop,nop,timestamp 2690201156 0,sackOK,eol>
> > > 13:34:20.312398 IP web04-v100.cust00.pitbpa1.priv.collaborativefusion.com.54808 >
> > >     anchor-is00.is.pitbpa1.priv.collaborativefusion.com.http: S 2645061726:2645061726(0)
> > >     win 65535 1,nop,nop,timestamp 2690204156 0,sackOK,eol>
> > > 13:34:23.512626 IP web04-v100.cust00.pitbpa1.priv.collaborativefusion.com.54808 >
> > >     anchor-is00.is.pitbpa1.priv.collaborativefusion.com.http: S 2645061726:2645061726(0)
> > >     win 65535 1,nop,nop,timestamp 2690207356 0,sackOK,eol>
> > >
> > > This is the _only_ traffic on port 80 during the test.  It looks like the
> > > kernel has ignored the initial syn packet and two duplicates.  I've seen it
> > > take as long as 45 seconds to establish a connection, and this causes
> > > ugly performance problems, as well as frequent timeouts on the client end.
> > > The only clue I've found so far is this output from netstat -s.
> > >
> > > Does the Apache server have a firewall of any sort?  (Could be making unexpected
> > > decisions there, even not part of a fw rule)
> > >
> > > Try net.inet.ip.portrange.randomized=0 on the client?  (If this is the problem,
> > > we would probably see a reused port if you had a tcpdump of a few minutes
> > > if started after waiting for several minutes of "silence")
> > >
> > > Are both systems on the same subnet?  If not, can/have you tried that?
> >
> > No, they aren't.  My ability to test on the same subnet is limited and
> > the results inconclusive.
> >
> > > Can you show tcpdump output using -e on the requests that aren't answered
> > > as well as an example that IS answered?
> > > (I have seen routers mess up the MAC
> > > addresses for the source and destination and if I kept staring at layer 3
> > > data all day I might never have seen the problem)
> > >
> > > Better yet, can you post files containing tcpdump output using -w of an entire
> > > session that ideally contains failed attempts that eventually work?  That way
> > > people could look at a broader picture and perhaps pick up on something subtle.
> > > It's worth comparing a SYN that works, directly with a SYN that doesn't work.
> >
> > We've decided to swap the card out on Friday and see if that resolves the
> > problem.  We have similar units that don't exhibit the problem, so I'm
> > getting pretty suspicious that this might be a flaky NIC.  If the new
> > card doesn't solve the problem, I'll post more details on Monday.
>
> Just in case someone was curious as to the result, or finds this on a web
> search.
>
> The behaviour was apparently hardware related.  We swapped the NIC out and
> can no longer reproduce the problem.

To follow up on my situation: over the weekend I took the Soekris box
that demonstrated the bad TCP checksums, wiped it, then reinstalled the
same vintage CURRENT, and the problem disappeared.  I used the same
kernel config in both cases.

Thanks to those who replied...

dave c
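
For anyone who lands here from a web search and wants to check their own captures for the same symptom: the quoted trace shows one SYN (same source port, same sequence number) sent three times with no reply. Below is a minimal Python sketch, not something anyone in the thread used, that scans tcpdump text output for repeated SYNs and reports the gaps between them. The function name, the regular expression, and the SAMPLE list are illustrative assumptions. The regex only understands the older `S seq:seq(0)` text layout seen in the trace above, and the sample lines are copied from that trace, unwrapped, with the option list left exactly as it appears in the archive.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed line layout (taken from the trace quoted in this thread):
#   HH:MM:SS.uuuuuu IP src > dst: S seq:seq(0) ...
SYN_LINE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d{6}) IP (?P<src>\S+) > (?P<dst>\S+): "
    r"S (?P<seq>\d+):\d+\(0\)"
)

def find_retransmitted_syns(lines):
    """Map (src, dst, seq) to the gaps in seconds between repeated SYNs.

    Only keys seen more than once are returned. A SYN-ACK also carries
    the S flag, but it comes from the other endpoint with its own
    sequence number, so it lands under a different key.
    """
    seen = defaultdict(list)
    for line in lines:
        m = SYN_LINE.match(line.strip())
        if m:
            stamp = datetime.strptime(m["ts"], "%H:%M:%S.%f")
            seen[(m["src"], m["dst"], m["seq"])].append(stamp)

    repeats = {}
    for key, stamps in seen.items():
        if len(stamps) > 1:
            repeats[key] = [
                round((later - earlier).total_seconds(), 3)
                for earlier, later in zip(stamps, stamps[1:])
            ]
    return repeats

# The three lines from the quoted trace, unwrapped.
SRC = "web04-v100.cust00.pitbpa1.priv.collaborativefusion.com.54808"
DST = "anchor-is00.is.pitbpa1.priv.collaborativefusion.com.http"
SAMPLE = [
    f"13:34:17.312296 IP {SRC} > {DST}: S 2645061726:2645061726(0) "
    "win 65535 1,nop,nop,timestamp 2690201156 0,sackOK,eol>",
    f"13:34:20.312398 IP {SRC} > {DST}: S 2645061726:2645061726(0) "
    "win 65535 1,nop,nop,timestamp 2690204156 0,sackOK,eol>",
    f"13:34:23.512626 IP {SRC} > {DST}: S 2645061726:2645061726(0) "
    "win 65535 1,nop,nop,timestamp 2690207356 0,sackOK,eol>",
]

if __name__ == "__main__":
    for (src, dst, seq), gaps in find_retransmitted_syns(SAMPLE).items():
        print(f"{src} -> {dst} seq {seq}: resent after {gaps} s")
```

On the sample lines this reports a single connection attempt with gaps of roughly 3.0 and 3.2 seconds, matching the timestamps in the trace. It only tells you that SYNs went unanswered; as the thread shows, the cause may still be the NIC, the driver, or something else entirely.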