From: Bill Moran
To: Adam McDougall
Cc: freebsd-net@freebsd.org
Date: Wed, 13 Jun 2007 08:24:43 -0400
Subject: Re: Weird "ignoring syn" problem
Message-Id: <20070613082443.80d54fd1.wmoran@collaborativefusion.com>
In-Reply-To: <20070612180349.GN23144@egr.msu.edu>

In response to Adam McDougall:

> On Tue, Jun 12, 2007 at 10:19:49AM -0400, Bill Moran wrote:
>
> This one has got me pretty befuddled.
>
> We're seeing some really odd behaviour with FreeBSD ignoring SYN packets.
> I've been trying to diagnose this for a couple of weeks now, and my current
> guess is that there's something wrong with the em driver. Here's a narrowed
> down list of what I've ruled out:
> *) I've done my best to eliminate other network components as the problem.
>    My theory at this point is that it can't possibly be any other network
>    hardware, based on the tcpdump shown below.
> *) The problem occurred on both FreeBSD 6.1 and FreeBSD 6.2-p3.
> *) The problem does not appear to be tied to CPU usage -- the CPU is nearly
>    idle when the problem occurs.
> *) I can now reproduce it pretty easily, so I'll know when it's fixed.
> *) The system exhibiting the problem is running 15 jails, but they are
>    idle 95% of the time. The problem initially occurred inside one of
>    the jails, but I just recreated it outside the jail (on the host) and
>    it's _easier_ to reproduce outside the jail.
> *) The problem occurred with both GENERIC and the SMP kernel (this is a
>    dual-CPU, hyperthreaded system).
> *) I've tested, and the behavior occurs both with a dynamically generated
>    file (from PHP) and with a static file.
>
> The nature of the beast is that we've got a SOAP application running under
> Apache and PHP. This application is subject to many requests in rapid
> succession, such that load can be simulated by the following loop:
>
> while true; do fetch http://192.168.121.250/test.php; done
>
> The problem is that occasionally, the Apache server machine just ignores
> SYN packets.
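The fetch loop times whole HTTP requests. To isolate the specific symptom (an unanswered SYN), one option is to time only the TCP handshake: a connect() that takes roughly 3 seconds or more has almost certainly needed at least one retransmitted SYN. The Python sketch below is a hypothetical helper and does not come from the thread. Its 2.5 s threshold is an assumption, based on the gaps of about 3 s between SYN copies in the capture quoted in this message.

```python
import socket
import time

# Assumption: the client's first SYN retransmit fires about 3 s after the
# original SYN (the capture in this thread shows gaps of ~3.0 s and ~3.2 s).
# A handshake slower than this threshold most likely needed a retransmitted SYN.
SLOW_HANDSHAKE_SECONDS = 2.5


def time_handshake(host: str, port: int, timeout: float = 60.0) -> float:
    """Return the seconds spent in connect(), i.e. the TCP three-way handshake."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        # The connection is established here; no HTTP request is sent.
        elapsed = time.monotonic() - start
    return elapsed


def probe(host: str, port: int, attempts: int) -> list[float]:
    """Open `attempts` connections back to back; print and return the slow ones."""
    slow = []
    for i in range(attempts):
        elapsed = time_handshake(host, port)
        if elapsed >= SLOW_HANDSHAKE_SECONDS:
            print(f"attempt {i}: handshake took {elapsed:.3f} s")
            slow.append(elapsed)
    return slow
```

Run from the client, something like `probe("192.168.121.250", 80, attempts=1000)` would list each slow handshake against the test server. Because the probe closes each connection itself, Apache only sees a connect and close, and each closed socket stays in TIME_WAIT on the client.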
> Take the following tcpdump output for example:
>
> 13:34:17.312296 IP web04-v100.cust00.pitbpa1.priv.collaborativefusion.com.54808 > anchor-is00.is.pitbpa1.priv.collaborativefusion.com.http: S 2645061726:2645061726(0) win 65535
> 13:34:20.312398 IP web04-v100.cust00.pitbpa1.priv.collaborativefusion.com.54808 > anchor-is00.is.pitbpa1.priv.collaborativefusion.com.http: S 2645061726:2645061726(0) win 65535
> 13:34:23.512626 IP web04-v100.cust00.pitbpa1.priv.collaborativefusion.com.54808 > anchor-is00.is.pitbpa1.priv.collaborativefusion.com.http: S 2645061726:2645061726(0) win 65535
>
> This is the _only_ traffic on port 80 during the test. It looks like the
> kernel has ignored the initial syn packet and two duplicates. I've seen it
> take as long as 45 seconds to establish a connection, and this causes
> ugly performance problems, as well as frequent timeouts on the client end.
> The only clue I've found so far is this output from netstat -s.
>
> Does the Apache server have a firewall of any sort? (Could be making unexpected
> decisions there, even not part of a fw rule)
>
> Try net.inet.ip.portrange.randomized=0 on the client? (If this is the problem,
> we would probably see a reused port if you had a tcpdump of a few minutes
> if started after waiting for several minutes of "silence")
>
> Are both systems on the same subnet? If not, can/have you tried that?

No, they aren't. My ability to test on the same subnet is limited and the
results inconclusive.

> Can you show tcpdump output using -e on the requests that aren't answered
> as well as an example that IS answered? (I have seen routers mess up the MAC
> addresses for the source and destination and if I kept staring at layer 3
> data all day I might never have seen the problem)
>
> Better yet, can you post files containing tcpdump output using -w of an entire
> session that ideally contains failed attempts that eventually work?
> That way people could look at a broader picture and perhaps pick up on
> something subtle. It's worth comparing a SYN that works directly with a
> SYN that doesn't work.

We've decided to swap the card out on Friday and see if that resolves the
problem. We have similar units that don't exhibit the problem, so I'm
getting pretty suspicious that this might be a flaky NIC. If the new card
doesn't solve the problem, I'll post more details on Monday.

-- 
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/
wmoran@collaborativefusion.com
Phone: 412-422-3463x4023
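In a capture like the one quoted above, an ignored SYN shows up as the same source port and sequence number appearing several times. Once a longer capture exists, a small script can list every such retransmitted SYN and the gaps between its copies. The Python sketch below is hypothetical and does not come from the thread. It assumes tcpdump's one-line text format as shown in this thread (`HH:MM:SS.ffffff IP src.port > dst.port: S seq:seq(0) ...`). Newer tcpdump releases print flags as `Flags [S]`, which this regex does not match.

```python
import re
from collections import defaultdict

# Assumed tcpdump text format, as in the capture quoted in this thread:
# 13:34:17.312296 IP a.54808 > b.http: S 2645061726:2645061726(0) win 65535
SYN_LINE = re.compile(
    r"^(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2}\.\d+) IP "
    r"(?P<src>\S+) > (?P<dst>\S+): S (?P<seq>\d+):\d+\(0\)"
)


def syn_retransmits(lines):
    """Group client SYNs by (src, dst, seq); keep only those seen more than once.

    Returns {(src, dst, seq): [gap_seconds, ...]} with the gaps between
    successive copies. Assumes lines are in capture order and that the
    capture does not cross midnight.
    """
    seen = defaultdict(list)
    for line in lines:
        m = SYN_LINE.match(line)
        # Skip non-SYN lines and SYN-ACKs (old tcpdump prints "... ack N").
        if not m or " ack " in line:
            continue
        t = int(m["h"]) * 3600 + int(m["m"]) * 60 + float(m["s"])
        seen[(m["src"], m["dst"], m["seq"])].append(t)
    gaps = {}
    for key, times in seen.items():
        if len(times) > 1:
            gaps[key] = [round(b - a, 3) for a, b in zip(times, times[1:])]
    return gaps
```

Applied to the three lines quoted above, this reports one connection with gaps of about 3.0 s and 3.2 s. A file written with `tcpdump -w` can be turned back into text with `tcpdump -n -r` before feeding it in.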