Date: Sat, 10 Nov 2007 02:13:22 -0600 (CST)
From: Mike Silbersack <silby@silby.com>
To: Matt Reimer <mattjreimer@gmail.com>
Cc: net@freebsd.org
Subject: Re: Should syncache.count ever be negative?
Message-ID: <20071110020333.I46803@odysseus.silby.com>
In-Reply-To: <f383264b0711092323p5148300fu3c0883135f8fb01b@mail.gmail.com>
References: <f383264b0711091609n81875b6v444055960ab0fd96@mail.gmail.com> <20071109213846.O46803@odysseus.silby.com> <f383264b0711092323p5148300fu3c0883135f8fb01b@mail.gmail.com>

On Fri, 9 Nov 2007, Matt Reimer wrote:

> Ok, I've run netperf in both directions. The box I've been targeting
> is 66.230.193.105 aka wordpress1.

Ok, at least that looks good.

> The machine is a Dell 1950 with 8 x 1.6GHz Xeon 5310s, 8G RAM, and this NIC:

Nice.

> I first noticed this problem running ab; then to simplify I used
> netrate/http[d]. What's strange is that it seems fine over the local
> network (~15800 requests/sec), but it slowed down dramatically (~150
> req/sec) when tested from another network 20 ms away. Running systat
> -tcp and nload I saw that there was an almost complete stall with only
> a handful of packets being sent (probably my ssh packets) for a few
> seconds or sometimes even up to 60 seconds or so.

I think most benchmarking tools end up stalling if all of their threads
stall; that may be why the rate falls off after the misbehavior you
describe below begins.

> Nov 9 19:02:34 wordpress1 kernel: TCP: [207.210.67.2]:64851 to
> [66.230.193.105]:80; syncache_socket: Socket create failed due to
> limits or memory shortage
> Nov 9 19:02:34 wordpress1 kernel: TCP: [207.210.67.2]:64851 to
> [66.230.193.105]:80 tcpflags 0x10<ACK>; tcp_input: Listen socket:
> Socket allocation failed due to limits or memory shortage, sending RST

From my reading of the code, you'll generally get both of those error
messages together.

Since you eliminated memory shortage in the socket zone, the next thing
to check is the length of the listen queues. If the listen queue is
backing up because the application isn't accepting fast enough, the
errors above should happen. "netstat -Lan" should show you what's going
on there. Upping the specified listen queue length in your webserver
_may_ be all that is necessary. Try fiddling with that and watching how
much the queues fill up during testing.

The fact that you see the same port repeatedly may indicate that the
syncache isn't destroying the syncache entries when you get the socket
creation failure. Take a look at "netstat -n" and look for SYN_RECEIVED
entries - if they're sticking around for more than a few seconds, this
is probably what's happening. (This entire paragraph is speculation,
but worth investigating.)

> I don't know if it's relevant, but accf_http is loaded on wordpress1.

That may be relevant - accept filtering changes how the listen queues
are used. Try going back to non-accept filtering for now.

> We have seen similar behavior (TCP slowdowns) on a different machines
> (4 x Xeon 5160) with a different NIC (em0) running RELENG_7, though I
> haven't diagnosed it to this level of detail. All our RELENG_6 and
> RELENG_4 machines seem fine.

em is the driver that I was having issues with when it shared an
interrupt... :)

FWIW, my crazy theory of the moment is this: We have some bug that
happens when the listen queues overflow in 7.0, and your test is
strenuous enough to hit the listen queue overflow condition, leading to
total collapse. I'll have to cobble together a test program to see what
happens in the listen queue overflow case.

Thanks for the quick feedback,

-Mike
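
The listen queue check suggested above can be scripted so it is easy to
repeat during a test run. What follows is only a minimal sketch in
Python, not something from this thread. It assumes the usual FreeBSD
"netstat -L" layout, where the second column holds
qlen/incqlen/maxqlen. The function names, the 80% threshold, and the
sample text are all made up for illustration.

```python
import re

# Matches the qlen/incqlen/maxqlen column, e.g. "120/0/128".
QUEUE_RE = re.compile(r"^(\d+)/(\d+)/(\d+)$")

# Invented sample in the assumed "netstat -Lan" layout; not real output
# from the machines discussed in this thread.
SAMPLE = """\
Current listen queue sizes (qlen/incqlen/maxqlen)
Proto Listen         Local Address
tcp4  0/0/128        *.22
tcp4  120/0/128      *.80
"""


def parse_listen_queues(text):
    """Return (proto, local_addr, qlen, incqlen, maxqlen) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        match = QUEUE_RE.match(fields[1])
        if match is None:
            continue  # header line or anything else we don't recognize
        qlen, incqlen, maxqlen = (int(g) for g in match.groups())
        rows.append((fields[0], fields[2], qlen, incqlen, maxqlen))
    return rows


def nearly_full(rows, threshold=0.8):
    """Return rows whose qlen is at least threshold * maxqlen.

    The 0.8 default is an arbitrary illustration value.
    """
    return [r for r in rows if r[4] > 0 and r[2] >= threshold * r[4]]


if __name__ == "__main__":
    for proto, addr, qlen, incqlen, maxqlen in nearly_full(
        parse_listen_queues(SAMPLE)
    ):
        print(f"{proto} {addr}: {qlen}/{maxqlen} queued")
```

To use it against a live box, feed it the captured text of
"netstat -Lan" instead of SAMPLE and run it every second or so while
the benchmark is going. A listener whose qlen sits near maxqlen would
be consistent with the application not accepting fast enough.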