Date:        Wed, 6 Sep 2006 18:32:04 +0400
From:        Gleb Smirnoff <glebius@FreeBSD.org>
To:          Mike Silbersack <silby@silby.com>
Cc:          cvs-src@FreeBSD.org, src-committers@FreeBSD.org, cvs-all@FreeBSD.org
Subject:     Re: cvs commit: src/sys/netinet in_pcb.c tcp_subr.c tcp_timer.c tcp_var.h
Message-ID:  <20060906143204.GQ40020@FreeBSD.org>
In-Reply-To: <20060906091204.B6691@odysseus.silby.com>
References:  <200609061356.k86DuZ0w016069@repoman.freebsd.org> <20060906091204.B6691@odysseus.silby.com>
  Mike,

On Wed, Sep 06, 2006 at 09:16:03AM -0500, Mike Silbersack wrote:
M> > Modified files:
M> >   sys/netinet          in_pcb.c tcp_subr.c tcp_timer.c tcp_var.h
M> > Log:
M> > o Backout rev. 1.125 of in_pcb.c. It appeared to behave extremely
M> >   badly under high load. For example, with 40k sockets and 25k tcptw
M> >   entries, a connect() syscall can run for seconds. Debugging showed
M> >   that it iterates the cycle millions of times and purges thousands
M> >   of tcptw entries at a time.
M> >   Besides being practically unusable, this change is architecturally
M> >   wrong. First, in_pcblookup_local() is used in the connect() and
M> >   bind() syscalls. No purging of stale entries should be done here.
M> >   Second, it is a layering violation.
M> 
M> So you're returning to the behavior where the system chokes and stops
M> all outbound TCP connections because everything is in the timewait
M> state?  There has to be a way to fix the problem without removing this
M> heuristic entirely.
M> 
M> How did you run your tests?

Since we upgraded our web frontends from RELENG_4 to RELENG_6 half a year
ago, we had been noticing a small packet loss rate that was clearly a
function of the server load. If we removed half of the frontends from the
farm, thus doubling the load, the lags could be measured in seconds, even
though there was idle CPU time between the lags. Our reference RELENG_4
box handled that load easily.

At first we suspected some driver or lower network stack issue, so we
tried different hardware and different network settings - polling, no
polling, direct ISR dispatch. Then we found the CPU hog in
in_pcblookup_local(). I added counters and gathered stats via ktr(4).
When a lag occurred, the following data was gathered:

  112350 return 0x0, iterations 0, expired 0
  112349 return 0xc5154888, iterations 19998, expired 745
  112348 return 0xc5154930, iterations 569, expired 20
  112347 return 0xc51549d8, iterations 2084890, expired 9836
  112346 return 0xc5154a80, iterations 9382, expired 524
  112345 return 0xc5154bd0, iterations 64984631, expired 5501

The "iterations" counter counts the number of iterations of this cycle:

	LIST_FOREACH(inp, &phd->phd_pcblist, inp_portlist)

The "expired" counter counts the number of tcp_twclose() calls.

So, for one connect() syscall, in_pcblookup_local() was called 5 times,
each time doing an enormous amount of "work" inside. On the sixth call it
succeeded in finding an unused port.

M> > o Return back the tcptw purging cycle to tcp_timer_2msl_tw(),
M> >   that was removed in rev. 1.78 by rwatson. The commit log of that
M> >   revision says nothing about the reason the cycle was removed. Now
M> >   we need this cycle, since the major cleaner of stale tcptw
M> >   structures is removed.
M> 
M> Looks good, this is probably the reason for the code in in_pcb behaving
M> so poorly. Did you test just this change alone to see if it solved the
M> problem that you were seeing?

Rev. 1.78 hasn't yet been merged to RELENG_6, and we faced the problem on
RELENG_6 boxes where the periodic purging cycle is still present. So the
problem is not in rev. 1.78 of tcp_timer.c. We have a lot of tcptw
entries because we have a very high connection rate, not because they are
leaked or not purged.

-- 
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE
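The per-call instrumentation described above - an iteration counter on the
port-list walk and an "expired" counter on tcp_twclose() calls, logged via
ktr(4) on return - could look roughly like the sketch below inside
in_pcblookup_local(). This is a sketch, not the actual debugging patch
(which was never posted); in particular, in_pcb_is_stale_tw() and
in_pcb_matches() are hypothetical placeholders for the rev. 1.125 staleness
check and the real local-address/port match test.

	struct inpcb *inp, *tmp, *match = NULL;
	u_long iterations = 0, expired = 0;

	/* Safe variant: tcp_twclose() frees the inpcb we are standing on. */
	LIST_FOREACH_SAFE(inp, &phd->phd_pcblist, inp_portlist, tmp) {
		iterations++;
		if (in_pcb_is_stale_tw(inp)) {	/* hypothetical: rev. 1.125 check */
			tcp_twclose(intotw(inp), 0);
			expired++;
			continue;
		}
		if (in_pcb_matches(inp, laddr, lport))	/* hypothetical match test */
			match = inp;
	}
	/* Emit one ktr(4) record per in_pcblookup_local() call. */
	CTR3(KTR_NET, "return %p, iterations %lu, expired %lu",
	    match, iterations, expired);

The numbers in the log above are then simply these two counters per call,
which is what shows a single connect() burning tens of millions of list
iterations.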
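The purging cycle returned to tcp_timer_2msl_tw() is, in rough outline, a
walk of the 2MSL queue that closes every expired timewait entry from timer
context instead of from connect()/bind(). A minimal sketch, assuming the
twq_2msl queue and tw_time field of the FreeBSD 6-era sources; this is the
general shape, not the literal committed code:

	struct tcptw *tw;

	/*
	 * twq_2msl is ordered by expiry time, so stop at the first
	 * entry that has not yet timed out.  tcp_twclose() unlinks
	 * the entry from the queue and frees it.
	 */
	while ((tw = TAILQ_FIRST(&twq_2msl)) != NULL &&
	    tw->tw_time - ticks <= 0)
		tcp_twclose(tw, 0);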