Date: Sun, 21 Mar 2010 10:15:58 +0000 (GMT)
From: Robert Watson <rwatson@FreeBSD.org>
To: Kip Macy <kmacy@FreeBSD.org>
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject: Re: svn commit: r205391 - head/sys/netinet
Message-ID: <alpine.BSF.2.00.1003210953320.80510@fledge.watson.org>
In-Reply-To: <201003201947.o2KJlUUA070546@svn.freebsd.org>
References: <201003201947.o2KJlUUA070546@svn.freebsd.org>
On Sat, 20 Mar 2010, Kip Macy wrote:

> - spread tcp timer callout load evenly across cpus if
>   net.inet.tcp.per_cpu_timers is set to 1
> - don't default to acquiring tcbinfo lock exclusively in rexmt
>
> MFC after: 7 days

In the future, it would be helpful if you could make independent changes such
as these as separate commits.  It makes post-commit review easier, but also
means that elements of the change can be more easily backed out or merged
separately.

My experience with TCP timer and locking changes is that 7 days is a very
short merge time.  I would generally suggest a minimum of a month -- while
bugs crop up more slowly in HEAD testing, the bug reports are better and
you're more likely to get people with expertise looking at them there
(without the risk of taking out large numbers of systems :-).  This is
especially true currently, when several folks actively have their hands in
TCP.  For more complex changes, I've generally been going with a three-month
MFC timeout.

> +static int per_cpu_timers = 0;
> +SYSCTL_INT(_net_inet_tcp, OID_AUTO, per_cpu_timers, CTLFLAG_RW,
> +    &per_cpu_timers , 0, "run tcp timers on all cpus");
> +
> +#define INP_CPU(inp) (per_cpu_timers ? (!CPU_ABSENT(((inp)->inp_flowid % (mp_maxid+1))) ? \
> +    ((inp)->inp_flowid % (mp_maxid+1)) : curcpu) : 0)

The 'curcpu' case here violates an invariant we have been trying hard to
maintain: that callouts for a single inpcb/tcpcb execute on only one CPU at a
time.  While I don't have any specific bugs in mind (well, perhaps other than
the well-known TCP timer race, which it turns out does occur with moderate
frequency), it is fairly certain that increasing single-connection
parallelism in callouts would significantly increase the chances of hitting
those bugs/features.

Long and hard experience suggests that changing assumptions in the TCP timer
code can have subtle but ultimately catastrophic consequences (both Andre and
I have run into this in the past decade).  Since the 'curcpu' case above is
somewhat unlikely on the hardware you're using, can I suggest changing it to
fall back to CPU 0 in that case as well (see the sketch further down)?  This
would maintain the parallelism you're trying to accomplish but avoid an edge
case that could have hard-to-track-down consequences.  It would also increase
my comfort level with an MFC before 8.1.

> @@ -478,11 +485,22 @@ tcp_timer_rexmt(void * xtp)
>  	if (++tp->t_rxtshift > TCP_MAXRXTSHIFT) {
>  		tp->t_rxtshift = TCP_MAXRXTSHIFT;
>  		TCPSTAT_INC(tcps_timeoutdrop);
> +		in_pcbref(inp);
> +		INP_INFO_RUNLOCK(&V_tcbinfo);
> +		INP_WUNLOCK(inp);
> +		INP_INFO_WLOCK(&V_tcbinfo);
> +		INP_WLOCK(inp);
> +		if (in_pcbrele(inp)) {
> +			INP_INFO_WUNLOCK(&V_tcbinfo);
> +			CURVNET_RESTORE();
> +			return;
> +		}
>  		tp = tcp_drop(tp, tp->t_softerror ?
>  		    tp->t_softerror : ETIMEDOUT);
> +		headlocked = 1;
>  		goto out;
>  	}
> -	INP_INFO_WUNLOCK(&V_tcbinfo);
> +	INP_INFO_RUNLOCK(&V_tcbinfo);
>  	headlocked = 0;
>  	if (tp->t_rxtshift == 1) {
>  		/*

Recent survey results for tcp_timer_race leave me a bit worried about changes
that open up greater potential for races in the TCP timer code, and this one
does worry me.  When tcp_timer_race occurs, holding tcbinfo continuously
across the timer is what mitigates the race.  With dozens of sites reporting
significantly non-zero instances of the bug, I'm worried that this change
could convert "silently mitigated with a counter bump" into "the system
panics or similar".
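To be concrete about the INP_CPU() suggestion above, what I have in mind is
roughly the following -- an untested sketch that just reuses the names from
your patch, with curcpu replaced by CPU 0 as the fallback:

/*
 * Fall back to CPU 0, rather than curcpu, when the flowid maps to an
 * absent CPU, so a given inpcb's callouts always land on the same CPU.
 */
#define	INP_CPU(inp)	(per_cpu_timers ?				\
	(!CPU_ABSENT((inp)->inp_flowid % (mp_maxid + 1)) ?		\
	    ((inp)->inp_flowid % (mp_maxid + 1)) : 0) : 0)

That keeps the spreading you're after in the common case, while the odd case
(a flowid that maps to an absent CPU) stays deterministic instead of
following whichever CPU the callout happens to fire on.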
I need to think a bit more about the exact nature of the tcp_timer_race bug,
but it could be that MFC'ing this part of the change before tcp_timer_race is
fixed could have unfortunate stability consequences.  (Since you're working
only with long-lived connections with relatively little turnover, you may not
see this in testing -- however, you can check net.inet.tcp.timer_race to see
whether production systems see it.)  Interestingly, it wasn't just 8-core
systems that appeared in the reports; there were also uniprocessor systems.

On an unrelated note: I think it would be useful to rename 'headlocked' in
this function to 'headwlocked', since it no longer tracks whether tcbinfo is
locked at all, just whether it is write-locked.

Robert N M Watson
Computer Laboratory
University of Cambridge
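P.S. For the 'headwlocked' rename, I mean nothing more than the mechanical
change, e.g. in the function's epilogue -- a sketch only, the exact code
there will differ:

out:
	if (headwlocked)		/* formerly 'headlocked' */
		INP_INFO_WUNLOCK(&V_tcbinfo);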