Date: Fri, 10 Oct 2014 11:53:30 -0400 From: John Baldwin <jhb@freebsd.org> To: Jason Wolfe <nitroboost@gmail.com> Cc: Sean Bruno <sbruno@llnw.com>, freebsd-net@freebsd.org Subject: Re: ixgbe(4) spin lock held too long Message-ID: <3567780.Mf6fMnzmYG@ralph.baldwin.cx> In-Reply-To: <CAAAm0r1N=F6wgroVJZx3CR-bzti45=VA==dv1VaVhx1hSufUEQ@mail.gmail.com> References: <1410203348.1343.1.camel@bruno> <2077446.sYcZo9xEXb@ralph.baldwin.cx> <CAAAm0r1N=F6wgroVJZx3CR-bzti45=VA==dv1VaVhx1hSufUEQ@mail.gmail.com>
next in thread | previous in thread | raw e-mail | index | archive | help
On Thursday, October 09, 2014 02:31:32 PM Jason Wolfe wrote: > On Wed, Oct 8, 2014 at 12:29 PM, John Baldwin <jhb@freebsd.org> wrote: > > My only other thought is if a direct timeout routine ran for a long time. > > > > I just committed a change to current that can let you capture KTR traces > > of > > callout routines for use with schedgraph (r272757). Unfortunately, > > enabling KTR_SCHED can be slow. If you are up for trying it, I do think > > that > > building a kernel with KTR and KTR_SCHED enabled (KTR_COMPILE=KTR_SCHED > > and > > KTR_MASK=KTR_SCHED) with the kernel part of the commit I referenced above > > applied is the best bet to figure out why it is spinning so long. If you > > can > > try that (and if it doesn't slow things down too much) and get a panic > > with > > those options enabled, then capture the output of > > 'ktrdump -e /path/to/kernel -m /var/crash/vmcore.X -ct', we can use > > Src/tools/sched/schedgraph.py to look at that output to get a picture of > > what > > was going on just prior to the crash. > > > > -- > > John Baldwin > > As luck would have it.. caught one of the boxes with the new KTR > code/options after only an hour. Crashed in the same way w tid 100003 and > looking the same in callout_process > > spin lock 0xffffffff81262d00 (callout) held by 0xfffff800151fe000 (tid > 100003) too long > spin lock 0xffffffff81262d00 (callout) held by 0xfffff800151fe000 (tid > 100003) too long > > #4 0xffffffff8070d6fa in callout_process (now=7915682202423) at > /usr/src/sys/kern/kern_ > timeout.c:490 > > The ktrdump oddly only seems to have the last 702, though debug.ktr.entries > is set to 1024. It appears the buffer may also start after 100003 had > already hung? I've bumped debug.ktr.entries up in case we don't have > enough history here. > > http://nitrology.com/ktrdump-spinlock.txt Hmm, schedgraph.py chokes on this, but it's a bit interesting. It seems that in this time sample, CPUs 1, 2, 4, and 5 were constantly running ixgbe interrupt handlers. No actual thread state changes are logged (which is why schedgraph choked). Try setting the sysctl 'net.inet.tcp.per_cpu_timers' to 1 and see if that makes a difference. I'm guessing they are all contending on the default callout lock which is causing your headache. -- John Baldwin
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?3567780.Mf6fMnzmYG>