Date:      Wed, 08 Oct 2014 15:29:32 -0400
From:      John Baldwin <jhb@freebsd.org>
To:        Jason Wolfe <nitroboost@gmail.com>
Cc:        Sean Bruno <sbruno@llnw.com>, freebsd-net@freebsd.org, Eric van Gyzen <eric@vangyzen.net>, davide@freebsd.org
Subject:   Re: ixgbe(4) spin lock held too long
Message-ID:  <2077446.sYcZo9xEXb@ralph.baldwin.cx>
In-Reply-To: <CAAAm0r0=7G=MgKEpXGDLqYE5+K9gfV3PLWJ=hLfvf2KWvPtd7w@mail.gmail.com>
References:  <1410203348.1343.1.camel@bruno> <201410071428.15753.jhb@freebsd.org> <CAAAm0r0=7G=MgKEpXGDLqYE5+K9gfV3PLWJ=hLfvf2KWvPtd7w@mail.gmail.com>

On Wednesday, October 08, 2014 10:56:56 AM Jason Wolfe wrote:
> On Tue, Oct 7, 2014 at 11:28 AM, John Baldwin <jhb@freebsd.org> wrote:
> > On Tuesday, October 07, 2014 2:06:42 pm Jason Wolfe wrote:
> > > Hey John,
> > > 
> > > Happy to do this, but the pool of boxes is about 500 large, which is the
> > > reason I'm able to see a crash every day or so.  I've pulled a portion
> > > of them out of service to switch them over to a kernel with WITNESS options;
> > > might that assist in the same manner if we catch a crash there?  If you're
> > > thinking the KTR traces are required to diagnose this I'll get started on
> > > that though.  I've also cut 10.1-RC1 with some ixgbe fixes for future use,
> > > but noticed kern_timeout.c has been static for the past 8 months. 
> > > Thanks again for your help.
> > 
> > I don't think WITNESS will really help in this case.  ixgbe just happens to
> > be trying to schedule a timer and running into a problem that the timer is
> > spinning on another CPU.  While the kern_timeout code hasn't changed in a
> > while, it may be an edge case provoked by another bug elsewhere.  It's also
> > possible I suppose that a lot of callouts are scheduled for some reason.
> > 
> > Hmm, maybe we can still figure this out from your dumps.  We can still get to
> > '*cc' via an alternate route.  First, IIRC, the thread is an idle thread,
> > so its name should indicate which CPU it is for:
> > 
> > % sudo procstat -at  | grep 100003
> > 
> >    11 100003 idle             idle: cpu0         0  255 run     -
> > 
> > (Hmm, I bet that is always CPU 0).  That means we can find 'cc' by doing
> > 'p cc_cpu[0]' in kgdb.  Can you do that from the last core dump along with
> > 'p last' in frame 4?
> > 
> > Grrr.  Looks like cc->cc_lastscan is already overwritten by the time you've
> > crashed, so we can't work back to the original value of 'firstb'.
> > 
> > However, there is a sysctl you can invoke to get callout stats.  Maybe run
> > 'sysctl kern.callout_stats=1' on one of your existing servers and then
> > check dmesg.  On an idle machine sitting here running stable/10 I have 86
> > callout scheduled currently:
> 
> Correct, 100003 is always CPU0, on our end at least.  Jackpot on the
> cc_cpu[0] also; here are cores across 3 different 'spin lock callout held
> by tid 100003 too long' machines.  Callouts on an idle machine are in line
> with yours at around 90, and a somewhat idle box pushing ~500Mb/s is in the
> 3000-4000 range.

Unfortunately, the thing I wanted out of cc_cpu[0] was the previous value of
cc_lastscan, so I could work back to what 'firstb' was and see if the loop gets
stuck forever for some reason.  However, cc_lastscan has already been
overwritten by the time this cores.  Still, perhaps we can tell something:

> (kgdb) tid 100003
> [Switching to thread 40 (Thread 100003)]#0  0xffffffff80ac39b8 in
> cpustop_handler ()
>     at /usr/src/sys/amd64/amd64/mp_machdep.c:1432
> 1432            savectx(&stoppcbs[cpu]);
> (kgdb) up 4
> #4  0xffffffff80733f79 in callout_process (now=3549882673091690) at
> /usr/src/sys/kern/kern_timeout.c:443
> 443                     tmp = LIST_FIRST(sc);
> (kgdb) list
> 438             }
> 439
> 440             /* Iterate callwheel from firstb to nowb and then up to
> lastb. */
> 441             do {
> 442                     sc = &cc->cc_callwheel[firstb & callwheelmask];
> 443                     tmp = LIST_FIRST(sc);
> 444                     while (tmp != NULL) {
> 445                             /* Run the callout if present time within
> allowed. */
> 446                             if (tmp->c_time <= now) {
> 447                                     /*
> (kgdb) p cc_cpu[0]
> $1 = {cc_lock = {lock_object = {lo_name = 0xffffffff80ccbec7 "callout",
> lo_flags = 720896, lo_data = 0, lo_witness = 0x0},
>     mtx_lock = 4}, cc_exec_entity = {{cc_next = 0x0, cc_curr = 0x0,
> ce_migration_func = 0, ce_migration_arg = 0x0,
>       ce_migration_cpu = 64, ce_migration_time = 0, ce_migration_prec = 0,
> cc_cancel = false, cc_waiting = false}, {
>       cc_next = 0xffffffff812a7af0, cc_curr = 0x0, ce_migration_func = 0,
> ce_migration_arg = 0x0, ce_migration_cpu = 64,
>       ce_migration_time = 0, ce_migration_prec = 0, cc_cancel = false,
> cc_waiting = false}},
>   cc_callout = 0xfffffe000069f000, cc_callwheel = 0xfffffe00007c1000,
> cc_expireq = {tqh_first = 0x0,
>     tqh_last = 0xffffffff812a0510}, cc_callfree = {slh_first =
> 0xfffffe00007c02c0}, cc_firstevent = 3549882673088716,
>   cc_lastscan = 3549882673091690, cc_cookie = 0xfffff800151f4b80, cc_bucket
> = 6515}
> (kgdb) p last
> $2 = 3549882929381376

So the bucket for 'cc_lastscan' (which is the 'now' passed to this call) is only
16 buckets behind the bucket for 'last'.  That doesn't seem like a lot of work
to be done:

(gdb) p (3549882673091690 >> (32 - 8))
$4 = 211589495
(gdb) p (3549882929381376 >> (32 - 8))
$5 = 211589511
(gdb) p $5 - $4
$6 = 16
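
For reference, that shift is just the callwheel bucket hash from
kern_timeout.c.  Paraphrasing from memory, so treat this as a sketch rather
than an exact quote:

#define	CC_HASH_SHIFT	8

static inline u_int
callout_hash(sbintime_t time)
{

	/*
	 * sbintime_t is 32.32 fixed point, so dropping the low 24 bits
	 * gives buckets of 2^24 units, i.e. 1/256 of a second each.
	 */
	return (time >> (32 - CC_HASH_SHIFT));
}

So 16 buckets is only something like 62ms worth of wheel to walk, which again
does not look like enough work on its own to hold a spin lock too long.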

OTOH, we know that cc_lastscan, when this function started, was less than
cc_firstevent.  However, cc_firstevent falls in the same bucket as 'cc_lastscan':

(gdb) p (3549882673088716 >> (32 - 8))
$7 = 211589495
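
To spell out why the overwritten cc_lastscan is a problem: the scan bounds in
callout_process() come straight from it.  Roughly, paraphrasing the stable/10
code rather than quoting it exactly:

	/* Compute the buckets of the last scan and present times. */
	firstb = callout_hash(cc->cc_lastscan);
	cc->cc_lastscan = now;	/* clobbered here, so the core only shows 'now' */
	nowb = callout_hash(now);
	...
	/* Iterate callwheel from firstb to nowb and then up to lastb. */
	do {
		sc = &cc->cc_callwheel[firstb & callwheelmask];
		...
	} while (...);

Once cc_lastscan has been updated, we can only bound the amount of work by
comparing bucket numbers, which is what the arithmetic above is doing.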

My only other thought is that a direct timeout routine might have run for a
long time.
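
By "direct" I mean a callout scheduled with the C_DIRECT_EXEC flag, which is
executed from the timer interrupt inside callout_process() instead of being
handed off to softclock().  A made-up example of how a driver arms one (the
'tick_callout'/'my_tick' names and the 10ms period are purely illustrative):

	callout_reset_sbt(&sc->tick_callout, 10 * SBT_1MS, 0,
	    my_tick, sc, C_DIRECT_EXEC);

A handler like that runs in interrupt context on the CPU that owns the
callwheel, so if it spins for a long time, that time is spent inside
callout_process().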

I just committed a change to current (r272757) that lets you capture KTR traces
of callout routines for use with schedgraph.  Unfortunately, enabling KTR_SCHED
can be slow.  Still, if you are up for trying it, I think building a kernel with
KTR and KTR_SCHED enabled (KTR_COMPILE=KTR_SCHED and KTR_MASK=KTR_SCHED), with
the kernel part of that commit applied, is the best bet to figure out why it is
spinning so long.  If you can run that (and it doesn't slow things down too
much) and catch a panic with those options enabled, then capture the output of
'ktrdump -e /path/to/kernel -m /var/crash/vmcore.X -ct' and we can feed it to
src/tools/sched/schedgraph.py to get a picture of what was going on just prior
to the crash.
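
For concreteness, the config and extraction I have in mind look roughly like
this (the KTR_ENTRIES value and the paths are just examples):

	options 	KTR
	options 	KTR_ENTRIES=262144
	options 	KTR_COMPILE=(KTR_SCHED)
	options 	KTR_MASK=(KTR_SCHED)

and then, after a panic:

	# ktrdump -e /path/to/kernel -m /var/crash/vmcore.X -ct > ktr.out
	# python src/tools/sched/schedgraph.py ktr.out

IIRC schedgraph needs the CPU and timestamp columns, hence the -ct flags.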

-- 
John Baldwin


