Date: Wed, 10 May 2017 15:29:39 -0700 From: Olivier Cinquin <olivier.cinquin@uci.edu> To: freebsd-stable@freebsd.org Subject: Kernel panic in smp rendezvous Message-ID: <EC4BA9D6-A5B2-45D3-8DE9-A98FEE20694F@uci.edu>
next in thread | raw e-mail | index | archive | help
Hi, I'm getting the following kernel panics on recent 11-STABLE (r317883): spin lock 0xffffffff81df43d0 (smp rendezvous) held by 0xfffff8019c7a7000 = (tid 100845) too long timeout stopping cpus panic: spin lock held too long cpuid =3D 12 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame = 0xfffffe3e64caf2a0 vpanic() at vpanic+0x186/frame 0xfffffe3e64caf320 panic() at panic+0x43/frame 0xfffffe3e64caf380 _mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0x311/frame = 0xfffffe3e64caf3f0 smp_targeted_tlb_shootdown() at smp_targeted_tlb_shootdown+0x101/frame = 0xfffffe3e64caf470 smp_masked_invlpg_range() at smp_masked_invlpg_range+0x50/frame = 0xfffffe3e64caf4a0 pmap_invalidate_range() at pmap_invalidate_range+0x196/frame = 0xfffffe3e64caf4e0 vm_thread_stack_dispose() at vm_thread_stack_dispose+0x2f/frame = 0xfffffe3e64caf530 thread_free() at thread_free+0x39/frame 0xfffffe3e64caf550 thread_reap() at thread_reap+0x10e/frame 0xfffffe3e64caf570 proc_reap() at proc_reap+0x6bd/frame 0xfffffe3e64caf5b0 proc_to_reap() at proc_to_reap+0x48c/frame 0xfffffe3e64caf600 kern_wait6() at kern_wait6+0x49e/frame 0xfffffe3e64caf6b0 sys_wait4() at sys_wait4+0x78/frame 0xfffffe3e64caf8a0 amd64_syscall() at amd64_syscall+0x6c4/frame 0xfffffe3e64cafa30 Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe3e64cafa30 The panics occur sporadically and seemingly randomly. They occur on multiple machines with the same configuration, and on older revisions of 11-STABLE as well as r317883 (I cannot easily bisect this down to a specific commit given that it can take days or sometimes weeks for the panic to occur). Note that my kernel is compiled with "options IPSEC", = but that does not seem to be relevant. My understanding of the overall problem is that an IPI is sent to all = cores by smp_rendezvous_cpus, but that some of the cores do not respond to it = (or at least do not respond to it correctly and in time). More specifically, = I can find 61 threads that seem to be blocked in cpustop_handler on an = atomic operation to indicate that they have stopped in response to the IPI. = This operation is CPU_SET_ATOMIC(cpu, &stopped_cpus); CPU_SET_ATOMIC boils = down to a call to "atomic_set_long" (not sure where that is itself defined = for amd64). Either there is a line numbering problem, or this suggests that perhaps there is a deadlock at this step (unless there's some sort of livelock and CPU_SET_ATOMIC is the place where the threads spend most of their time). Looking at the thread backtraces as a whole, I can see the one that triggered the panic, the 61 that are in cpustop_handler, and a lot of sleeping threads. I guess each core should have an active thread, and = this is an architecture with 64 cores, so that leaves 64 - (1 + 61) =3D 2 = cores that are unaccounted for. Interestingly, for 2 different panics for = which I have a vmcore file these numbers are the same. For these two panics, the IDs of the cores that are neither in cpustop_handler nor calling the smp rendez vous are not the same (#18 and #19 in one instance, #44 and #45 = in the other instance) but in both instances the IDs are consecutive; = perhaps that's relevant. I've uploaded a full set of backtraces at https://gist.github.com/cinquin/d63cdf9de01b0b8033c47128868f2d38 and a dmesg at = https://gist.github.com/cinquin/17c0cf6ac7832fd1b683488d08d3e69b (in short, the machine is a 4 processor Opteron 6274 system). I'm = pasting below an example backtrace of the threads in stopcpu_handler. Any suggestions as to what the problem might be and how to solve it? = I've saved the core and can provide further information. Thanks Olivier Cinquin A total of 61 threads are along the lines of #0 cpustop_handler () at /usr/src/sys/x86/x86/mp_x86.c:1275 #1 0xffffffff8105a6f5 in ipi_nmi_handler () at = /usr/src/sys/x86/x86/mp_x86.c:1231 #2 0xffffffff80ef740a in trap (frame=3D0xfffffe3e687f7f30) at = /usr/src/sys/amd64/amd64/trap.c:185 #3 0xffffffff80edbd03 in nmi_calltrap () at = /usr/src/sys/amd64/amd64/exception.S:510 ... (kgdb) frame 0 #0 cpustop_handler () at /usr/src/sys/x86/x86/mp_x86.c:1275 1275 CPU_SET_ATOMIC(cpu, &stopped_cpus); (kgdb) list 1270 cpu =3D PCPU_GET(cpuid); 1271=09 1272 savectx(&stoppcbs[cpu]); 1273=09 1274 /* Indicate that we are stopped */ 1275 CPU_SET_ATOMIC(cpu, &stopped_cpus); 1276=09 1277 /* Wait for restart */ 1278 while (!CPU_ISSET(cpu, &started_cpus)) 1279 ia32_pause(); (kgdb) p stopped_cpus $1 =3D {__bits =3D 0xffffffff81df4368} (kgdb) p started_cpus $2 =3D {__bits =3D 0xffffffff81df4418}
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?EC4BA9D6-A5B2-45D3-8DE9-A98FEE20694F>