Date: Thu, 30 Apr 2020 16:35:29 +0000 From: bugzilla-noreply@freebsd.org To: bugs@FreeBSD.org Subject: [Bug 246044] Random freezes of the OS from TLB shootdown. Message-ID: <bug-246044-227-r3aG5KNMKe@https.bugs.freebsd.org/bugzilla/> In-Reply-To: <bug-246044-227@https.bugs.freebsd.org/bugzilla/> References: <bug-246044-227@https.bugs.freebsd.org/bugzilla/>
index | next in thread | previous in thread | raw e-mail
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=246044 --- Comment #1 from Conrad Meyer <cem@freebsd.org> --- It looks like your panic thread is blocked on a spin lock held by another thread: spin lock 0xffffffff822e77c0 (smp rendezvous) held by 0xfffff8007b8da000 (tid 102076) too long timeout stopping cpus panic: spin lock held too long cpuid = 0 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0467c68520 vpanic() at vpanic+0x17e/frame 0xfffffe0467c68580 panic() at panic+0x43/frame 0xfffffe0467c685e0 _mtx_lock_indefinite_check() at _mtx_lock_indefinite_check+0x6d/frame 0xfffffe0467c685f0 _mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xd4/frame 0xfffffe0467c68660 smp_targeted_tlb_shootdown() at smp_targeted_tlb_shootdown+0x3ae/frame 0xfffffe0467c686d0 pmap_invalidate_range() at pmap_invalidate_range+0x25a/frame 0xfffffe0467c68730 vm_thread_new() at vm_thread_new+0x13d/frame 0xfffffe0467c68870 thread_alloc() at thread_alloc+0x31/frame 0xfffffe0467c68890 thread_create() at thread_create+0x13d/frame 0xfffffe0467c688e0 sys_thr_new() at sys_thr_new+0x81/frame 0xfffffe0467c68980 amd64_syscall() at amd64_syscall+0xa86/frame 0xfffffe0467c68ab0 What is 102076 doing? 102076 Run CPU 10 reader#2 Tracing command collectd pid 2525 tid 102076 td 0xfffff8007b8da000 cpustop_handler() at cpustop_handler+0x2f/frame 0xfffffe03e2a3ddf0 ipi_nmi_handler() at ipi_nmi_handler+0x40/frame 0xfffffe03e2a3de10 trap() at trap+0x39/frame 0xfffffe03e2a3df20 nmi_calltrap() at nmi_calltrap+0x8/frame 0xfffffe03e2a3df20 --- trap 0x13, rip = 0xffffffff811a9562, rsp = 0xfffffe0468912670, rbp = 0xfffffe04689126d0 --- smp_targeted_tlb_shootdown() at smp_targeted_tlb_shootdown+0x2f2/frame 0xfffffe04689126d0 pmap_invalidate_range() at pmap_invalidate_range+0x25a/frame 0xfffffe0468912730 vm_thread_new() at vm_thread_new+0x13d/frame 0xfffffe0468912870 thread_alloc() at thread_alloc+0x31/frame 0xfffffe0468912890 thread_create() at thread_create+0x13d/frame 0xfffffe04689128e0 sys_thr_new() at sys_thr_new+0x81/frame 0xfffffe0468912980 amd64_syscall() at amd64_syscall+0xa86/frame 0xfffffe0468912ab0 Ok, it holds ipi mtx while waiting for all other CPUs to bump pc_smp_tlb_done. Why would a CPU not finish IPI work fast enough? Well, x86 interrupts are maskable (aside from NMI). Are any other CPUs masking interrupts "for a long time?" Outside of: 0xfffff8007b8da000: pid 2525 "reader#2" (102076) 0xfffff8005537c620: pid 137 "python3.7" (100777) CPUs are all idle. So both threads are in pmap_invalidate_range() -> smp_targeted_tlb_shootdown(), and python is spinning on smp_ipi_mtx. smp_targeted_tlb_shootdown() does sanity check that interrupts are enabled before even attempting to spin on smp_ipi_mtx. Thread 102076 has tlb generation 26971, which is sort of odd but apparently TLB shootdown does not set the generation of the initiating CPU. What is odd is that our victim (python, 100777) and CPU1's idle thread have what appears to be current TLB generations, but the other *idle* threads do not: cpuid = 0 curthread = 0xfffff8005537c620: pid 137 "python3.7" tlb gen = 26980 cpuid = 1 curthread = 0xfffff800053f3620: pid 11 "idle: cpu1" tlb gen = 26980 cpuid = 2 curthread = 0xfffff800053f3000: pid 11 "idle: cpu2" tlb gen = 26971 ... Maybe just a function of pmap->pm_active, except kernel pmap should never be inactive and these appear to be kernel memory (threads). Ok, here's something interesting. cpu2 idle td is 100005 and cpu3 is 100006: Tracing command idle pid 11 tid 100005 td 0xfffff800053f3000 sched_switch() at sched_switch+0x64a/frame 0xfffffe03e2a8c800 timercb() at timercb+0xad/frame 0xfffffe03e2a8c850 lapic_handle_timer() at lapic_handle_timer+0xb5/frame 0xfffffe03e2a8c890 Xtimerint() at Xtimerint+0xa5/frame 0xfffffe03e2a8c890 --- interrupt, rip = 0xffffffff8119a11b, rsp = 0xfffffe03e2a8c970, rbp = 0xfffffe03e2a8c970 --- cpu_idle_hlt() at cpu_idle_hlt+0x2b/frame 0xfffffe03e2a8c970 cpu_idle() at cpu_idle+0x94/frame 0xfffffe03e2a8c990 sched_idletd() at sched_idletd+0x476/frame 0xfffffe03e2a8ca70 fork_exit() at fork_exit+0x83/frame 0xfffffe03e2a8cab0 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe03e2a8cab0 --- trap 0, rip = 0, rsp = 0, rbp = 0 --- cpu2 is in a timer interrupt, maybe for a long time?? But what is really weird is cpu3: Tracing command idle pid 11 tid 100006 td 0xfffff80005406620 sched_switch() at sched_switch+0x64a/frame 0xfffffe03e2a91940 ??() at 0x3c571e542/frame 0xfffffe04d13c3c80 Which is just a totally bogus return address. The frequent and unpredictable crashes, as well as cpu3's stack, are consistent with bad RAM. Additionally, your dmidecode indicates the RAM is not ECC-protected. Consider running memtest86 and/or using ECC RAM. -- You are receiving this mail because: You are the assignee for the bug.help
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?bug-246044-227-r3aG5KNMKe>
