From owner-freebsd-bugs@freebsd.org Thu Apr 30 16:35:29 2020 Return-Path: Delivered-To: freebsd-bugs@mailman.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.nyi.freebsd.org (Postfix) with ESMTP id B26ED2C2603 for ; Thu, 30 Apr 2020 16:35:29 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from mailman.nyi.freebsd.org (mailman.nyi.freebsd.org [IPv6:2610:1c1:1:606c::50:13]) by mx1.freebsd.org (Postfix) with ESMTP id 49Cgwd4Mzmz3HmY for ; Thu, 30 Apr 2020 16:35:29 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: by mailman.nyi.freebsd.org (Postfix) id 944062C2601; Thu, 30 Apr 2020 16:35:29 +0000 (UTC) Delivered-To: bugs@mailman.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.nyi.freebsd.org (Postfix) with ESMTP id 92E0D2C2600 for ; Thu, 30 Apr 2020 16:35:29 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from mxrelay.nyi.freebsd.org (mxrelay.nyi.freebsd.org [IPv6:2610:1c1:1:606c::19:3]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) server-signature RSA-PSS (4096 bits) client-signature RSA-PSS (4096 bits) client-digest SHA256) (Client CN "mxrelay.nyi.freebsd.org", Issuer "Let's Encrypt Authority X3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 49Cgwd3PkHz3HmX for ; Thu, 30 Apr 2020 16:35:29 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2610:1c1:1:606c::50:1d]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) server-signature RSA-PSS (4096 bits)) (Client did not present a certificate) by mxrelay.nyi.freebsd.org (Postfix) with ESMTPS id 7057E21D1B for ; Thu, 30 Apr 2020 16:35:29 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org ([127.0.1.5]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id 03UGZTnw099255 for ; Thu, 30 Apr 2020 16:35:29 GMT (envelope-from bugzilla-noreply@freebsd.org) Received: (from www@localhost) by kenobi.freebsd.org (8.15.2/8.15.2/Submit) id 03UGZT2h099253 for bugs@FreeBSD.org; Thu, 30 Apr 2020 16:35:29 GMT (envelope-from bugzilla-noreply@freebsd.org) X-Authentication-Warning: kenobi.freebsd.org: www set sender to bugzilla-noreply@freebsd.org using -f From: bugzilla-noreply@freebsd.org To: bugs@FreeBSD.org Subject: [Bug 246044] Random freezes of the OS from TLB shootdown. Date: Thu, 30 Apr 2020 16:35:29 +0000 X-Bugzilla-Reason: AssignedTo X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: 11.3-RELEASE X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Some People X-Bugzilla-Who: cem@freebsd.org X-Bugzilla-Status: New X-Bugzilla-Resolution: X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: bugs@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-bugs@freebsd.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Bug reports List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Apr 2020 16:35:29 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D246044 --- Comment #1 from Conrad Meyer --- It looks like your panic thread is blocked on a spin lock held by another thread: spin lock 0xffffffff822e77c0 (smp rendezvous) held by 0xfffff8007b8da000 (t= id 102076) too long timeout stopping cpus panic: spin lock held too long cpuid =3D 0 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0467c68= 520 vpanic() at vpanic+0x17e/frame 0xfffffe0467c68580 panic() at panic+0x43/frame 0xfffffe0467c685e0 _mtx_lock_indefinite_check() at _mtx_lock_indefinite_check+0x6d/frame 0xfffffe0467c685f0 _mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xd4/frame 0xfffffe0467c68= 660 smp_targeted_tlb_shootdown() at smp_targeted_tlb_shootdown+0x3ae/frame 0xfffffe0467c686d0 pmap_invalidate_range() at pmap_invalidate_range+0x25a/frame 0xfffffe0467c6= 8730 vm_thread_new() at vm_thread_new+0x13d/frame 0xfffffe0467c68870 thread_alloc() at thread_alloc+0x31/frame 0xfffffe0467c68890 thread_create() at thread_create+0x13d/frame 0xfffffe0467c688e0 sys_thr_new() at sys_thr_new+0x81/frame 0xfffffe0467c68980 amd64_syscall() at amd64_syscall+0xa86/frame 0xfffffe0467c68ab0 What is 102076 doing? 102076 Run CPU 10 reader#2 Tracing command collectd pid 2525 tid 102076 td 0xfffff8007b8da000 cpustop_handler() at cpustop_handler+0x2f/frame 0xfffffe03e2a3ddf0 ipi_nmi_handler() at ipi_nmi_handler+0x40/frame 0xfffffe03e2a3de10 trap() at trap+0x39/frame 0xfffffe03e2a3df20 nmi_calltrap() at nmi_calltrap+0x8/frame 0xfffffe03e2a3df20 --- trap 0x13, rip =3D 0xffffffff811a9562, rsp =3D 0xfffffe0468912670, rbp = =3D 0xfffffe04689126d0 --- smp_targeted_tlb_shootdown() at smp_targeted_tlb_shootdown+0x2f2/frame 0xfffffe04689126d0 pmap_invalidate_range() at pmap_invalidate_range+0x25a/frame 0xfffffe046891= 2730 vm_thread_new() at vm_thread_new+0x13d/frame 0xfffffe0468912870 thread_alloc() at thread_alloc+0x31/frame 0xfffffe0468912890 thread_create() at thread_create+0x13d/frame 0xfffffe04689128e0 sys_thr_new() at sys_thr_new+0x81/frame 0xfffffe0468912980 amd64_syscall() at amd64_syscall+0xa86/frame 0xfffffe0468912ab0 Ok, it holds ipi mtx while waiting for all other CPUs to bump pc_smp_tlb_done. Why would a CPU not finish IPI work fast enough? Well, x86 interrupts are maskable (aside from NMI). Are any other CPUs masking interrupts "for a long time?" Outside of: 0xfffff8007b8da000: pid 2525 "reader#2" (102076) 0xfffff8005537c620: pid 137 "python3.7" (100777) CPUs are all idle. So both threads are in pmap_invalidate_range() -> smp_targeted_tlb_shootdow= n(), and python is spinning on smp_ipi_mtx. smp_targeted_tlb_shootdown() does sanity check that interrupts are enabled before even attempting to spin on smp_ipi_mtx. Thread 102076 has tlb generation 26971, which is sort of odd but apparently TLB shootdown does not set the generation of the initiating CPU. What is odd is that our victim (python, 100777) and CPU1's idle thread have what appears to be current TLB generations, but the other *idle* threads do not: cpuid =3D 0 curthread =3D 0xfffff8005537c620: pid 137 "python3.7" tlb gen =3D 26980 cpuid =3D 1 curthread =3D 0xfffff800053f3620: pid 11 "idle: cpu1" tlb gen =3D 26980 cpuid =3D 2 curthread =3D 0xfffff800053f3000: pid 11 "idle: cpu2" tlb gen =3D 26971 ... Maybe just a function of pmap->pm_active, except kernel pmap should never be inactive and these appear to be kernel memory (threads). Ok, here's something interesting. cpu2 idle td is 100005 and cpu3 is 100006: Tracing command idle pid 11 tid 100005 td 0xfffff800053f3000 sched_switch() at sched_switch+0x64a/frame 0xfffffe03e2a8c800 timercb() at timercb+0xad/frame 0xfffffe03e2a8c850 lapic_handle_timer() at lapic_handle_timer+0xb5/frame 0xfffffe03e2a8c890 Xtimerint() at Xtimerint+0xa5/frame 0xfffffe03e2a8c890 --- interrupt, rip =3D 0xffffffff8119a11b, rsp =3D 0xfffffe03e2a8c970, rbp = =3D 0xfffffe03e2a8c970 --- cpu_idle_hlt() at cpu_idle_hlt+0x2b/frame 0xfffffe03e2a8c970 cpu_idle() at cpu_idle+0x94/frame 0xfffffe03e2a8c990 sched_idletd() at sched_idletd+0x476/frame 0xfffffe03e2a8ca70 fork_exit() at fork_exit+0x83/frame 0xfffffe03e2a8cab0 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe03e2a8cab0 --- trap 0, rip =3D 0, rsp =3D 0, rbp =3D 0 --- cpu2 is in a timer interrupt, maybe for a long time?? But what is really weird is cpu3: Tracing command idle pid 11 tid 100006 td 0xfffff80005406620 sched_switch() at sched_switch+0x64a/frame 0xfffffe03e2a91940 ??() at 0x3c571e542/frame 0xfffffe04d13c3c80 Which is just a totally bogus return address. The frequent and unpredictable crashes, as well as cpu3's stack, are consistent with bad RAM. Additionally, your dmidecode indicates the RAM is not ECC-protected. Consider running memtest86 and/or using ECC RAM. --=20 You are receiving this mail because: You are the assignee for the bug.=