Date: Wed, 20 Aug 2025 14:30:20 +0200
From: Kristof Provost <kp@FreeBSD.org>
To: FreeBSD Net <freebsd-net@freebsd.org>
Subject: rtentry_free panic
Message-ID: <163785B5-236A-4C19-8475-66E9E8912DFA@FreeBSD.org>
Hi,

Running the pf tests I very occasionally (say 1 out of 10 runs) see panics freeing an rtentry.
This mostly manifests during bricoler test runs, and usually with the KMSAN kernel config. I assume that’s because there’s a timing factor involved rather than it being an issue that’s directly detected by KMSAN/KASAN.

Here’s the panic:

    Freed UMA keg (rtentry) was not empty (2 items). Lost 1 pages of memory.

    Fatal trap 12: page fault while in kernel mode
    cpuid = 3; apic id = 03
    fault virtual address = 0x2
    fault code = supervisor read data, page not present
    instruction pointer = 0x20:0xffffffff81896d53
    stack pointer = 0x28:0xfffffe0098468b20
    frame pointer = 0x28:0xfffffe0098468bb0
    code segment = base 0x0, limit 0xfffff, type 0x1b
                 = DPL 0, pres 1, long 1, def32 0, gran 1
    processor eflags = interrupt enabled, resume, IOPL = 0
    current process = 0 (softirq_3)
    rdi: 0000000000000000 rsi: fffffe00e08b67e0 rdx: 0000000000000000
    rcx: fffffe000c5f08d8 r8: 0000000000000000 r9: 0000000000000001
    rax: fffffe0000000000 rbx: 0000000000000000 rbp: fffffe0098468bb0
    r10: 0000000000000001 r11: 0000000000000005 r12: 0000000000000000
    r13: fffffe0155c46920 r14: 0000000000000000 r15: fffffe00e08b67e0
    trap number = 12
    panic: page fault
    cpuid = 3
    time = 1754664399
    KDB: stack backtrace:
    db_trace_self_wrapper() at db_trace_self_wrapper+0xa5/frame 0xfffffe0098468390
    kdb_backtrace() at kdb_backtrace+0xc6/frame 0xfffffe00984684f0
    vpanic() at vpanic+0x214/frame 0xfffffe0098468690
    panic() at panic+0xb5/frame 0xfffffe0098468750
    trap_pfault() at trap_pfault+0x7e4/frame 0xfffffe0098468870
    trap() at trap+0x765/frame 0xfffffe0098468a50
    calltrap() at calltrap+0x8/frame 0xfffffe0098468a50
    --- trap 0xc, rip = 0xffffffff81896d53, rsp = 0xfffffe0098468b20, rbp = 0xfffffe0098468bb0 ---
    uma_zfree_arg() at uma_zfree_arg+0x23/frame 0xfffffe0098468bb0
    destroy_rtentry_epoch() at destroy_rtentry_epoch+0x17a/frame 0xfffffe0098468c70
    epoch_call_task() at epoch_call_task+0x26d/frame 0xfffffe0098468d50
    gtaskqueue_run_locked() at gtaskqueue_run_locked+0x366/frame 0xfffffe0098468eb0
    gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0x138/frame 0xfffffe0098468ef0
    fork_exit() at fork_exit+0xa3/frame 0xfffffe0098468f30
    fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0098468f30
    --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
    KDB: enter: panic
    [ thread pid 0 tid 100010 ]
    Stopped at kdb_enter+0x34: movq $0,0x20d7651(%rip)

We’re panicking because the V_rtzone zone has been cleaned up (in vnet_rtzone_destroy()). I explicitly NULL out V_rtzone too, to make this more obvious.
Note that we failed to completely free all rtentries (`Freed UMA keg (rtentry) was not empty (2 items). Lost 1 pages of memory.`). Presumably at least one of those two gets freed later, and that’s the panic we see.

rt_free() queues the actual delete as an epoch callback (`NET_EPOCH_CALL(destroy_rtentry_epoch, &rt->rt_epoch_ctx);`), and that’s what we see here: the zone is removed before we’re done freeing all of the rtentries.

vnet_rtzone_destroy() is called from rtables_destroy(), but that explicitly calls NET_EPOCH_DRAIN_CALLBACKS() first, so I’d expect all of the pending cleanups to have been done at that point. The comment block above does suggest that there may still be nexthop entries pending deletion even after we drain the callbacks. I think I can see how that’d happen for nexthops, but I do not see how it can happen for rtentries.

Has anyone else seen this panic or have any ideas what I’m missing?
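For illustration only, the ordering hazard described above can be modelled in userspace C. This is a toy sketch, not kernel code and not a diagnosis: every `toy_` name is hypothetical, the list stands in for the queued epoch callbacks, and the late free after the drain is an assumption about one way the symptom could arise, not something the trace proves.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins only; none of these are the kernel's real types or functions. */
struct toy_zone { int live_items; };
struct toy_rtentry { struct toy_rtentry *next; };

static struct toy_zone zone_storage;
static struct toy_zone *toy_rtzone;       /* plays the role of V_rtzone */
static struct toy_rtentry *pending_head;  /* plays the role of queued epoch callbacks */
static int late_frees;                    /* callbacks that ran after zone teardown */

/* Like rt_free(): defer the real free by queueing a callback. */
static void toy_rt_free(struct toy_rtentry *rt)
{
    rt->next = pending_head;
    pending_head = rt;
}

/* Like destroy_rtentry_epoch(): hand the item back to the zone. */
static void toy_destroy_rtentry(struct toy_rtentry *rt)
{
    (void)rt;
    if (toy_rtzone == NULL) {
        late_frees++;   /* the real kernel would page-fault here */
        return;
    }
    toy_rtzone->live_items--;
}

/* Like a callback drain: runs only what has been queued so far. */
static void toy_drain_callbacks(void)
{
    while (pending_head != NULL) {
        struct toy_rtentry *rt = pending_head;
        pending_head = rt->next;
        toy_destroy_rtentry(rt);
    }
}

/* Like vnet_rtzone_destroy(): returns the number of items still allocated. */
static int toy_zone_destroy(void)
{
    assert(toy_rtzone != NULL);
    int leaked = toy_rtzone->live_items;
    toy_rtzone = NULL;
    return leaked;
}

/* Returns how many deferred frees ran after the zone was destroyed. */
int toy_simulate(int free_after_drain)
{
    struct toy_rtentry a, b;

    zone_storage.live_items = 2;
    toy_rtzone = &zone_storage;
    pending_head = NULL;
    late_frees = 0;

    toy_rt_free(&a);
    if (!free_after_drain)
        toy_rt_free(&b);
    toy_drain_callbacks();
    if (free_after_drain)
        toy_rt_free(&b);      /* hypothetical free queued after the drain */
    (void)toy_zone_destroy();
    toy_drain_callbacks();    /* whatever is still queued fires after teardown */
    return late_frees;
}
```

Calling `toy_simulate(0)` models the expected ordering, where both frees are queued before the drain and nothing touches the destroyed zone. `toy_simulate(1)` models a free that is queued after the drain: the zone is torn down with one item outstanding and the deferred callback then runs against a NULL zone.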
Thanks,
Kristof