Date: Mon, 7 Mar 2022 21:54:26 +0100 (CET) From: Ronald Klop <ronald-lists@klop.ws> To: Mark Johnston <markj@freebsd.org> Cc: bob prohaska <fbsd@www.zefox.net>, Mark Millard <marklmi@yahoo.com>, freebsd-arm@freebsd.org, freebsd-current <freebsd-current@freebsd.org> Subject: Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28)) Message-ID: <1302689164.173.1646686466515@mailrelay> In-Reply-To: <YiYhIQXl1sd4cOVS@nuc> References: <C2F96211-0180-45DA-872F-52358D9ED35B.ref@yahoo.com> <C2F96211-0180-45DA-872F-52358D9ED35B@yahoo.com> <1800459695.1.1646649539521@mailrelay> <132978150.92.1646660769467@mailrelay> <YiYhIQXl1sd4cOVS@nuc>
next in thread | previous in thread | raw e-mail | index | archive | help
------=_Part_172_1254189170.1646686466401 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Van: Mark Johnston <markj@freebsd.org> Datum: maandag, 7 maart 2022 16:13 Aan: Ronald Klop <ronald-lists@klop.ws> CC: bob prohaska <fbsd@www.zefox.net>, Mark Millard <marklmi@yahoo.com>, freebsd-arm@freebsd.org, freebsd-current <freebsd-current@freebsd.org> Onderwerp: Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28)) > > On Mon, Mar 07, 2022 at 02:46:09PM +0100, Ronald Klop wrote: > > Dear Mark Johnston, > > > > I did some binary search in the kernels and came to the conclusion that https://cgit.freebsd.org/src/commit/?id=1517b8d5a7f58897200497811de1b18809c07d3e still works and https://cgit.freebsd.org/src/commit/?id=407c34e735b5d17e2be574808a09e6d729b0a45a panics. > > > > I suspect your commit in https://cgit.freebsd.org/src/commit/?id=c84bb8cd771ce4bed58152e47a32dda470bef23a. > > > > Last panic: > > > > panic: vm_fault failed: ffff00000046e708 error 1 > > cpuid = 1 > > time = 1646660058 > > KDB: stack backtrace: > > db_trace_self() at db_trace_self > > db_trace_self_wrapper() at db_trace_self_wrapper+0x30 > > vpanic() at vpanic+0x174 > > panic() at panic+0x44 > > data_abort() at data_abort+0x2e8 > > handle_el1h_sync() at handle_el1h_sync+0x10 > > --- exception, esr 0x96000004 > > _rm_rlock_debug() at _rm_rlock_debug+0x8c > > osd_get() at osd_get+0x5c > > zio_execute() at zio_execute+0xf8 > > taskqueue_run_locked() at taskqueue_run_locked+0x178 > > taskqueue_thread_loop() at taskqueue_thread_loop+0xc8 > > fork_exit() at fork_exit+0x74 > > fork_trampoline() at fork_trampoline+0x14 > > KDB: enter: panic > > [ thread pid 0 tid 100129 ] > > Stopped at kdb_enter+0x44: undefined f902011f > > db> > > > > A more recent kernel (912df91) still panics. See below. > > > > Do you have time to look into this? What can I provide in information to help? > > Hmm. So after my rmlock commits, we have the following disassembly for > _rm_rlock() (with a few annotations added by me). Note that the pcpu > pointer is stored in register x18 by convention. > > 0xffff00000046e304 <+0>: stp x29, x30, [sp, #-16]! > 0xffff00000046e308 <+4>: mov x29, sp > 0xffff00000046e30c <+8>: ldr x8, [x18] > 0xffff00000046e310 <+12>: ldr x9, [x18] > 0xffff00000046e314 <+16>: ldr x10, [x18] > 0xffff00000046e318 <+20>: cmp x9, x10 > 0xffff00000046e31c <+24>: b.ne 0xffff00000046e3cc <_rm_rlock+200> // b.any > 0xffff00000046e320 <+28>: ldr x9, [x18] > 0xffff00000046e324 <+32>: ldrh w9, [x9, #314] > 0xffff00000046e328 <+36>: cbnz w9, 0xffff00000046e3c0 <_rm_rlock+188> > 0xffff00000046e32c <+40>: str wzr, [x1, #32] > 0xffff00000046e330 <+44>: stp x0, x8, [x1, #16] > 0xffff00000046e334 <+48>: ldrb w9, [x0, #10] > 0xffff00000046e338 <+52>: tbz w9, #4, 0xffff00000046e358 <_rm_rlock+84> > 0xffff00000046e33c <+56>: ldr x9, [x18] > 0xffff00000046e340 <+60>: ldr w10, [x9, #888] > 0xffff00000046e344 <+64>: add w10, w10, #0x1 > 0xffff00000046e348 <+68>: str w10, [x9, #888] > 0xffff00000046e34c <+72>: ldr x9, [x18] > 0xffff00000046e350 <+76>: ldr w9, [x9, #888] > 0xffff00000046e354 <+80>: cbz w9, 0xffff00000046e3f4 <_rm_rlock+240> > 0xffff00000046e358 <+84>: ldr w9, [x8, #1212] > 0xffff00000046e35c <+88>: add x10, x18, #0x90 > 0xffff00000046e360 <+92>: add w9, w9, #0x1 > 0xffff00000046e364 <+96>: str w9, [x8, #1212] <------- critical_enter > 0xffff00000046e368 <+100>: str x10, [x1, #8] > 0xffff00000046e36c <+104>: ldr x9, [x18, #144] > 0xffff00000046e370 <+108>: str x9, [x1] > 0xffff00000046e374 <+112>: str x1, [x9, #8] > 0xffff00000046e378 <+116>: str x1, [x18, #144] > 0xffff00000046e37c <+120>: ldr x9, [x18] > 0xffff00000046e380 <+124>: ldr w10, [x9, #356] > 0xffff00000046e384 <+128>: add w10, w10, #0x1 > 0xffff00000046e388 <+132>: str w10, [x9, #356] > 0xffff00000046e38c <+136>: ldr w9, [x8, #1212] > 0xffff00000046e390 <+140>: sub w9, w9, #0x1 > 0xffff00000046e394 <+144>: str w9, [x8, #1212] <------- critical_exit > 0xffff00000046e398 <+148>: ldrb w8, [x8, #304] > 0xffff00000046e39c <+152>: ldr w9, [x18, #60] <------- loading &pc->pc_cpuid > ... > > A (the?) problem is that the compiler is treating "pc" as an alias > for x18, but the rmlock code assumes that the pcpu pointer is loaded > once, as it dereferences "pc" outside of the critical section. On > arm64, if a context switch occurs between the store at _rm_rlock+144 and > the load at +152, and the thread is migrated to another CPU, then we'll > end up using the wrong CPU ID in the rm->rm_writecpus test. > > I suspect the problem is unique to arm64 as its get_pcpu() > implementation is different from the others in that it doesn't use > volatile-qualified inline assembly. This has been the case since > https://cgit.freebsd.org/src/commit/?id=63c858a04d56529eddbddf85ad04fc8e99e73762 > . > > I haven't been able to reproduce any crashes running poudriere in an > arm64 AWS instance, though. Could you please try the patch below and > confirm whether it fixes your panics? I verified that the apparent > problem described above is gone with the patch. > > diff --git a/sys/kern/kern_rmlock.c b/sys/kern/kern_rmlock.c > index 0cdcfb8fec62..e51c25136ae0 100644 > --- a/sys/kern/kern_rmlock.c > +++ b/sys/kern/kern_rmlock.c > @@ -437,6 +437,7 @@ _rm_rlock(struct rmlock *rm, struct rm_priotracker *tracker, int trylock) > { > struct thread *td = curthread; > struct pcpu *pc; > + int cpuid; > > if (SCHEDULER_STOPPED()) > return (1); > @@ -452,6 +453,7 @@ _rm_rlock(struct rmlock *rm, struct rm_priotracker *tracker, int trylock) > atomic_interrupt_fence(); > > pc = get_pcpu(); > + cpuid = pc->pc_cpuid; > rm_tracker_add(pc, tracker); > sched_pin(); > > @@ -463,7 +465,7 @@ _rm_rlock(struct rmlock *rm, struct rm_priotracker *tracker, int trylock) > * conditional jump. > */ > if (__predict_true(0 == (td->td_owepreempt | > - CPU_ISSET(pc->pc_cpuid, &rm->rm_writecpus)))) > + CPU_ISSET(cpuid, &rm->rm_writecpus)))) > return (1); > > /* We do not have a read token and need to acquire one. */ > > > Hi, This patch paniced again: x0: ffffa00005a31500 x1: ffffa00005a0e000 x2: 2 x3: ffffa00076c4e9a0 x4: 0 x5: e672743c8f9e5 x6: dc89f70500ab1 x7: 14 x8: ffffa00005a31518 x9: 1 x10: ffffa00005a0e000 x11: 0 x12: 0 x13: a x14: 1013e6b85a8ecbe4 x15: 1dce740d11a5 x16: ffff3ea86e2434bf x17: fffffffffffffff2 x18: ffff0000fe661800 (g_ctx + fcf9fa54) x19: ffffa00076c4e9a0 x20: ffff0000fec39000 (g_ctx + fd577254) x21: 2 x22: ffff0000419b6090 (g_ctx + 402f42e4) x23: ffff000000c0b137 (lockstat_enabled + 0) x24: 100 x25: ffff000000c0b000 (version + a0) x26: ffff000000c0b000 (version + a0) x27: ffff000000c0b000 (version + a0) x28: 0 x29: ffff0000fe661800 (g_ctx + fcf9fa54) sp: ffff0000fe661800 lr: ffff00000154ea50 (zio_dva_throttle + 154) elr: ffff00000154ea80 (zio_dva_throttle + 184) spsr: 60000045 far: 2b753286b0b8 panic: Unknown kernel exception 0 esr_el1 2000000 cpuid = 1 time = 1646685857 KDB: stack backtrace: db_trace_self() at db_trace_self db_trace_self_wrapper() at db_trace_self_wrapper+0x30 vpanic() at vpanic+0x174 panic() at panic+0x44 do_el1h_sync() at do_el1h_sync+0x184 handle_el1h_sync() at handle_el1h_sync+0x10 --- exception, esr 0x2000000 zio_dva_throttle() at zio_dva_throttle+0x184 zio_execute() at zio_execute+0x58 KDB: enter: panic [ thread pid 0 tid 100129 ] Stopped at kdb_enter+0x44: undefined f901c11f db> Will try the patch of Andrew next. Compilation might take a while so maybe it wil be tomorrow. Regards, Ronald. ------=_Part_172_1254189170.1646686466401 Content-Type: text/html; charset=us-ascii Content-Transfer-Encoding: 7bit <html><head></head><body> <p><strong>Van:</strong> Mark Johnston <markj@freebsd.org><br /> <strong>Datum:</strong> maandag, 7 maart 2022 16:13<br /> <strong>Aan:</strong> Ronald Klop <ronald-lists@klop.ws><br /> <strong>CC:</strong> bob prohaska <fbsd@www.zefox.net>, Mark Millard <marklmi@yahoo.com>, freebsd-arm@freebsd.org, freebsd-current <freebsd-current@freebsd.org><br /> <strong>Onderwerp:</strong> Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))</p> <blockquote style="padding-right: 0px; padding-left: 5px; margin-left: 5px; border-left: #000000 2px solid; margin-right: 0px"> <div class="MessageRFC822Viewer" id="P"> <div class="TextPlainViewer" id="P.P">On Mon, Mar 07, 2022 at 02:46:09PM +0100, Ronald Klop wrote:<br /> > Dear Mark Johnston,<br /> ><br /> > I did some binary search in the kernels and came to the conclusion that <a href="https://cgit.freebsd.org/src/commit/?id=1517b8d5a7f58897200497811de1b18809c07d3e">https://cgit.freebsd.org/src/commit/?id=1517b8d5a7f58897200497811de1b18809c07d3e</a> still works and <a href="https://cgit.freebsd.org/src/commit/?id=407c34e735b5d17e2be574808a09e6d729b0a45a">https://cgit.freebsd.org/src/commit/?id=407c34e735b5d17e2be574808a09e6d729b0a45a</a> panics.<br /> ><br /> > I suspect your commit in <a href="https://cgit.freebsd.org/src/commit/?id=c84bb8cd771ce4bed58152e47a32dda470bef23a">https://cgit.freebsd.org/src/commit/?id=c84bb8cd771ce4bed58152e47a32dda470bef23a</a>.<br /> ><br /> > Last panic:<br /> ><br /> > panic: vm_fault failed: ffff00000046e708 error 1<br /> > cpuid = 1<br /> > time = 1646660058<br /> > KDB: stack backtrace:<br /> > db_trace_self() at db_trace_self<br /> > db_trace_self_wrapper() at db_trace_self_wrapper+0x30<br /> > vpanic() at vpanic+0x174<br /> > panic() at panic+0x44<br /> > data_abort() at data_abort+0x2e8<br /> > handle_el1h_sync() at handle_el1h_sync+0x10<br /> > --- exception, esr 0x96000004<br /> > _rm_rlock_debug() at _rm_rlock_debug+0x8c<br /> > osd_get() at osd_get+0x5c<br /> > zio_execute() at zio_execute+0xf8<br /> > taskqueue_run_locked() at taskqueue_run_locked+0x178<br /> > taskqueue_thread_loop() at taskqueue_thread_loop+0xc8<br /> > fork_exit() at fork_exit+0x74<br /> > fork_trampoline() at fork_trampoline+0x14<br /> > KDB: enter: panic<br /> > [ thread pid 0 tid 100129 ]<br /> > Stopped at kdb_enter+0x44: undefined f902011f<br /> > db><br /> ><br /> > A more recent kernel (912df91) still panics. See below.<br /> ><br /> > Do you have time to look into this? What can I provide in information to help?<br /> <br /> Hmm. So after my rmlock commits, we have the following disassembly for<br /> _rm_rlock() (with a few annotations added by me). Note that the pcpu<br /> pointer is stored in register x18 by convention.<br /> <br /> 0xffff00000046e304 <+0>: stp x29, x30, [sp, #-16]!<br /> 0xffff00000046e308 <+4>: mov x29, sp<br /> 0xffff00000046e30c <+8>: ldr x8, [x18]<br /> 0xffff00000046e310 <+12>: ldr x9, [x18]<br /> 0xffff00000046e314 <+16>: ldr x10, [x18]<br /> 0xffff00000046e318 <+20>: cmp x9, x10<br /> 0xffff00000046e31c <+24>: b.ne 0xffff00000046e3cc <_rm_rlock+200> // b.any<br /> 0xffff00000046e320 <+28>: ldr x9, [x18]<br /> 0xffff00000046e324 <+32>: ldrh w9, [x9, #314]<br /> 0xffff00000046e328 <+36>: cbnz w9, 0xffff00000046e3c0 <_rm_rlock+188><br /> 0xffff00000046e32c <+40>: str wzr, [x1, #32]<br /> 0xffff00000046e330 <+44>: stp x0, x8, [x1, #16]<br /> 0xffff00000046e334 <+48>: ldrb w9, [x0, #10]<br /> 0xffff00000046e338 <+52>: tbz w9, #4, 0xffff00000046e358 <_rm_rlock+84><br /> 0xffff00000046e33c <+56>: ldr x9, [x18]<br /> 0xffff00000046e340 <+60>: ldr w10, [x9, #888]<br /> 0xffff00000046e344 <+64>: add w10, w10, #0x1<br /> 0xffff00000046e348 <+68>: str w10, [x9, #888]<br /> 0xffff00000046e34c <+72>: ldr x9, [x18]<br /> 0xffff00000046e350 <+76>: ldr w9, [x9, #888]<br /> 0xffff00000046e354 <+80>: cbz w9, 0xffff00000046e3f4 <_rm_rlock+240><br /> 0xffff00000046e358 <+84>: ldr w9, [x8, #1212]<br /> 0xffff00000046e35c <+88>: add x10, x18, #0x90<br /> 0xffff00000046e360 <+92>: add w9, w9, #0x1<br /> 0xffff00000046e364 <+96>: str w9, [x8, #1212] <------- critical_enter<br /> 0xffff00000046e368 <+100>: str x10, [x1, #8]<br /> 0xffff00000046e36c <+104>: ldr x9, [x18, #144]<br /> 0xffff00000046e370 <+108>: str x9, [x1]<br /> 0xffff00000046e374 <+112>: str x1, [x9, #8]<br /> 0xffff00000046e378 <+116>: str x1, [x18, #144]<br /> 0xffff00000046e37c <+120>: ldr x9, [x18]<br /> 0xffff00000046e380 <+124>: ldr w10, [x9, #356]<br /> 0xffff00000046e384 <+128>: add w10, w10, #0x1<br /> 0xffff00000046e388 <+132>: str w10, [x9, #356]<br /> 0xffff00000046e38c <+136>: ldr w9, [x8, #1212]<br /> 0xffff00000046e390 <+140>: sub w9, w9, #0x1<br /> 0xffff00000046e394 <+144>: str w9, [x8, #1212] <------- critical_exit<br /> 0xffff00000046e398 <+148>: ldrb w8, [x8, #304]<br /> 0xffff00000046e39c <+152>: ldr w9, [x18, #60] <------- loading &pc->pc_cpuid<br /> ...<br /> <br /> A (the?) problem is that the compiler is treating "pc" as an alias<br /> for x18, but the rmlock code assumes that the pcpu pointer is loaded<br /> once, as it dereferences "pc" outside of the critical section. On<br /> arm64, if a context switch occurs between the store at _rm_rlock+144 and<br /> the load at +152, and the thread is migrated to another CPU, then we'll<br /> end up using the wrong CPU ID in the rm->rm_writecpus test.<br /> <br /> I suspect the problem is unique to arm64 as its get_pcpu()<br /> implementation is different from the others in that it doesn't use<br /> volatile-qualified inline assembly. This has been the case since<br /> <a href="https://cgit.freebsd.org/src/commit/?id=63c858a04d56529eddbddf85ad04fc8e99e73762">https://cgit.freebsd.org/src/commit/?id=63c858a04d56529eddbddf85ad04fc8e99e73762</a><br /> .<br /> <br /> I haven't been able to reproduce any crashes running poudriere in an<br /> arm64 AWS instance, though. Could you please try the patch below and<br /> confirm whether it fixes your panics? I verified that the apparent<br /> problem described above is gone with the patch.<br /> <br /> diff --git a/sys/kern/kern_rmlock.c b/sys/kern/kern_rmlock.c<br /> index 0cdcfb8fec62..e51c25136ae0 100644<br /> --- a/sys/kern/kern_rmlock.c<br /> +++ b/sys/kern/kern_rmlock.c<br /> @@ -437,6 +437,7 @@ _rm_rlock(struct rmlock *rm, struct rm_priotracker *tracker, int trylock)<br /> {<br /> struct thread *td = curthread;<br /> struct pcpu *pc;<br /> + int cpuid;<br /> <br /> if (SCHEDULER_STOPPED())<br /> return (1);<br /> @@ -452,6 +453,7 @@ _rm_rlock(struct rmlock *rm, struct rm_priotracker *tracker, int trylock)<br /> atomic_interrupt_fence();<br /> <br /> pc = get_pcpu();<br /> + cpuid = pc->pc_cpuid;<br /> rm_tracker_add(pc, tracker);<br /> sched_pin();<br /> <br /> @@ -463,7 +465,7 @@ _rm_rlock(struct rmlock *rm, struct rm_priotracker *tracker, int trylock)<br /> * conditional jump.<br /> */<br /> if (__predict_true(0 == (td->td_owepreempt |<br /> - CPU_ISSET(pc->pc_cpuid, &rm->rm_writecpus))))<br /> + CPU_ISSET(cpuid, &rm->rm_writecpus))))<br /> return (1);<br /> <br /> /* We do not have a read token and need to acquire one. */</div> <hr /></div> </blockquote> <br /> Hi,<br /> <br /> This patch paniced again: <pre> x0: ffffa00005a31500 x1: ffffa00005a0e000 x2: 2 x3: ffffa00076c4e9a0 x4: 0 x5: e672743c8f9e5 x6: dc89f70500ab1 x7: 14 x8: ffffa00005a31518 x9: 1 x10: ffffa00005a0e000 x11: 0 x12: 0 x13: a x14: 1013e6b85a8ecbe4 x15: 1dce740d11a5 x16: ffff3ea86e2434bf x17: fffffffffffffff2 x18: ffff0000fe661800 (g_ctx + fcf9fa54) x19: ffffa00076c4e9a0 x20: ffff0000fec39000 (g_ctx + fd577254) x21: 2 x22: ffff0000419b6090 (g_ctx + 402f42e4) x23: ffff000000c0b137 (lockstat_enabled + 0) x24: 100 x25: ffff000000c0b000 (version + a0) x26: ffff000000c0b000 (version + a0) x27: ffff000000c0b000 (version + a0) x28: 0 x29: ffff0000fe661800 (g_ctx + fcf9fa54) sp: ffff0000fe661800 lr: ffff00000154ea50 (zio_dva_throttle + 154) elr: ffff00000154ea80 (zio_dva_throttle + 184) spsr: 60000045 far: 2b753286b0b8 panic: Unknown kernel exception 0 esr_el1 2000000 cpuid = 1 time = 1646685857 KDB: stack backtrace: db_trace_self() at db_trace_self db_trace_self_wrapper() at db_trace_self_wrapper+0x30 vpanic() at vpanic+0x174 panic() at panic+0x44 do_el1h_sync() at do_el1h_sync+0x184 handle_el1h_sync() at handle_el1h_sync+0x10 --- exception, esr 0x2000000 zio_dva_throttle() at zio_dva_throttle+0x184 zio_execute() at zio_execute+0x58 KDB: enter: panic [ thread pid 0 tid 100129 ] Stopped at kdb_enter+0x44: undefined f901c11f db> </pre> Will try the patch of Andrew next. Compilation might take a while so maybe it wil be tomorrow.<br /> <br /> Regards,<br /> Ronald.<br /> </body></html> ------=_Part_172_1254189170.1646686466401--
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1302689164.173.1646686466515>