Date: Mon, 7 Mar 2022 10:03:51 -0800
From: Mark Millard <marklmi@yahoo.com>
To: Mark Johnston <markj@FreeBSD.org>, FreeBSD-STABLE Mailing List <freebsd-stable@freebsd.org>
Cc: Andrew Turner <andrew@fubar.geek.nz>, Ronald Klop <ronald-lists@klop.ws>, bob prohaska <fbsd@www.zefox.net>, Free BSD <freebsd-arm@freebsd.org>, freebsd-current <freebsd-current@freebsd.org>
Subject: Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))
Message-ID: <F25AAD14-209C-43AA-8496-8396F4C4EB76@yahoo.com>
In-Reply-To: <YiY2jmD97leKev0F@nuc>
References: <C2F96211-0180-45DA-872F-52358D9ED35B.ref@yahoo.com> <C2F96211-0180-45DA-872F-52358D9ED35B@yahoo.com> <1800459695.1.1646649539521@mailrelay> <132978150.92.1646660769467@mailrelay> <YiYhIQXl1sd4cOVS@nuc> <3374E0F8-D712-4ED0-A62B-B6924FC8A5E2@fubar.geek.nz> <YiY2jmD97leKev0F@nuc>

On 2022-Mar-7, at 08:45, Mark Johnston <markj@FreeBSD.org> wrote:

> On Mon, Mar 07, 2022 at 04:25:22PM +0000, Andrew Turner wrote:
>>
>>> On 7 Mar 2022, at 15:13, Mark Johnston <markj@freebsd.org> wrote:
>>> ...
>>> A (the?) problem is that the compiler is treating "pc" as an alias
>>> for x18, but the rmlock code assumes that the pcpu pointer is loaded
>>> once, as it dereferences "pc" outside of the critical section. On
>>> arm64, if a context switch occurs between the store at _rm_rlock+144 and
>>> the load at +152, and the thread is migrated to another CPU, then we'll
>>> end up using the wrong CPU ID in the rm->rm_writecpus test.
>>>
>>> I suspect the problem is unique to arm64 as its get_pcpu()
>>> implementation is different from the others in that it doesn't use
>>> volatile-qualified inline assembly. This has been the case since
>>> https://cgit.freebsd.org/src/commit/?id=63c858a04d56529eddbddf85ad04fc8e99e73762
>>> .
>>>
>>> I haven't been able to reproduce any crashes running poudriere in an
>>> arm64 AWS instance, though. Could you please try the patch below and
>>> confirm whether it fixes your panics? I verified that the apparent
>>> problem described above is gone with the patch.
>>
>> Alternatively (or additionally) we could do something like the following.
>> There are only a few MI users of get_pcpu with the main place being in rm locks.
>>
>> diff --git a/sys/arm64/include/pcpu.h b/sys/arm64/include/pcpu.h
>> index 09f6361c651c..59b890e5c2ea 100644
>> --- a/sys/arm64/include/pcpu.h
>> +++ b/sys/arm64/include/pcpu.h
>> @@ -58,7 +58,14 @@ struct pcpu;
>>
>>  register struct pcpu *pcpup __asm ("x18");
>>
>> -#define get_pcpu() pcpup
>> +static inline struct pcpu *
>> +get_pcpu(void)
>> +{
>> +	struct pcpu *pcpu;
>> +
>> +	__asm __volatile("mov %0, x18" : "=&r"(pcpu));
>> +	return (pcpu);
>> +}
>>
>>  static inline struct thread *
>>  get_curthread(void)
>
> Indeed, I think this is probably the best solution.

Is this just partially reverting:

https://cgit.freebsd.org/src/commit/?id=63c858a04d56

If so, there might need to be comments about why
the updated code is as it will be.

Looks like stable/13 picked up sensitivity to the
get_pcpu details in rmlock in:

https://cgit.freebsd.org/src/commit/?h=stable/13&id=543157870da5

(a 2022-03-04 commit) and stable/13 also has the
get_pcpu misdefinition in:

https://cgit.freebsd.org/src/commit/sys/arm64/include/pcpu.h?h=stable/13&id=63c858a04d56

. So an MFC would be appropriate in order
for aarch64 to be reliable for any variations in
get_pcpu in stable/13 (and for 13.1 to be so as well).

===
Mark Millard
marklmi at yahoo.com
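
The hazard discussed in this thread can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions, not kernel code: `sim_pcpu`, `sim_x18`, `sim_migrate`, `get_pcpu_alias` and `get_pcpu_snapshot` are all hypothetical names invented for the illustration. A volatile global stands in for the x18 register, and an explicit call stands in for the scheduler migrating the thread. The alias form re-reads the register stand-in at every use, which models the behavior described for the macro definition; the snapshot form copies it once, which is the effect the proposed inline-assembly `get_pcpu()` is intended to have.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's per-CPU structure. */
struct sim_pcpu {
	int cpuid;
};

static struct sim_pcpu sim_cpus[2] = { { 0 }, { 1 } };

/* Plays the role of x18: always points at the "current" CPU's data. */
static struct sim_pcpu *volatile sim_x18 = &sim_cpus[0];

/* Alias style: every use is a fresh read of the register stand-in. */
#define get_pcpu_alias()	(sim_x18)

/* Snapshot style: copy the register stand-in into a local exactly once. */
static inline struct sim_pcpu *
get_pcpu_snapshot(void)
{
	struct sim_pcpu *pc = sim_x18;

	return (pc);
}

/* Simulates the scheduler moving the thread to another CPU. */
static void
sim_migrate(int cpu)
{
	sim_x18 = &sim_cpus[cpu];
}

/* CPU ID observed after a migration when "pc" is only an alias. */
static int
cpuid_after_migration_alias(void)
{
	int before, after;

	sim_migrate(0);
	before = get_pcpu_alias()->cpuid;	/* reads CPU 0's data */
	sim_migrate(1);				/* thread moves to CPU 1 */
	after = get_pcpu_alias()->cpuid;	/* silently reads CPU 1's data */
	assert(before == 0);
	return (after);
}

/* CPU ID observed after a migration when the pointer was loaded once. */
static int
cpuid_after_migration_snapshot(void)
{
	struct sim_pcpu *pc;

	sim_migrate(0);
	pc = get_pcpu_snapshot();		/* pinned to CPU 0's data */
	sim_migrate(1);
	return (pc->cpuid);			/* still CPU 0's data */
}
```

In this simulation `cpuid_after_migration_alias()` returns 1 and `cpuid_after_migration_snapshot()` returns 0: only the snapshot keeps referring to the CPU the code started on, which is what the rmlock code is described as assuming.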