Date:      Mon, 7 Mar 2022 06:37:46 -0800
From:      Mark Millard <marklmi@yahoo.com>
To:        Ronald Klop <ronald-lists@klop.ws>, Mark Johnston <markj@FreeBSD.org>
Cc:        bob prohaska <fbsd@www.zefox.net>, Free BSD <freebsd-arm@freebsd.org>, freebsd-current <freebsd-current@freebsd.org>
Subject:   Re: panic: data abort in critical section or under mutex  (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))
Message-ID:  <10724FB9-8E75-4DB7-A0F4-CFF55D21272B@yahoo.com>
In-Reply-To: <132978150.92.1646660769467@mailrelay>
References:  <C2F96211-0180-45DA-872F-52358D9ED35B.ref@yahoo.com> <C2F96211-0180-45DA-872F-52358D9ED35B@yahoo.com> <1800459695.1.1646649539521@mailrelay> <132978150.92.1646660769467@mailrelay>

On 2022-Mar-7, at 05:46, Ronald Klop <ronald-lists@klop.ws> wrote:

> Dear Mark Johnston,
>
> I did some binary search in the kernels and came to the conclusion that
> https://cgit.freebsd.org/src/commit/?id=1517b8d5a7f58897200497811de1b18809c07d3e
> still works and
> https://cgit.freebsd.org/src/commit/?id=407c34e735b5d17e2be574808a09e6d729b0a45a
> panics.
>
> I suspect your commit in
> https://cgit.freebsd.org/src/commit/?id=c84bb8cd771ce4bed58152e47a32dda470bef23a.
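
For anyone repeating the search, here is a minimal sketch of the bisection step itself. It assumes the kernels are ordered oldest to newest, that the first one is known good and the last one known bad, and that the panic reproduces reliably. Only the two endpoint hashes come from this thread; the intermediate names, the `first_bad` helper and the `panics` predicate are hypothetical.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], panics: Callable[[str], bool]) -> str:
    """Return the earliest commit for which panics() is true.

    Assumes commits[0] is good, commits[-1] is bad, and every commit
    after the culprit also panics.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if panics(commits[mid]):
            hi = mid  # culprit is at mid or earlier
        else:
            lo = mid  # culprit is after mid
    return commits[hi]


# Hypothetical history: only the first and last entries appear in the thread.
history = ["1517b8d5a7f5", "hypothetical-a", "hypothetical-b",
           "hypothetical-c", "407c34e735b5"]
pretend_culprit = history.index("hypothetical-b")  # made up for the demo

print(first_bad(history, lambda c: history.index(c) >= pretend_culprit))
```

Since the crash is reported as intermittent, a single clean run does not prove a kernel good; in practice `panics` would have to repeat the workload several times before answering False.
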
>
> Last panic:
>
> panic: vm_fault failed: ffff00000046e708 error 1
> cpuid = 1
> time = 1646660058
> KDB: stack backtrace:
> db_trace_self() at db_trace_self
> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> vpanic() at vpanic+0x174
> panic() at panic+0x44
> data_abort() at data_abort+0x2e8
> handle_el1h_sync() at handle_el1h_sync+0x10
> --- exception, esr 0x96000004
> _rm_rlock_debug() at _rm_rlock_debug+0x8c
> osd_get() at osd_get+0x5c
> zio_execute() at zio_execute+0xf8
> taskqueue_run_locked() at taskqueue_run_locked+0x178
> taskqueue_thread_loop() at taskqueue_thread_loop+0xc8
> fork_exit() at fork_exit+0x74
> fork_trampoline() at fork_trampoline+0x14
> KDB: enter: panic
> [ thread pid 0 tid 100129 ]
> Stopped at      kdb_enter+0x44: undefined       f902011f
> db>

Was this a WITNESS/DEBUG kernel? Non-WITNESS? Non-debug?

Which aarch64 variant? Bob's was Cortex-A53 (RPi3).

> A more recent kernel (912df91) still panics. See below.
>
> Do you have time to look into this? What information can I provide to help?
>
> Regards,
> Ronald.
>
> From: Ronald Klop <ronald-lists@klop.ws>
> Date: Monday, 7 March 2022 11:38
> To: Mark Millard <marklmi@yahoo.com>
> CC: bob prohaska <fbsd@www.zefox.net>, freebsd-current <freebsd-current@freebsd.org>, freebsd-arm@freebsd.org
> Subject: Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))
>
> Yes, I spoke too soon too. Often it crashes as soon as I start a parallel poudriere build. But this time it went very far. As soon as nightly backups kicked in it was game over again.
> I had read Bob's mail on the arm@ ML. But I wanted to leave the conclusion that it is the same problem to the developers. (I have seen enough wrong guessing of causes in my work.)
>
> I will need to go further into the binary search of working kernels.
>
> This was: FreeBSD 14.0-CURRENT #0 912df91: Wed Mar  2 00:36:35 UTC 2022
> Fatal data abort:
>   x0: ffff000000f1efd8  x0: ffff000000f1efd8 (mac_policy_rm + 0) (mac_policy_rm + 0)
>   x1:                2  x1:                2
>   x2: ffff00000087dcf2  x2: ffff00000087dcf2 (cam_status_table + 2f28a)
>  (cam_status_table + 2f28a)  x3: ffff00000087dcf2
>   x3: ffff00000087dcf2 (cam_status_table + 2f28a) (cam_status_table + 2f28a)
>   x4:              102  x4:              102
>   x5:                7  x5:                1
>   x6:                0  x6:               ff
>   x7:                0  x7: ffffa00011fc2800
>   x8:                1
>   x8:                1  x9: ffff000000f37c10
>   x9: ffff0000419d9090 (pcpu0 + 90) (g_ctx + 40278fe4)
>  x10: ffffa0017be2a600 x10: ffffa000010fa600
>  x11: 394aed08d0003a48
>  x12: 350001a8b946a108 x11:                0
>  x12: ffff000000f37c10 x13:         badecce4 (pcpu0 + 90)
>  x13: ffffa0001fbde6b0 x14:                0
>  x14:         4965ae49 x15:                1
>  x15:          1000193 x16: ffff0000016a4238
>  x16: ffff000100482d38 (__stop_set_modmetadata_set + d00) (__stop_set_modmetadata_set + 448)
>  x17: ffff00000044a998 x17: ffff00000058ff30 (free + 0) (if_inc_counter + 0)
>  x18: ffff0000b49a23c0 x18: ffff000103f11b80 (g_ctx + b3242314)
>  (next_index + 3a228c0) x19:              102
>  x19:              102 x20: ffff0000b49a2428
>  x20: ffff000103f11be8 (g_ctx + b324237c) (next_index + 3a22928)
>  x21: ffff00000087dcf2 x21: ffff00000087dcf2 (cam_status_table + 2f28a) (cam_status_table + 2f28a)
>  x22: ffff000000f1efd8 x22: ffff000000f1efd8 (mac_policy_rm + 0) (mac_policy_rm + 0)
>  x23: ffff00000086f107 x23:                0 (cam_status_table + 2069f)
>  x24: ffffa0001fbde6c8 x24: ffffa0008cba0d00
>  x25:                0
>  x25: ffff00000088aa11 x26:                4 (do_execve.fexecv_proc_title + 76b7)
>  x27:                0 x26: ffffa0017be2a600
>  x28: ffff00010209fcf0
>  x27: ffffa00025626a80 (next_index + 1bb0a30)
>  x28: ffff000103f11ce0 x29: ffff0000b49a23e0 (next_index + 3a22a20) (g_ctx + b3242334)
>  x29: ffff000103f11ba0  sp: ffff0000b49a23c0
>  (next_index + 3a228e0)  lr: ffff00000046ef98
>   sp: ffff000103f11b80
>  (_rm_runlock_debug + 60)  lr: ffff00000046ef98
>  elr: ffff00000046dc0c (_rm_runlock_debug + 60) (_rm_assert + a4)
>  elr: ffff00000046dc0cspsr:               45
>  (_rm_assert + a4) far:               10
>  esr:         96000004
> spsr:               45
>
> panic: data abort in critical section or under mutex
> cpuid = 1
> time = 1646609483
> KDB: stack backtrace:
> db_trace_self() at db_trace_self
> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> vpanic() at vpanic+0x174
> panic() at panic+0x44
> data_abort() at data_abort+0x2d4
> handle_el1h_sync() at handle_el1h_sync+0x10
> --- exception, esr 0x96000004
> _rm_assert() at _rm_assert+0xa4
> _rm_runlock_debug() at _rm_runlock_debug+0x5c
> mac_inpcb_check_deliver() at mac_inpcb_check_deliver+0x74
> tcp_input_with_port() at tcp_input_with_port+0xab4
> tcp_input() at tcp_input+0xc
> ip_input() at ip_input+0x2e8
> netisr_dispatch_src() at netisr_dispatch_src+0xe4
> ether_demux() at ether_demux+0x178
> ether_nh_input() at ether_nh_input+0x3e8
> netisr_dispatch_src() at netisr_dispatch_src+0xe4
> ether_input() at ether_input+0x80
> if_input() at if_input+0xc
> gen_intr() at gen_intr+0x444
> ithread_loop() at ithread_loop+0x2a0
> fork_exit() at fork_exit+0x74
> fork_trampoline() at fork_trampoline+0x14
> KDB: enter: panic
> [ thread pid 12 tid 100063 ]
> Stopped at      kdb_enter+0x44: undefined       f902011f
> db>
>
> NB: db> reboot/reset/halt does not work on my RPI4. Luckily I have a wifi connected power switch on it.
>
> Regards,
> Ronald.
>
> From: Mark Millard <marklmi@yahoo.com>
> Date: Monday, 7 March 2022 02:01
> To: Ronald Klop <ronald-lists@klop.ws>
> CC: freebsd-current <freebsd-current@freebsd.org>, bob prohaska <fbsd@www.zefox.net>
> Subject: Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))
>
> From: Ronald Klop <ronald-lists_at_klop.ws> wrote on
> Date: Sun, 6 Mar 2022 23:22:42 +0100 (CET) :
>
> > Did some binary search with kernels from artifact.ci.freebsd.org.
> >
> > I suspect "rmlock: Micro-optimize read locking" as cause.
> >
> > https://cgit.freebsd.org/src/commit/?id=c84bb8cd771ce4bed58152e47a32dda470bef23a
> >
> >
> > And "rmlock: Add required compiler barriers to _rm_runlock()" as solution.
> >
> > https://cgit.freebsd.org/src/commit/?id=89ae8eb74e87ac19aa2d7abe4ba16bcccd32bb9f
> >
> >
> > So I probably just had a bad day.
>
> Well, there is a report of a buildkernel crash after that pair:
>
> https://lists.freebsd.org/archives/freebsd-arm/2022-March/001078.html
>
> that references additional information at:
>
> http://www.zefox.net/~fbsd/rpi3/crashes/20220304/readme
>
> and reported:
>
> QUOTE
> The console connection dropped before the crash (unrelated) I didn't
> get the preamble, all  I have is the backtrace and buildkernel log.
> Here's the backtrace:
> db> bt
> Tracing pid 14795 tid 100098 td 0xffffa00017815600
> db_trace_self() at db_trace_self
> db_stack_trace() at db_stack_trace+0x11c
> db_command() at db_command+0x368
> db_command_loop() at db_command_loop+0x54
> db_trap() at db_trap+0xf8
> kdb_trap() at kdb_trap+0x1cc
> handle_el1h_sync() at handle_el1h_sync+0x10
> --- exception, esr 0xf2000000
> kdb_enter() at kdb_enter+0x44
> vpanic() at vpanic+0x1b0
> panic() at panic+0x44
> data_abort() at data_abort+0x2e8
> handle_el1h_sync() at handle_el1h_sync+0x10
> --- exception, esr 0x96000004
> _rm_rlock_debug() at _rm_rlock_debug+0x8c
> sysctl_root_handler_locked() at sysctl_root_handler_locked+0x140
> sysctl_root() at sysctl_root+0x1ac
> userland_sysctl() at userland_sysctl+0x140
> sys___sysctl() at sys___sysctl+0x68
> do_el0_sync() at do_el0_sync+0x520
> handle_el0_sync() at handle_el0_sync+0x40
> --- exception, esr 0x56000000
> END QUOTE

This was a WITNESS and debug kernel, as I understand it.

Also, this was an RPi3, so Cortex-A53, which has
in-order-execution cores (unlike the Cortex-A72,
for example).
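
For comparing the traces, the esr values quoted above can be decoded by hand. The following is a minimal sketch, assuming the ESR_ELx layout from the Arm architecture manual (EC in bits 31:26, IL in bit 25, ISS in bits 24:0). The tables cover only the values that show up in this thread, and the names `decode_esr`, `EC_NAMES` and `DFSC_NAMES` are my own, not anything from the FreeBSD sources.

```python
# Exception class (EC) names, limited to the values seen in this thread.
EC_NAMES = {
    0x00: "unknown reason",
    0x15: "SVC from AArch64",
    0x24: "data abort from a lower EL",
    0x25: "data abort, same EL",
    0x3C: "BRK instruction (AArch64)",
}

# Data fault status codes: only the translation-fault subset is listed.
DFSC_NAMES = {0x04 + lvl: f"translation fault, level {lvl}" for lvl in range(4)}


def decode_esr(esr: int) -> dict:
    """Split an ESR_ELx value into EC, IL and ISS, with data-abort details."""
    ec = (esr >> 26) & 0x3F
    il = (esr >> 25) & 1
    iss = esr & 0x1FFFFFF
    out = {"ec": ec, "ec_name": EC_NAMES.get(ec, "not in table"),
           "il": il, "iss": iss}
    if ec in (0x24, 0x25):
        out["dfsc"] = DFSC_NAMES.get(iss & 0x3F, "not in table")
        out["write"] = bool(iss & (1 << 6))  # WnR bit: True means a write
    return out


for value in (0x96000004, 0xF2000000, 0x56000000, 0x02000000):
    print(hex(value), decode_esr(value))
```

Under those assumptions 0x96000004 is a data abort taken from the same exception level, a level 0 translation fault on a read; 0xf2000000 is the BRK used to enter the debugger; 0x56000000 is the SVC of the system call; and the 2000000 in the subject line is EC 0, "unknown reason". The "far: 10" in the register dump would fit a read through a near-NULL pointer, though the interleaved output makes that attribution uncertain.
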

> The above material does reference _rm_rlock_debug . Might be
> related?
>
> The readme reports:
>
> main-n253603-0b25cbc79d3: Thu Mar  3 22:48:31 PST 2022
>
> for the system doing the buildkernel. This is after
> 89ae8eb74e8 .
>
> (It also mentions another panic earlier in the week,
> apparently not reported to the lists at the time.)
>

So far as I have noticed, all reports of the crashes in
_rm_rlock_debug are on aarch64 hardware. So maybe the
problem is tied to the weak memory model, but for something
that matters to a Cortex-A53's in-order-execution cores?
(Just a thought.) But then the contrasting(?) status of
powerpc64 might be of note. (And I'll stop guessing here.)

I do not know if any non-WITNESS/non-debug kernel builds
have failed.
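
To line the reports up against each other, the frame that actually faulted can be pulled out of each backtrace mechanically. This is a rough sketch that assumes the ddb trace format quoted in this thread and treats esr exception classes 0x24 and 0x25 as data aborts; the `faulting_frames` helper is a made-up name.

```python
import re
from typing import List

EXC = re.compile(r"--- exception, esr (0x[0-9a-fA-F]+)")
FRAME = re.compile(r"\w+\(\) at (\w+)\+(0x[0-9a-fA-F]+)")


def faulting_frames(trace: str) -> List[str]:
    """Return 'symbol+offset' for each frame directly after a data-abort marker."""
    found = []
    pending = False  # True when the previous line was a data-abort marker
    for line in trace.splitlines():
        marker = EXC.search(line)
        if marker:
            ec = (int(marker.group(1), 16) >> 26) & 0x3F
            pending = ec in (0x24, 0x25)
            continue
        frame = FRAME.search(line)
        if frame and pending:
            found.append(f"{frame.group(1)}+{frame.group(2)}")
        pending = False
    return found


# Excerpt of the RPi3 trace quoted earlier, quote prefixes included.
sample = """\
> --- exception, esr 0xf2000000
> kdb_enter() at kdb_enter+0x44
> vpanic() at vpanic+0x1b0
> panic() at panic+0x44
> data_abort() at data_abort+0x2e8
> handle_el1h_sync() at handle_el1h_sync+0x10
> --- exception, esr 0x96000004
> _rm_rlock_debug() at _rm_rlock_debug+0x8c
> sysctl_root_handler_locked() at sysctl_root_handler_locked+0x140
"""

print(faulting_frames(sample))
```

Run over the three traces in this message, the same rule picks out _rm_rlock_debug+0x8c twice and _rm_assert+0xa4 (called from _rm_runlock_debug) once.
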

===
Mark Millard
marklmi at yahoo.com



