Date: Sat, 24 Jun 2023 17:25:35 -0700
From: Mark Millard <marklmi@yahoo.com>
To: John F Carr <jfc@mit.edu>
Cc: Current FreeBSD <freebsd-current@freebsd.org>, freebsd-arm <freebsd-arm@freebsd.org>
Subject: Re: aarch64 main-n263493-4e8d558c9d1c-dirty (so: 2023-Jun-10) Kyuafile run: "Fatal data abort" crash during vnet_register_sysinit
Message-ID: <4A380699-7C9E-4E2E-8DCD-F9ECC2112667@yahoo.com>
In-Reply-To: <CB3569D4-8FEE-4DD3-83CE-885789E79E18@mit.edu>
References: <3FD359F8-CFCC-400F-B6DE-B635B747DE7F.ref@yahoo.com> <3FD359F8-CFCC-400F-B6DE-B635B747DE7F@yahoo.com> <CB3569D4-8FEE-4DD3-83CE-885789E79E18@mit.edu>

On Jun 24, 2023, at 14:26, John F Carr <jfc@mit.edu> wrote:

>
>> On Jun 24, 2023, at 13:00, Mark Millard <marklmi@yahoo.com> wrote:
>>
>> The running system build is a non-debug build (but
>> with symbols not stripped).
>>
>> The HoneyComb's console log shows:
>>
>> . . .
>> GEOM_STRIPE: Device stripe.IMfBZr destroyed.
>> GEOM_NOP: Device md0.nop created.
>> g_vfs_done():md0.nop[READ(offset=5885952, length=8192)]error = 5
>> GEOM_NOP: Device md0.nop removed.
>> GEOM_NOP: Device md0.nop created.
>> g_vfs_done():md0.nop[READ(offset=5935104, length=4096)]error = 5
>> g_vfs_done():md0.nop[READ(offset=5935104, length=4096)]error = 5
>> GEOM_NOP: Device md0.nop removed.
>> GEOM_NOP: Device md0.nop created.
>> GEOM_NOP: Device md0.nop removed.
>> Fatal data abort:
>> x0: ffffa02506e64400
>> x1: ffff0001ea401880 (g_raid3_post_sync + 3a145f8)
>> x2: 4b
>> x3: a343932b0b22fb30
>> x4: 0
>> x5: 3310b0d062d0e1d
>> x6: 1d0e2d060d0b3103
>> x7: 0
>> x8: ea325df8
>> x9: ffff0001eec946d0 ($d.6 + 0)
>> x10: ffff0001ea401880 (g_raid3_post_sync + 3a145f8)
>> x11: 0
>> x12: 0
>> x13: ffff000000cd8960 (lock_class_mtx_sleep + 0)
>> x14: 0
>> x15: ffffa02506e64405
>> x16: ffff0001eec94860 (_DYNAMIC + 160)
>> x17: ffff00000063a450 (ifc_attach_cloner + 0)
>> x18: ffff0001eb290400 (g_raid3_post_sync + 48a3178)
>> x19: ffff0001eec94600 (vnet_epair_init_vnet_init + 0)
>> x20: ffff000000fa5b68 (vnet_sysinit_sxlock + 18)
>> x21: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>> x22: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>> x23: ffffa0000042e500
>> x24: ffffa0000042e500
>> x25: ffff000000ce0788 (linker_lookup_set_desc + 0)
>> x26: ffffa0203cdef780
>> x27: ffff0001eec94698 (__set_sysinit_set_sym_if_epairmodule_sys_init + 0)
>> x28: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>> x29: ffff0001eb290430 (g_raid3_post_sync + 48a31a8)
>> sp: ffff0001eb290400
>> lr: ffff0001eec82a4c ($x.1 + 3c)
>> elr: ffff0001eec82a60 ($x.1 + 50)
>> spsr: 60000045
>> far: ffff0002d8fba4c8
>> esr: 96000046
>> panic: vm_fault failed: ffff0001eec82a60 error 1
>> cpuid = 14
>> time = 1687625470
>> KDB: stack backtrace:
>> db_trace_self() at db_trace_self
>> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
>> vpanic() at vpanic+0x13c
>> panic() at panic+0x44
>> data_abort() at data_abort+0x2fc
>> handle_el1h_sync() at handle_el1h_sync+0x14
>> --- exception, esr 0x96000046
>> $x.1() at $x.1+0x50
>> vnet_register_sysinit() at vnet_register_sysinit+0x114
>> linker_load_module() at linker_load_module+0xae4
>> kern_kldload() at kern_kldload+0xfc
>> sys_kldload() at sys_kldload+0x60
>> do_el0_sync() at do_el0_sync+0x608
>> handle_el0_sync() at handle_el0_sync+0x44
>> --- exception, esr 0x56000000
>> KDB: enter: panic
>> [ thread pid 70419 tid 101003 ]
>> Stopped at kdb_enter+0x44: str xzr, [x19, #3200]
>> db>
>
> The failure appears to be initializing module if_epair.

Yep: trying:

# kldload if_epair.ko

was enough to cause the crash. (Just a HoneyComb context at
that point.)

I tried media dd'd from the recent main snapshot, booting the
same system. No crash. I moved my build boot media to some
other systems and tested them: crashes. I tried my boot media
built optimized for Cortex-A53 or Cortex-X1C/Cortex-A78C
instead of Cortex-A72: no crashes. (But only one system can
use the X1C/A78C code in that build.)

So variation testing only gets the crashes for my builds
that are code-optimized for Cortex-A72's. The same source
tree vintage built for Cortex-A53 or Cortex-X1C/Cortex-A78C
optimization does not get the crashes. But I also
demonstrated an optimized-for-Cortex-A72 build from 2023-Mar
that gets the crash.

The last time I ran into one of these "crashes tied to
cortex-a72 code optimization" examples it turned out to be
some missing memory-model management code in FreeBSD's USB
code. But being lucky enough to help identify a FreeBSD
source code problem again seems not that likely.
It could easily be a code generation error by clang for all
I know.

So, unless at some point I produce fairly solid evidence
that the code actually running is messed up by FreeBSD
source code, this should likely be treated as "blame the
operator" and should likely be largely ignored as things
are. (Just My Problem, as I want the Cortex-A72 optimized
builds.) Sorry for the noise.

> I see no recent changes in that module that would be likely to break initialization.
>
> a9bfd080d09a if_epair: do not transmit packets that exceed the interface MTU
> 4d846d260e2b spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD
> a6b55ee6be15 net: replace IFF_KNOWSEPOCH with IFF_NEEDSEPOCH
> c69ae8419734 if_epair: also remove vlan metadata from mbufs
> 29c9b1673305 epair: Remove unneeded includes and sort some of the rest

My kyua run examples included a Cortex-A72 optimized system
build from 2023-Mar. It also crashes.

It looks like my last kyua runs were back in 2022-Jan or so,
associated with some ASAN and UBSAN experiments, and so
would have been on amd64, not aarch64. Otherwise any aarch64
ones would be even older. I've no useful narrowing of the
potential time frame for the problem starting.

===
Mark Millard
marklmi at yahoo.com
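
For anyone reading the console log, the two esr values in the
backtrace can be decoded by hand. The following is only a minimal
sketch, not part of the original report: the function name
decode_esr and the small lookup tables are hypothetical helpers,
the tables list just the few codes relevant here, and the field
positions are taken from the Arm architecture's ESR_ELx layout
(EC in bits 31:26, IL in bit 25, ISS in bits 24:0; for data aborts
WnR is ISS bit 6 and DFSC is ISS bits 5:0).

```python
# Sketch: decode a few ESR_ELx fields from the values in the log.
# decode_esr and the tables are illustrative, not FreeBSD code.

EC_NAMES = {
    0x15: "SVC instruction execution in AArch64 state",
    0x24: "Data Abort from a lower Exception level",
    0x25: "Data Abort taken without a change in Exception level",
}

DFSC_NAMES = {
    0x04: "Translation fault, level 0",
    0x05: "Translation fault, level 1",
    0x06: "Translation fault, level 2",
    0x07: "Translation fault, level 3",
}


def decode_esr(esr):
    """Split an ESR_ELx value into EC, IL and ISS, plus data-abort fields."""
    ec = (esr >> 26) & 0x3F        # exception class, bits 31:26
    il = (esr >> 25) & 0x1         # instruction length bit, bit 25
    iss = esr & 0x1FFFFFF          # instruction specific syndrome, bits 24:0
    fields = {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "not in this sketch's table"),
        "il": il,
        "iss": iss,
    }
    if ec in (0x24, 0x25):         # data aborts carry WnR and DFSC in the ISS
        dfsc = iss & 0x3F
        fields["wnr"] = (iss >> 6) & 0x1   # 1 = write, 0 = read
        fields["dfsc"] = dfsc
        fields["dfsc_name"] = DFSC_NAMES.get(dfsc, "not in this sketch's table")
    return fields


if __name__ == "__main__":
    for value in (0x96000046, 0x56000000):
        print(hex(value), decode_esr(value))
```

Under those assumptions, 0x96000046 reads as a data abort taken at
the same exception level (EC 0x25), on a write (WnR = 1), with a
level 2 translation fault (DFSC 0x06), and 0x56000000 reads as the
SVC (EC 0x15) for the kldload system call. That would fit a store
to an unmapped kernel address at the far value, but it says nothing
about why the address is wrong.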