Date: Mon, 26 Jun 2023 01:32:03 -0700
From: Mark Millard <marklmi@yahoo.com>
To: John F Carr <jfc@mit.edu>
Cc: Current FreeBSD <freebsd-current@freebsd.org>, freebsd-arm <freebsd-arm@freebsd.org>
Subject: Re: aarch64 main-n263493-4e8d558c9d1c-dirty (so: 2023-Jun-10) Kyuafile run: "Fatal data abort" crash during vnet_register_sysinit
Message-ID: <64F18C76-BD2A-4608-A8CC-38AC2820FC12@yahoo.com>
In-Reply-To: <4A380699-7C9E-4E2E-8DCD-F9ECC2112667@yahoo.com>
References: <3FD359F8-CFCC-400F-B6DE-B635B747DE7F.ref@yahoo.com> <3FD359F8-CFCC-400F-B6DE-B635B747DE7F@yahoo.com> <CB3569D4-8FEE-4DD3-83CE-885789E79E18@mit.edu> <4A380699-7C9E-4E2E-8DCD-F9ECC2112667@yahoo.com>
On Jun 24, 2023, at 17:25, Mark Millard <marklmi@yahoo.com> wrote:

> On Jun 24, 2023, at 14:26, John F Carr <jfc@mit.edu> wrote:
>
>>
>>> On Jun 24, 2023, at 13:00, Mark Millard <marklmi@yahoo.com> wrote:
>>>
>>> The running system build is a non-debug build (but
>>> with symbols not stripped).
>>>
>>> The HoneyComb's console log shows:
>>>
>>> . . .
>>> GEOM_STRIPE: Device stripe.IMfBZr destroyed.
>>> GEOM_NOP: Device md0.nop created.
>>> g_vfs_done():md0.nop[READ(offset=5885952, length=8192)]error = 5
>>> GEOM_NOP: Device md0.nop removed.
>>> GEOM_NOP: Device md0.nop created.
>>> g_vfs_done():md0.nop[READ(offset=5935104, length=4096)]error = 5
>>> g_vfs_done():md0.nop[READ(offset=5935104, length=4096)]error = 5
>>> GEOM_NOP: Device md0.nop removed.
>>> GEOM_NOP: Device md0.nop created.
>>> GEOM_NOP: Device md0.nop removed.
>>> Fatal data abort:
>>> x0: ffffa02506e64400
>>> x1: ffff0001ea401880 (g_raid3_post_sync + 3a145f8)
>>> x2: 4b
>>> x3: a343932b0b22fb30
>>> x4: 0
>>> x5: 3310b0d062d0e1d
>>> x6: 1d0e2d060d0b3103
>>> x7: 0
>>> x8: ea325df8
>>> x9: ffff0001eec946d0 ($d.6 + 0)
>>> x10: ffff0001ea401880 (g_raid3_post_sync + 3a145f8)
>>> x11: 0
>>> x12: 0
>>> x13: ffff000000cd8960 (lock_class_mtx_sleep + 0)
>>> x14: 0
>>> x15: ffffa02506e64405
>>> x16: ffff0001eec94860 (_DYNAMIC + 160)
>>> x17: ffff00000063a450 (ifc_attach_cloner + 0)
>>> x18: ffff0001eb290400 (g_raid3_post_sync + 48a3178)
>>> x19: ffff0001eec94600 (vnet_epair_init_vnet_init + 0)
>>> x20: ffff000000fa5b68 (vnet_sysinit_sxlock + 18)
>>> x21: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>> x22: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>> x23: ffffa0000042e500
>>> x24: ffffa0000042e500
>>> x25: ffff000000ce0788 (linker_lookup_set_desc + 0)
>>> x26: ffffa0203cdef780
>>> x27: ffff0001eec94698 (__set_sysinit_set_sym_if_epairmodule_sys_init + 0)
>>> x28: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>> x29: ffff0001eb290430 (g_raid3_post_sync + 48a31a8)
>>> sp: ffff0001eb290400
>>> lr: ffff0001eec82a4c ($x.1 + 3c)
>>> elr: ffff0001eec82a60 ($x.1 + 50)
>>> spsr: 60000045
>>> far: ffff0002d8fba4c8
>>> esr: 96000046
>>> panic: vm_fault failed: ffff0001eec82a60 error 1
>>> cpuid = 14
>>> time = 1687625470
>>> KDB: stack backtrace:
>>> db_trace_self() at db_trace_self
>>> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
>>> vpanic() at vpanic+0x13c
>>> panic() at panic+0x44
>>> data_abort() at data_abort+0x2fc
>>> handle_el1h_sync() at handle_el1h_sync+0x14
>>> --- exception, esr 0x96000046
>>> $x.1() at $x.1+0x50
>>> vnet_register_sysinit() at vnet_register_sysinit+0x114
>>> linker_load_module() at linker_load_module+0xae4
>>> kern_kldload() at kern_kldload+0xfc
>>> sys_kldload() at sys_kldload+0x60
>>> do_el0_sync() at do_el0_sync+0x608
>>> handle_el0_sync() at handle_el0_sync+0x44
>>> --- exception, esr 0x56000000
>>> KDB: enter: panic
>>> [ thread pid 70419 tid 101003 ]
>>> Stopped at kdb_enter+0x44: str xzr, [x19, #3200]
>>> db>
>>
>> The failure appears to be initializing module if_epair.
>
> Yep: trying:
>
> # kldload if_epair.ko
>
> was enough to cause the crash. (Just a HoneyComb context at
> that point.)
>
> I tried media dd'd from the recent main snapshot, booting the
> same system. No crash. I moved my build boot media to some
> other systems and tested them: crashes. I tried my boot media
> built optimized for Cortex-A53 or Cortex-X1C/Cortex-A78C
> instead of Cortex-A72: no crashes. (But only one system can
> use the X1C/A78C code in that build.)
>
> So variation testing only gets the crashes for my builds
> that are code-optimized for Cortex-A72's. The same source
> tree vintage built for Cortex-A53 or Cortex-X1C/Cortex-A78C
> optimization does not get the crashes. But I also
> demonstrated an optimized for Cortex-A72 build from 2023-Mar
> that gets the crash.
>
> The last time I ran into one of these "crashes tied to
> cortex-a72 code optimization" examples it turned out to be
> some missing memory-model management code in FreeBSD's USB
> code. But being lucky enough to help identify a FreeBSD
> source code problem again seems not that likely. It could
> easily be a code generation error by clang for all I know.
>
> So, unless at some point I produce fairly solid evidence
> that the code actually running is messed up by FreeBSD
> source code, this should likely be treated as "blame the
> operator" and should likely be largely ignored as things
> are. (Just My Problem, as I want the Cortex-A72 optimized
> builds.)

Turns out that the source code in question is the
assignment to V_epair_cloner below:

static void
vnet_epair_init(const void *unused __unused)
{
        struct if_clone_addreq req = {
                .match_f = epair_clone_match,
                .create_f = epair_clone_create,
                .destroy_f = epair_clone_destroy,
        };
        V_epair_cloner = ifc_attach_cloner(epairname, &req);
}
VNET_SYSINIT(vnet_epair_init, SI_SUB_PSEUDO, SI_ORDER_ANY,
    vnet_epair_init, NULL);

Example code when not optimizing for the Cortex-A72:

11a4c: d0000089 adrp x9, 0x23000
11a50: f9400248 ldr x8, [x18]
11a54: f942c508 ldr x8, [x8, #1416]
11a58: f943d929 ldr x9, [x9, #1968]
11a5c: a9437bfd ldp x29, x30, [sp, #48]
11a60: f9401508 ldr x8, [x8, #40]
11a64: f8296900 str x0, [x8, x9]

The code when optimizing for the Cortex-A72:

11a4c: f9400248 ldr x8, [x18]
11a50: f942c508 ldr x8, [x8, #1416]
11a54: d503201f nop
11a58: 1008e3c9 adr x9, #72824
11a5c: f9401508 ldr x8, [x8, #40]
11a60: f8296900 str x0, [x8, x9]
11a64: a9437bfd ldp x29, x30, [sp, #48]

It is the "str x0, [x8, x9]" that vm_faults for the
optimized code.
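
For anyone wanting to re-check the two listings by hand, here is a
small Python sketch (not part of the original analysis) that decodes
the three instruction words involved, using the A64 ADR/ADRP and
LDR (immediate, unsigned offset) encodings. The helper names are
made up for illustration; the instruction words and their offsets
are copied from the listings above.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode_adr(word, pc):
    """Return (rd, target) for an ADR or ADRP instruction located at pc."""
    assert (word >> 24) & 0x1F == 0b10000, "not ADR/ADRP"
    immlo = (word >> 29) & 0x3
    immhi = (word >> 5) & 0x7FFFF
    imm = sign_extend((immhi << 2) | immlo, 21)
    rd = word & 0x1F
    if word >> 31:
        # ADRP: the immediate counts 4 KiB pages from the page holding pc.
        return rd, (pc & ~0xFFF) + (imm << 12)
    # ADR: the immediate is a byte offset from the instruction itself.
    return rd, pc + imm


def decode_ldr_uimm64(word):
    """Return (rt, rn, byte_offset) for LDR Xt, [Xn, #imm] (unsigned offset)."""
    assert word >> 22 == 0b1111100101, "not a 64-bit LDR (unsigned offset)"
    return word & 0x1F, (word >> 5) & 0x1F, ((word >> 10) & 0xFFF) * 8


# Non-A72 sequence: adrp forms a page address, then ldr reads x9 from memory.
_, page = decode_adr(0xD0000089, 0x11A4C)
_, _, ldr_offset = decode_ldr_uimm64(0xF943D929)
load_slot = page + ldr_offset

# A72 sequence: adr computes x9 directly; nothing is read from memory.
_, adr_target = decode_adr(0x1008E3C9, 0x11A58)

print(f"non-A72: x9 = contents of [{load_slot:#x}]")  # 0x237b0
print(f"A72:     x9 = address {adr_target:#x}")       # 0x236d0
```

The point of the comparison: the first sequence takes x9 from a
memory slot (so whatever is stored there at run time is what gets
used), while the second hard-wires x9 to an address computed from
where the adr instruction itself sits.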
So:

11a4c: d0000089 adrp x9, 0x23000
11a58: f943d929 ldr x9, [x9, #1968]

was optimized via replacement by:

11a58: 1008e3c9 adr x9, #72824

I.e., the optimization is based on the offset from the
instruction being fixed in order to produce the value
in x9, even if the instruction is relocated.

This resulted in the specific x9 value shown in the
x8/x9 pair:

x8: ea325df8
x9: ffff0001eec946d0

which totals to the fault address (value in far):

far: ffff0002d8fba4c8

> Sorry for the noise.
>
>> I see no recent changes in that module that would be likely to break initialization.
>>
>> a9bfd080d09a if_epair: do not transmit packets that exceed the interface MTU
>> 4d846d260e2b spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD
>> a6b55ee6be15 net: replace IFF_KNOWSEPOCH with IFF_NEEDSEPOCH
>> c69ae8419734 if_epair: also remove vlan metadata from mbufs
>> 29c9b1673305 epair: Remove unneeded includes and sort some of the rest
>
> My kyua run examples included a Cortex-A72 optimized system build
> from last 2023-Mar. It also crashes. It looks like my last kyua
> runs were back in 2022-Jan or so, associated with some ASAN and
> UBSAN experiments --and so would have been on amd64, not aarch64.
> Otherwise any aarch64 ones would be even older. I've no useful
> narrowing of the potential time frame for the problem starting.

===
Mark Millard
marklmi at yahoo.com
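
P.S. The address arithmetic can be re-checked mechanically. The
Python sketch below is only a consistency check on the values in
the register dump; it assumes (my inference, not something the
panic output states) that the listing offsets are relative to
where the module was loaded and that elr points at the str at
offset 11a60. The names are hypothetical.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

# Values copied from the register dump.
ELR = 0xFFFF0001EEC82A60
X8 = 0xEA325DF8
X9 = 0xFFFF0001EEC946D0
FAR = 0xFFFF0002D8FBA4C8

# Values copied from the Cortex-A72 listing.
STR_OFFSET = 0x11A60          # offset of "str x0, [x8, x9]"
ADR_OFFSET = 0x11A58          # offset of "adr x9, #72824"
ADR_IMMEDIATE = 72824


def register_offset_address(base_reg, index_reg):
    """Effective address of str x0, [base, index] with 64-bit wraparound."""
    return (base_reg + index_reg) & MASK64


# Assumption: module load address = faulting pc minus the str's listing offset.
load_base = ELR - STR_OFFSET

# What the adr instruction yields once the module sits at load_base.
adr_result = load_base + ADR_OFFSET + ADR_IMMEDIATE

print(f"x8 + x9    = {register_offset_address(X8, X9):#x}")
print(f"far        = {FAR:#x}")
print(f"adr result = {adr_result:#x}")
print(f"x9         = {X9:#x}")
```

Under that assumption the adr result equals the x9 value in the
dump, and x8 + x9 equals far, which is consistent with x9 holding
an absolute kernel address at the point where it is added to x8.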