Date: Mon, 26 Jun 2023 08:59:14 -0700
From: Mark Millard <marklmi@yahoo.com>
To: John F Carr <jfc@mit.edu>
Cc: Current FreeBSD <freebsd-current@freebsd.org>, freebsd-arm <freebsd-arm@freebsd.org>
Subject: Re: aarch64 main-n263493-4e8d558c9d1c-dirty (so: 2023-Jun-10) Kyuafile run: "Fatal data abort" crash during vnet_register_sysinit
Message-ID: <79849041-5E0E-4244-9BA7-F7F1C673F31F@yahoo.com>
In-Reply-To: <2E9684B7-9359-4A3D-A0C2-C1D2B221F2C4@mit.edu>
References: <3FD359F8-CFCC-400F-B6DE-B635B747DE7F.ref@yahoo.com> <3FD359F8-CFCC-400F-B6DE-B635B747DE7F@yahoo.com> <CB3569D4-8FEE-4DD3-83CE-885789E79E18@mit.edu> <4A380699-7C9E-4E2E-8DCD-F9ECC2112667@yahoo.com> <64F18C76-BD2A-4608-A8CC-38AC2820FC12@yahoo.com> <2E9684B7-9359-4A3D-A0C2-C1D2B221F2C4@mit.edu>
On Jun 26, 2023, at 07:29, John F Carr <jfc@mit.edu> wrote:

>> On Jun 26, 2023, at 04:32, Mark Millard <marklmi@yahoo.com> wrote:
>> 
>> On Jun 24, 2023, at 17:25, Mark Millard <marklmi@yahoo.com> wrote:
>> 
>>> On Jun 24, 2023, at 14:26, John F Carr <jfc@mit.edu> wrote:
>>> 
>>>> 
>>>>> On Jun 24, 2023, at 13:00, Mark Millard <marklmi@yahoo.com> wrote:
>>>>> 
>>>>> The running system build is a non-debug build (but
>>>>> with symbols not stripped).
>>>>> 
>>>>> The HoneyComb's console log shows:
>>>>> 
>>>>> . . .
>>>>> GEOM_STRIPE: Device stripe.IMfBZr destroyed.
>>>>> GEOM_NOP: Device md0.nop created.
>>>>> g_vfs_done():md0.nop[READ(offset=5885952, length=8192)]error = 5
>>>>> GEOM_NOP: Device md0.nop removed.
>>>>> GEOM_NOP: Device md0.nop created.
>>>>> g_vfs_done():md0.nop[READ(offset=5935104, length=4096)]error = 5
>>>>> g_vfs_done():md0.nop[READ(offset=5935104, length=4096)]error = 5
>>>>> GEOM_NOP: Device md0.nop removed.
>>>>> GEOM_NOP: Device md0.nop created.
>>>>> GEOM_NOP: Device md0.nop removed.
>>>>> Fatal data abort:
>>>>>   x0: ffffa02506e64400
>>>>>   x1: ffff0001ea401880 (g_raid3_post_sync + 3a145f8)
>>>>>   x2: 4b
>>>>>   x3: a343932b0b22fb30
>>>>>   x4: 0
>>>>>   x5: 3310b0d062d0e1d
>>>>>   x6: 1d0e2d060d0b3103
>>>>>   x7: 0
>>>>>   x8: ea325df8
>>>>>   x9: ffff0001eec946d0 ($d.6 + 0)
>>>>>  x10: ffff0001ea401880 (g_raid3_post_sync + 3a145f8)
>>>>>  x11: 0
>>>>>  x12: 0
>>>>>  x13: ffff000000cd8960 (lock_class_mtx_sleep + 0)
>>>>>  x14: 0
>>>>>  x15: ffffa02506e64405
>>>>>  x16: ffff0001eec94860 (_DYNAMIC + 160)
>>>>>  x17: ffff00000063a450 (ifc_attach_cloner + 0)
>>>>>  x18: ffff0001eb290400 (g_raid3_post_sync + 48a3178)
>>>>>  x19: ffff0001eec94600 (vnet_epair_init_vnet_init + 0)
>>>>>  x20: ffff000000fa5b68 (vnet_sysinit_sxlock + 18)
>>>>>  x21: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>>>>  x22: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>>>>  x23: ffffa0000042e500
>>>>>  x24: ffffa0000042e500
>>>>>  x25: ffff000000ce0788 (linker_lookup_set_desc + 0)
>>>>>  x26: ffffa0203cdef780
>>>>>  x27: ffff0001eec94698 (__set_sysinit_set_sym_if_epairmodule_sys_init + 0)
>>>>>  x28: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>>>>  x29: ffff0001eb290430 (g_raid3_post_sync + 48a31a8)
>>>>>   sp: ffff0001eb290400
>>>>>   lr: ffff0001eec82a4c ($x.1 + 3c)
>>>>>  elr: ffff0001eec82a60 ($x.1 + 50)
>>>>> spsr: 60000045
>>>>>  far: ffff0002d8fba4c8
>>>>>  esr: 96000046
>>>>> panic: vm_fault failed: ffff0001eec82a60 error 1
>>>>> cpuid = 14
>>>>> time = 1687625470
>>>>> KDB: stack backtrace:
>>>>> db_trace_self() at db_trace_self
>>>>> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
>>>>> vpanic() at vpanic+0x13c
>>>>> panic() at panic+0x44
>>>>> data_abort() at data_abort+0x2fc
>>>>> handle_el1h_sync() at handle_el1h_sync+0x14
>>>>> --- exception, esr 0x96000046
>>>>> $x.1() at $x.1+0x50
>>>>> vnet_register_sysinit() at vnet_register_sysinit+0x114
>>>>> linker_load_module() at linker_load_module+0xae4
>>>>> kern_kldload() at kern_kldload+0xfc
>>>>> sys_kldload() at sys_kldload+0x60
>>>>> do_el0_sync() at do_el0_sync+0x608
>>>>> handle_el0_sync() at handle_el0_sync+0x44
>>>>> --- exception, esr 0x56000000
>>>>> KDB: enter: panic
>>>>> [ thread pid 70419 tid 101003 ]
>>>>> Stopped at kdb_enter+0x44: str xzr, [x19, #3200]
>>>>> db> 
>>>> 
>>>> The failure appears to be initializing module if_epair.
>>> 
>>> Yep: trying:
>>> 
>>> # kldload if_epair.ko
>>> 
>>> was enough to cause the crash. (Just a HoneyComb context at
>>> that point.)
>>> 
>>> I tried media dd'd from the recent main snapshot, booting the
>>> same system. No crash. I moved my build boot media to some
>>> other systems and tested them: crashes. I tried my boot media
>>> built optimized for Cortex-A53 or Cortex-X1C/Cortex-A78C
>>> instead of Cortex-A72: no crashes. (But only one system can
>>> use the X1C/A78C code in that build.)
>>> 
>>> So variation testing gets the crashes only for my builds that
>>> are code-optimized for Cortex-A72s. The same source tree
>>> vintage built for Cortex-A53 or Cortex-X1C/Cortex-A78C
>>> optimization does not get the crashes. But I also
>>> demonstrated an optimized-for-Cortex-A72 build from 2023-Mar
>>> that gets the crash.
>>> 
>>> The last time I ran into one of these "crashes tied to
>>> Cortex-A72 code optimization" examples, it turned out to be
>>> some missing memory-model management code in FreeBSD's USB
>>> code. But being lucky enough to help identify a FreeBSD
>>> source code problem again seems not that likely. It could
>>> easily be a code generation error by clang for all I know.
>>> 
>>> So, unless at some point I produce fairly solid evidence
>>> that the code actually running is messed up by FreeBSD
>>> source code, this should likely be treated as "blame the
>>> operator" and largely ignored as things are. (Just My
>>> Problem, as I want the Cortex-A72 optimized builds.)
>> 
>> Turns out that the source code in question is the
>> assignment to V_epair_cloner below:
>> 
>> static void
>> vnet_epair_init(const void *unused __unused)
>> {
>>         struct if_clone_addreq req = {
>>                 .match_f = epair_clone_match,
>>                 .create_f = epair_clone_create,
>>                 .destroy_f = epair_clone_destroy,
>>         };
>>         V_epair_cloner = ifc_attach_cloner(epairname, &req);
>> }
>> VNET_SYSINIT(vnet_epair_init, SI_SUB_PSEUDO, SI_ORDER_ANY,
>>     vnet_epair_init, NULL);
>> 
>> Example code when not optimizing for the Cortex-A72:
>> 
>> 11a4c: d0000089      adrp    x9, 0x23000
>> 11a50: f9400248      ldr     x8, [x18]
>> 11a54: f942c508      ldr     x8, [x8, #1416]
>> 11a58: f943d929      ldr     x9, [x9, #1968]
>> 11a5c: a9437bfd      ldp     x29, x30, [sp, #48]
>> 11a60: f9401508      ldr     x8, [x8, #40]
>> 11a64: f8296900      str     x0, [x8, x9]
>> 
>> The code when optimizing for the Cortex-A72:
>> 
>> 11a4c: f9400248      ldr     x8, [x18]
>> 11a50: f942c508      ldr     x8, [x8, #1416]
>> 11a54: d503201f      nop
>> 11a58: 1008e3c9      adr     x9, #72824
>> 11a5c: f9401508      ldr     x8, [x8, #40]
>> 11a60: f8296900      str     x0, [x8, x9]
>> 11a64: a9437bfd      ldp     x29, x30, [sp, #48]
>> 
>> It is the "str x0, [x8, x9]" that vm_faults for
>> the optimized code.
>> 
>> So:
>> 
>> 11a4c: d0000089      adrp    x9, 0x23000
>> 11a58: f943d929      ldr     x9, [x9, #1968]
>> 
>> was optimized via replacement by:
>> 
>> 11a58: 1008e3c9      adr     x9, #72824
>> 
>> I.e., the optimization produces the value in x9 from a
>> fixed offset relative to the instruction's own address,
>> an assumption that no longer holds once the instruction
>> is relocated.
>> 
>> This resulted in the specific x9 value shown in
>> the x8/x9 pair:
>> 
>> x8: ea325df8
>> x9: ffff0001eec946d0
>> 
>> which totals to the fault address (the value
>> in far):
>> 
>> far: ffff0002d8fba4c8
>> 
> 
> Is this the same as bug 264094?
> 
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=264094

Well, the not-Cortex-A72-optimized .o stage code vs. the
Cortex-A72-optimized .o stage code looks like:

(not Cortex-A72 optimized)

3c: 90000009      adrp    x9, 0x0 <vnet_epair_init+0x3c>
40: f9400248      ldr     x8, [x18]
44: f942c508      ldr     x8, [x8, #1416]
48: f9400129      ldr     x9, [x9]
4c: a9437bfd      ldp     x29, x30, [sp, #48]
50: f9401508      ldr     x8, [x8, #40]
54: f8296900      str     x0, [x8, x9]

vs.

(Cortex-A72 optimized)

3c: f9400248      ldr     x8, [x18]
40: f942c508      ldr     x8, [x8, #1416]
44: 90000009      adrp    x9, 0x0 <vnet_epair_init+0x44>
48: f9400129      ldr     x9, [x9]
4c: f9401508      ldr     x8, [x8, #40]
50: f8296900      str     x0, [x8, x9]
54: a9437bfd      ldp     x29, x30, [sp, #48]

(The x29 lines have a different purpose, but I show the
sequencing as reported by objdump to show that it is basically
an ordering difference at the .o stage.)

As for if_epair.kld production, the .meta files show:

CMD ld -m aarch64elf -warn-common --build-id=sha1 -r -o if_epair.kld if_epair.o
CMD ctfmerge -L VERSION -g -o if_epair.kld if_epair.o
CMD :> export_syms
CMD awk -f /usr/main-src/sys/conf/kmod_syms.awk if_epair.kld export_syms | xargs -J% objcopy % if_epair.kld
CWD /usr/obj/BUILDs/main-CA72-nodbg-clang-alt/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72/modules/usr/main-src/sys/modules/if_epair

vs.

CMD ld -m aarch64elf -warn-common --build-id=sha1 -r -o if_epair.kld if_epair.o
CMD ctfmerge -L VERSION -g -o if_epair.kld if_epair.o
CMD :> export_syms
CMD awk -f /usr/main-src/sys/conf/kmod_syms.awk if_epair.kld export_syms | xargs -J% objcopy % if_epair.kld
CWD /usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72/modules/usr/main-src/sys/modules/if_epair

It looks to me like the code ordering differences in the .o's
may be all that lead to the differing .kld results for setting
x9. If so, it is not good to be that dependent on minor .o
stage code generation differences for whether things will be
operational vs. not.

===
Mark Millard
marklmi at yahoo.com
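P.S. For anyone who wants to double-check the numbers quoted in the message, here is a quick sketch (mine, not toolchain output) that decodes the ADR/ADRP immediates per the standard AArch64 encoding (immlo in bits 30:29, immhi in bits 23:5) and confirms that the x8 + x9 sum from the faulting "str x0, [x8, x9]" wraps to the reported far value:

```python
def adr_target(insn, pc):
    """Return the address computed by an AArch64 ADR/ADRP instruction word."""
    imm = (((insn >> 5) & 0x7FFFF) << 2) | ((insn >> 29) & 0x3)
    if imm & (1 << 20):                 # sign-extend the 21-bit immediate
        imm -= 1 << 21
    if insn >> 31:                      # ADRP: immediate scales by 4 KiB pages
        return (pc & ~0xFFF) + (imm << 12)
    return pc + imm                     # ADR: plain byte offset from pc

# "11a58: 1008e3c9  adr x9, #72824" (Cortex-A72-optimized build):
assert adr_target(0x1008E3C9, 0x11A58) == 0x11A58 + 72824

# "11a4c: d0000089  adrp x9, 0x23000" (not-Cortex-A72-optimized build):
assert adr_target(0xD0000089, 0x11A4C) == 0x23000

# Register dump: x8 + x9 (mod 2**64) is the fault address in far.
x8, x9 = 0xEA325DF8, 0xFFFF0001EEC946D0
assert (x8 + x9) & 0xFFFFFFFFFFFFFFFF == 0xFFFF0002D8FBA4C8  # far
```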