Date:      Mon, 26 Jun 2023 14:29:04 +0000
From:      John F Carr <jfc@mit.edu>
To:        Mark Millard <marklmi@yahoo.com>
Cc:        Current FreeBSD <freebsd-current@freebsd.org>, freebsd-arm <freebsd-arm@freebsd.org>
Subject:   Re: aarch64 main-n263493-4e8d558c9d1c-dirty (so: 2023-Jun-10) Kyuafile run: "Fatal data abort" crash during vnet_register_sysinit
Message-ID:  <2E9684B7-9359-4A3D-A0C2-C1D2B221F2C4@mit.edu>
In-Reply-To: <64F18C76-BD2A-4608-A8CC-38AC2820FC12@yahoo.com>
References:  <3FD359F8-CFCC-400F-B6DE-B635B747DE7F.ref@yahoo.com> <3FD359F8-CFCC-400F-B6DE-B635B747DE7F@yahoo.com> <CB3569D4-8FEE-4DD3-83CE-885789E79E18@mit.edu> <4A380699-7C9E-4E2E-8DCD-F9ECC2112667@yahoo.com> <64F18C76-BD2A-4608-A8CC-38AC2820FC12@yahoo.com>

> On Jun 26, 2023, at 04:32, Mark Millard <marklmi@yahoo.com> wrote:
>
> On Jun 24, 2023, at 17:25, Mark Millard <marklmi@yahoo.com> wrote:
>
>> On Jun 24, 2023, at 14:26, John F Carr <jfc@mit.edu> wrote:
>>
>>>
>>>> On Jun 24, 2023, at 13:00, Mark Millard <marklmi@yahoo.com> wrote:
>>>>
>>>> The running system build is a non-debug build (but
>>>> with symbols not stripped).
>>>>
>>>> The HoneyComb's console log shows:
>>>>
>>>> . . .
>>>> GEOM_STRIPE: Device stripe.IMfBZr destroyed.
>>>> GEOM_NOP: Device md0.nop created.
>>>> g_vfs_done():md0.nop[READ(offset=5885952, length=8192)]error = 5
>>>> GEOM_NOP: Device md0.nop removed.
>>>> GEOM_NOP: Device md0.nop created.
>>>> g_vfs_done():md0.nop[READ(offset=5935104, length=4096)]error = 5
>>>> g_vfs_done():md0.nop[READ(offset=5935104, length=4096)]error = 5
>>>> GEOM_NOP: Device md0.nop removed.
>>>> GEOM_NOP: Device md0.nop created.
>>>> GEOM_NOP: Device md0.nop removed.
>>>> Fatal data abort:
>>>> x0: ffffa02506e64400
>>>> x1: ffff0001ea401880 (g_raid3_post_sync + 3a145f8)
>>>> x2:               4b
>>>> x3: a343932b0b22fb30
>>>> x4:                0
>>>> x5:  3310b0d062d0e1d
>>>> x6: 1d0e2d060d0b3103
>>>> x7:                0
>>>> x8:         ea325df8
>>>> x9: ffff0001eec946d0 ($d.6 + 0)
>>>> x10: ffff0001ea401880 (g_raid3_post_sync + 3a145f8)
>>>> x11:                0
>>>> x12:                0
>>>> x13: ffff000000cd8960 (lock_class_mtx_sleep + 0)
>>>> x14:                0
>>>> x15: ffffa02506e64405
>>>> x16: ffff0001eec94860 (_DYNAMIC + 160)
>>>> x17: ffff00000063a450 (ifc_attach_cloner + 0)
>>>> x18: ffff0001eb290400 (g_raid3_post_sync + 48a3178)
>>>> x19: ffff0001eec94600 (vnet_epair_init_vnet_init + 0)
>>>> x20: ffff000000fa5b68 (vnet_sysinit_sxlock + 18)
>>>> x21: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>>> x22: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>>> x23: ffffa0000042e500
>>>> x24: ffffa0000042e500
>>>> x25: ffff000000ce0788 (linker_lookup_set_desc + 0)
>>>> x26: ffffa0203cdef780
>>>> x27: ffff0001eec94698 (__set_sysinit_set_sym_if_epairmodule_sys_init + 0)
>>>> x28: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>>> x29: ffff0001eb290430 (g_raid3_post_sync + 48a31a8)
>>>> sp: ffff0001eb290400
>>>> lr: ffff0001eec82a4c ($x.1 + 3c)
>>>> elr: ffff0001eec82a60 ($x.1 + 50)
>>>> spsr:         60000045
>>>> far: ffff0002d8fba4c8
>>>> esr:         96000046
>>>> panic: vm_fault failed: ffff0001eec82a60 error 1
>>>> cpuid = 14
>>>> time = 1687625470
>>>> KDB: stack backtrace:
>>>> db_trace_self() at db_trace_self
>>>> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
>>>> vpanic() at vpanic+0x13c
>>>> panic() at panic+0x44
>>>> data_abort() at data_abort+0x2fc
>>>> handle_el1h_sync() at handle_el1h_sync+0x14
>>>> --- exception, esr 0x96000046
>>>> $x.1() at $x.1+0x50
>>>> vnet_register_sysinit() at vnet_register_sysinit+0x114
>>>> linker_load_module() at linker_load_module+0xae4
>>>> kern_kldload() at kern_kldload+0xfc
>>>> sys_kldload() at sys_kldload+0x60
>>>> do_el0_sync() at do_el0_sync+0x608
>>>> handle_el0_sync() at handle_el0_sync+0x44
>>>> --- exception, esr 0x56000000
>>>> KDB: enter: panic
>>>> [ thread pid 70419 tid 101003 ]
>>>> Stopped at      kdb_enter+0x44: str     xzr, [x19, #3200]
>>>> db>
>>>
>>> The failure appears to be initializing module if_epair.
>>
>> Yep: trying:
>>
>> # kldload if_epair.ko
>>
>> was enough to cause the crash. (Just a HoneyComb context at
>> that point.)
>>
>> I tried media dd'd from the recent main snapshot, booting the
>> same system. No crash. I moved my build boot media to some
>> other systems and tested them: crashes. I tried my boot media
>> built optimized for Cortex-A53 or Cortex-X1C/Cortex-A78C
>> instead of Cortex-A72: no crashes. (But only one system can
>> use the X1C/A78C code in that build.)
>>
>> So variation testing only gets the crashes for my builds
>> that are code-optimized for the Cortex-A72. The same source
>> tree vintage built for Cortex-A53 or Cortex-X1C/Cortex-A78C
>> optimization does not get the crashes. But I also
>> demonstrated an optimized-for-Cortex-A72 build from 2023-Mar
>> that gets the crash.
>>
>> The last time I ran into one of these "crashes tied to
>> cortex-a72 code optimization" examples it turned out to be
>> some missing memory-model management code in FreeBSD's USB
>> code. But being lucky enough to help identify a FreeBSD
>> source code problem again seems not that likely. It could
>> easily be a code generation error by clang for all I know.
>>
>> So, unless at some point I produce fairly solid evidence
>> that the code actually running is messed up by FreeBSD
>> source code, this should likely be treated as "blame the
>> operator" and should likely be largely ignored as things
>> are. (Just My Problem, as I want the Cortex-A72 optimized
>> builds.)
>
> Turns out that the source code in question is the
> assignment to V_epair_cloner below:
>
> static void
> vnet_epair_init(const void *unused __unused)
> {
>        struct if_clone_addreq req = {
>                .match_f = epair_clone_match,
>                .create_f = epair_clone_create,
>                .destroy_f = epair_clone_destroy,
>        };
>        V_epair_cloner = ifc_attach_cloner(epairname, &req);
> }
> VNET_SYSINIT(vnet_epair_init, SI_SUB_PSEUDO, SI_ORDER_ANY,
>    vnet_epair_init, NULL);
>
> Example code when not optimizing for the Cortex-A72:
>
>   11a4c: d0000089      adrp    x9, 0x23000
>   11a50: f9400248      ldr     x8, [x18]
>   11a54: f942c508      ldr     x8, [x8, #1416]
>   11a58: f943d929      ldr     x9, [x9, #1968]
>   11a5c: a9437bfd      ldp     x29, x30, [sp, #48]
>   11a60: f9401508      ldr     x8, [x8, #40]
>   11a64: f8296900      str     x0, [x8, x9]
>
> The code when optimizing for the Cortex-A72:
>
>   11a4c: f9400248      ldr     x8, [x18]
>   11a50: f942c508      ldr     x8, [x8, #1416]
>   11a54: d503201f      nop
>   11a58: 1008e3c9      adr     x9, #72824
>   11a5c: f9401508      ldr     x8, [x8, #40]
>   11a60: f8296900      str     x0, [x8, x9]
>   11a64: a9437bfd      ldp     x29, x30, [sp, #48]
>
> It is the "str x0, [x8, x9]" that triggers the vm_fault
> failure in the optimized code.
>
> So:
>
>   11a4c: d0000089      adrp    x9, 0x23000
>   11a58: f943d929      ldr     x9, [x9, #1968]
>
> was optimized via replacement by:
>
>   11a58: 1008e3c9      adr     x9, #72824
>
> I.e., the optimization assumes the value for x9 can be
> produced from a fixed offset relative to the instruction
> itself, wherever the instruction ends up being relocated.
>
> This resulted in the specific x9 value shown in
> the x8/x9 pair:
>
> x8:         ea325df8
> x9: ffff0001eec946d0
>
> which totals to the fault address (value
> in far):
>
> far: ffff0002d8fba4c8
>
>
Is this the same as bug 264094?

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=264094
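
The register arithmetic in the quoted analysis can be cross-checked with
a short Python sketch. It is only a sanity check, not part of the original
report: the constants are copied from the console log and disassembly
quoted above, the helper name decode_adr is made up for illustration, and
the module load base is inferred on the assumption that the listing
offset 0x11a60 (the faulting str) corresponds to the reported elr.

```python
# Values copied from the quoted console log and disassembly.
ADR_WORD = 0x1008E3C9         # "adr x9, #72824" at listing offset 0x11a58
ADR_OFFSET = 0x11A58
STR_OFFSET = 0x11A60          # faulting "str x0, [x8, x9]"
ELR = 0xFFFF0001EEC82A60
X8 = 0x00000000EA325DF8
X9 = 0xFFFF0001EEC946D0
FAR = 0xFFFF0002D8FBA4C8
MASK64 = (1 << 64) - 1


def decode_adr(word):
    """Return (rd, signed byte offset) for an AArch64 ADR instruction word.

    Hypothetical helper for this check only.
    """
    # ADR has bit 31 clear and bits 28:24 equal to 0b10000.
    if (word >> 31) & 1 or (word >> 24) & 0x1F != 0x10:
        raise ValueError("not an ADR instruction")
    immlo = (word >> 29) & 0x3
    immhi = (word >> 5) & 0x7FFFF
    imm = (immhi << 2) | immlo
    if imm & (1 << 20):       # sign-extend the 21-bit immediate
        imm -= 1 << 21
    return word & 0x1F, imm


rd, imm = decode_adr(ADR_WORD)
adr_target = ADR_OFFSET + imm     # address that adr materializes in x9
ldr_slot = 0x23000 + 1968         # slot that the adrp/ldr pair reads from

# Assumption: listing offsets are relative to the module load address.
load_base = ELR - STR_OFFSET
runtime_x9 = load_base + adr_target
fault_addr = (X8 + X9) & MASK64

print(f"adr x{rd}, #{imm} -> listing address {adr_target:#x}")
print(f"adrp/ldr pair reads the slot at {ldr_slot:#x}")
print(f"inferred load base {load_base:#x}")
print(f"x9 reproduced from adr: {runtime_x9 == X9}")
print(f"x8 + x9 == far: {fault_addr == FAR}")
```

Under that assumption, the x9 value in the dump is exactly what the adr
form would produce: an address inside the loaded module, where the
adrp/ldr form would instead have loaded a 64-bit value from a different
slot. Adding x8 to it then gives the address reported in far.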





