Date:      Mon, 26 Jun 2023 01:32:03 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        John F Carr <jfc@mit.edu>
Cc:        Current FreeBSD <freebsd-current@freebsd.org>, freebsd-arm <freebsd-arm@freebsd.org>
Subject:   Re: aarch64 main-n263493-4e8d558c9d1c-dirty (so: 2023-Jun-10) Kyuafile run: "Fatal data abort" crash during vnet_register_sysinit
Message-ID:  <64F18C76-BD2A-4608-A8CC-38AC2820FC12@yahoo.com>
In-Reply-To: <4A380699-7C9E-4E2E-8DCD-F9ECC2112667@yahoo.com>
References:  <3FD359F8-CFCC-400F-B6DE-B635B747DE7F.ref@yahoo.com> <3FD359F8-CFCC-400F-B6DE-B635B747DE7F@yahoo.com> <CB3569D4-8FEE-4DD3-83CE-885789E79E18@mit.edu> <4A380699-7C9E-4E2E-8DCD-F9ECC2112667@yahoo.com>

On Jun 24, 2023, at 17:25, Mark Millard <marklmi@yahoo.com> wrote:

> On Jun 24, 2023, at 14:26, John F Carr <jfc@mit.edu> wrote:
>
>>
>>> On Jun 24, 2023, at 13:00, Mark Millard <marklmi@yahoo.com> wrote:
>>>
>>> The running system build is a non-debug build (but
>>> with symbols not stripped).
>>>
>>> The HoneyComb's console log shows:
>>>
>>> . . .
>>> GEOM_STRIPE: Device stripe.IMfBZr destroyed.
>>> GEOM_NOP: Device md0.nop created.
>>> g_vfs_done():md0.nop[READ(offset=5885952, length=8192)]error = 5
>>> GEOM_NOP: Device md0.nop removed.
>>> GEOM_NOP: Device md0.nop created.
>>> g_vfs_done():md0.nop[READ(offset=5935104, length=4096)]error = 5
>>> g_vfs_done():md0.nop[READ(offset=5935104, length=4096)]error = 5
>>> GEOM_NOP: Device md0.nop removed.
>>> GEOM_NOP: Device md0.nop created.
>>> GEOM_NOP: Device md0.nop removed.
>>> Fatal data abort:
>>> x0: ffffa02506e64400
>>> x1: ffff0001ea401880 (g_raid3_post_sync + 3a145f8)
>>> x2:               4b
>>> x3: a343932b0b22fb30
>>> x4:                0
>>> x5:  3310b0d062d0e1d
>>> x6: 1d0e2d060d0b3103
>>> x7:                0
>>> x8:         ea325df8
>>> x9: ffff0001eec946d0 ($d.6 + 0)
>>> x10: ffff0001ea401880 (g_raid3_post_sync + 3a145f8)
>>> x11:                0
>>> x12:                0
>>> x13: ffff000000cd8960 (lock_class_mtx_sleep + 0)
>>> x14:                0
>>> x15: ffffa02506e64405
>>> x16: ffff0001eec94860 (_DYNAMIC + 160)
>>> x17: ffff00000063a450 (ifc_attach_cloner + 0)
>>> x18: ffff0001eb290400 (g_raid3_post_sync + 48a3178)
>>> x19: ffff0001eec94600 (vnet_epair_init_vnet_init + 0)
>>> x20: ffff000000fa5b68 (vnet_sysinit_sxlock + 18)
>>> x21: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>> x22: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>> x23: ffffa0000042e500
>>> x24: ffffa0000042e500
>>> x25: ffff000000ce0788 (linker_lookup_set_desc + 0)
>>> x26: ffffa0203cdef780
>>> x27: ffff0001eec94698 (__set_sysinit_set_sym_if_epairmodule_sys_init + 0)
>>> x28: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>> x29: ffff0001eb290430 (g_raid3_post_sync + 48a31a8)
>>> sp: ffff0001eb290400
>>> lr: ffff0001eec82a4c ($x.1 + 3c)
>>> elr: ffff0001eec82a60 ($x.1 + 50)
>>> spsr:         60000045
>>> far: ffff0002d8fba4c8
>>> esr:         96000046
>>> panic: vm_fault failed: ffff0001eec82a60 error 1
>>> cpuid = 14
>>> time = 1687625470
>>> KDB: stack backtrace:
>>> db_trace_self() at db_trace_self
>>> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
>>> vpanic() at vpanic+0x13c
>>> panic() at panic+0x44
>>> data_abort() at data_abort+0x2fc
>>> handle_el1h_sync() at handle_el1h_sync+0x14
>>> --- exception, esr 0x96000046
>>> $x.1() at $x.1+0x50
>>> vnet_register_sysinit() at vnet_register_sysinit+0x114
>>> linker_load_module() at linker_load_module+0xae4
>>> kern_kldload() at kern_kldload+0xfc
>>> sys_kldload() at sys_kldload+0x60
>>> do_el0_sync() at do_el0_sync+0x608
>>> handle_el0_sync() at handle_el0_sync+0x44
>>> --- exception, esr 0x56000000
>>> KDB: enter: panic
>>> [ thread pid 70419 tid 101003 ]
>>> Stopped at      kdb_enter+0x44: str     xzr, [x19, #3200]
>>> db>
>>
>> The failure appears to be initializing module if_epair.
>
> Yep: trying:
>
> # kldload if_epair.ko
>
> was enough to cause the crash. (Just a HoneyComb context at
> that point.)
>
> I tried media dd'd from the recent main snapshot, booting the
> same system. No crash. I moved my build boot media to some
> other systems and tested them: crashes. I tried my boot media
> built optimized for Cortex-A53 or Cortex-X1C/Cortex-A78C
> instead of Cortex-A72: no crashes. (But only one system can
> use the X1C/A78C code in that build.)
>
> So variation testing only gets the crashes for my builds
> that are code-optimized for Cortex-A72. The same source
> tree vintage built for Cortex-A53 or Cortex-X1C/Cortex-A78C
> optimization does not get the crashes. But I also
> demonstrated an optimized for Cortex-A72 build from 2023-Mar
> that gets the crash.
>
> The last time I ran into one of these "crashes tied to
> cortex-a72 code optimization" examples it turned out to be
> some missing memory-model management code in FreeBSD's USB
> code. But being lucky enough to help identify a FreeBSD
> source code problem again seems not that likely. It could
> easily be a code generation error by clang for all I know.
>
> So, unless at some point I produce fairly solid evidence
> that the code actually running is messed up by FreeBSD
> source code, this should likely be treated as "blame the
> operator" and should likely be largely ignored as things
> are. (Just My Problem, as I want the Cortex-A72 optimized
> builds.)

Turns out that the source code in question is the
assignment to V_epair_cloner below:

static void
vnet_epair_init(const void *unused __unused)
{
        struct if_clone_addreq req = {
                .match_f = epair_clone_match,
                .create_f = epair_clone_create,
                .destroy_f = epair_clone_destroy,
        };
        V_epair_cloner = ifc_attach_cloner(epairname, &req);
}
VNET_SYSINIT(vnet_epair_init, SI_SUB_PSEUDO, SI_ORDER_ANY,
    vnet_epair_init, NULL);

Example code when not optimizing for the Cortex-A72:

   11a4c: d0000089      adrp    x9, 0x23000
   11a50: f9400248      ldr     x8, [x18]
   11a54: f942c508      ldr     x8, [x8, #1416]
   11a58: f943d929      ldr     x9, [x9, #1968]
   11a5c: a9437bfd      ldp     x29, x30, [sp, #48]
   11a60: f9401508      ldr     x8, [x8, #40]
   11a64: f8296900      str     x0, [x8, x9]

The code when optimizing for the Cortex-A72:

   11a4c: f9400248      ldr     x8, [x18]
   11a50: f942c508      ldr     x8, [x8, #1416]
   11a54: d503201f      nop
   11a58: 1008e3c9      adr     x9, #72824
   11a5c: f9401508      ldr     x8, [x8, #40]
   11a60: f8296900      str     x0, [x8, x9]
   11a64: a9437bfd      ldp     x29, x30, [sp, #48]

It is the "str x0, [x8, x9]" that faults (the "vm_fault
failed" panic) in the Cortex-A72 optimized code.

So:

   11a4c: d0000089      adrp    x9, 0x23000
   11a58: f943d929      ldr     x9, [x9, #1968]

was optimized via replacement by:

   11a58: 1008e3c9      adr     x9, #72824

I.e., the optimization treats the offset from the
instruction as fixed and uses it to produce the value
in x9, even if the instruction is relocated.
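
As a cross-check of the two listings, the immediates can be
decoded by hand from the instruction words. The following is a
minimal sketch in C, assuming the standard A64 ADR/ADRP encoding
(immlo in bits 30:29, immhi in bits 23:5). The function names are
made up for illustration and are not from FreeBSD or clang:

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low `bits` bits of v (illustrative helper). */
int64_t sign_extend(uint64_t v, unsigned bits)
{
        assert(bits > 0 && bits < 64);
        uint64_t m = 1ULL << (bits - 1);
        v &= (1ULL << bits) - 1;
        return (int64_t)((v ^ m) - m);
}

/* 21-bit signed immediate (immhi:immlo) of an A64 ADR or ADRP word. */
int64_t adr_imm21(uint32_t insn)
{
        uint64_t immlo = (insn >> 29) & 0x3;
        uint64_t immhi = (insn >> 5) & 0x7ffff;
        return sign_extend((immhi << 2) | immlo, 21);
}

/* ADR: target is the instruction's own address plus the byte offset. */
uint64_t adr_target(uint64_t pc, uint32_t insn)
{
        return pc + (uint64_t)adr_imm21(insn);
}

/* ADRP: target is the 4 KiB page of the instruction plus imm pages. */
uint64_t adrp_target(uint64_t pc, uint32_t insn)
{
        return (pc & ~0xfffULL) + ((uint64_t)adr_imm21(insn) << 12);
}
```

With the words from the listings, adr_target(0x11a58, 0x1008e3c9)
gives 0x236d0, while adrp_target(0x11a4c, 0xd0000089) gives
0x23000, so the unoptimized sequence loads x9 from
0x23000 + 1968 = 0x237b0. That is, the original code loads a
value from memory and the replacement computes an address.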

This resulted in the specific x9 value shown in
the x8/x9 pair:

 x8:         ea325df8
 x9: ffff0001eec946d0

which sum to the fault address (the value
in far):

far: ffff0002d8fba4c8
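
The arithmetic can be checked mechanically. A minimal sketch in C
(the helper names are hypothetical). The load-base step assumes
the disassembly addresses are offsets from where if_epair.ko was
loaded, which the elr value (ending in a60, like the str at 11a60)
suggests but does not prove:

```c
#include <assert.h>
#include <stdint.h>

/* Address used by "str x0, [x8, x9]": base plus index, no shift. */
uint64_t str_reg_offset_addr(uint64_t base, uint64_t index)
{
        return base + index;
}

/*
 * Load base inferred from a runtime PC and the address of the same
 * instruction in the disassembly (assumption: a simple fixed bias).
 */
uint64_t load_base(uint64_t runtime_pc, uint64_t file_addr)
{
        assert(runtime_pc >= file_addr);
        return runtime_pc - file_addr;
}
```

str_reg_offset_addr(0xea325df8, 0xffff0001eec946d0) gives
0xffff0002d8fba4c8, matching far. load_base(0xffff0001eec82a60,
0x11a60) gives 0xffff0001eec71000, and adding the adr target
0x11a58 + 72824 = 0x236d0 to that gives 0xffff0001eec946d0, the
x9 value in the register dump.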


> Sorry for the noise.
>
>> I see no recent changes in that module that would be likely to break initialization.
>>
>> a9bfd080d09a if_epair: do not transmit packets that exceed the interface MTU
>> 4d846d260e2b spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD
>> a6b55ee6be15 net: replace IFF_KNOWSEPOCH with IFF_NEEDSEPOCH
>> c69ae8419734 if_epair: also remove vlan metadata from mbufs
>> 29c9b1673305 epair: Remove unneeded includes and sort some of the rest
>
> My kyua run examples included a Cortex-A72 optimized system build
> from 2023-Mar. It also crashes. It looks like my last kyua
> runs were back in 2022-Jan or so, associated with some ASAN and
> UBSAN experiments --and so would have been on amd64, not aarch64.
> Otherwise any aarch64 ones would be even older. I've no useful
> narrowing of the potential time frame for the problem starting.



===
Mark Millard
marklmi at yahoo.com



