Date:      Sat, 24 Jun 2023 17:25:35 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        John F Carr <jfc@mit.edu>
Cc:        Current FreeBSD <freebsd-current@freebsd.org>, freebsd-arm <freebsd-arm@freebsd.org>
Subject:   Re: aarch64 main-n263493-4e8d558c9d1c-dirty (so: 2023-Jun-10) Kyuafile run: "Fatal data abort" crash during vnet_register_sysinit
Message-ID:  <4A380699-7C9E-4E2E-8DCD-F9ECC2112667@yahoo.com>
In-Reply-To: <CB3569D4-8FEE-4DD3-83CE-885789E79E18@mit.edu>
References:  <3FD359F8-CFCC-400F-B6DE-B635B747DE7F.ref@yahoo.com> <3FD359F8-CFCC-400F-B6DE-B635B747DE7F@yahoo.com> <CB3569D4-8FEE-4DD3-83CE-885789E79E18@mit.edu>

On Jun 24, 2023, at 14:26, John F Carr <jfc@mit.edu> wrote:

>
>> On Jun 24, 2023, at 13:00, Mark Millard <marklmi@yahoo.com> wrote:
>>
>> The running system build is a non-debug build (but
>> with symbols not stripped).
>>
>> The HoneyComb's console log shows:
>>
>> . . .
>> GEOM_STRIPE: Device stripe.IMfBZr destroyed.
>> GEOM_NOP: Device md0.nop created.
>> g_vfs_done():md0.nop[READ(offset=5885952, length=8192)]error = 5
>> GEOM_NOP: Device md0.nop removed.
>> GEOM_NOP: Device md0.nop created.
>> g_vfs_done():md0.nop[READ(offset=5935104, length=4096)]error = 5
>> g_vfs_done():md0.nop[READ(offset=5935104, length=4096)]error = 5
>> GEOM_NOP: Device md0.nop removed.
>> GEOM_NOP: Device md0.nop created.
>> GEOM_NOP: Device md0.nop removed.
>> Fatal data abort:
>> x0: ffffa02506e64400
>> x1: ffff0001ea401880 (g_raid3_post_sync + 3a145f8)
>> x2:               4b
>> x3: a343932b0b22fb30
>> x4:                0
>> x5:  3310b0d062d0e1d
>> x6: 1d0e2d060d0b3103
>> x7:                0
>> x8:         ea325df8
>> x9: ffff0001eec946d0 ($d.6 + 0)
>> x10: ffff0001ea401880 (g_raid3_post_sync + 3a145f8)
>> x11:                0
>> x12:                0
>> x13: ffff000000cd8960 (lock_class_mtx_sleep + 0)
>> x14:                0
>> x15: ffffa02506e64405
>> x16: ffff0001eec94860 (_DYNAMIC + 160)
>> x17: ffff00000063a450 (ifc_attach_cloner + 0)
>> x18: ffff0001eb290400 (g_raid3_post_sync + 48a3178)
>> x19: ffff0001eec94600 (vnet_epair_init_vnet_init + 0)
>> x20: ffff000000fa5b68 (vnet_sysinit_sxlock + 18)
>> x21: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>> x22: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>> x23: ffffa0000042e500
>> x24: ffffa0000042e500
>> x25: ffff000000ce0788 (linker_lookup_set_desc + 0)
>> x26: ffffa0203cdef780
>> x27: ffff0001eec94698 (__set_sysinit_set_sym_if_epairmodule_sys_init + 0)
>> x28: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>> x29: ffff0001eb290430 (g_raid3_post_sync + 48a31a8)
>> sp: ffff0001eb290400
>> lr: ffff0001eec82a4c ($x.1 + 3c)
>> elr: ffff0001eec82a60 ($x.1 + 50)
>> spsr:         60000045
>> far: ffff0002d8fba4c8
>> esr:         96000046
>> panic: vm_fault failed: ffff0001eec82a60 error 1
>> cpuid = 14
>> time = 1687625470
>> KDB: stack backtrace:
>> db_trace_self() at db_trace_self
>> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
>> vpanic() at vpanic+0x13c
>> panic() at panic+0x44
>> data_abort() at data_abort+0x2fc
>> handle_el1h_sync() at handle_el1h_sync+0x14
>> --- exception, esr 0x96000046
>> $x.1() at $x.1+0x50
>> vnet_register_sysinit() at vnet_register_sysinit+0x114
>> linker_load_module() at linker_load_module+0xae4
>> kern_kldload() at kern_kldload+0xfc
>> sys_kldload() at sys_kldload+0x60
>> do_el0_sync() at do_el0_sync+0x608
>> handle_el0_sync() at handle_el0_sync+0x44
>> --- exception, esr 0x56000000
>> KDB: enter: panic
>> [ thread pid 70419 tid 101003 ]
>> Stopped at      kdb_enter+0x44: str     xzr, [x19, #3200]
>> db>
>
> The failure appears to be initializing module if_epair.

Yep: trying:

# kldload if_epair.ko

was enough to cause the crash. (I had only tried it on the
HoneyComb at that point.)
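
For reference, the esr and far values in the quoted dump can be
decoded by hand. The following is only a minimal sketch in Python,
not a FreeBSD tool. The field layout (EC in bits 31:26, IL in bit 25,
and for data aborts WnR in ISS bit 6 and DFSC in ISS bits 5:0) is
taken from the Arm architecture's ESR_EL1 definition. The lookup
tables list just a few encodings, and the function name
decode_data_abort_esr is made up for illustration.

```python
# Values copied from the quoted register dump.
ESR = 0x96000046
FAR = 0xFFFF0002D8FBA4C8
X8 = 0x00000000EA325DF8
X9 = 0xFFFF0001EEC946D0

MASK64 = (1 << 64) - 1

# Partial tables: only a handful of encodings, not the full Arm ARM lists.
EC_NAMES = {
    0x24: "Data Abort from a lower Exception level",
    0x25: "Data Abort taken without a change in Exception level",
}
DFSC_NAMES = {
    0x04: "Translation fault, level 0",
    0x05: "Translation fault, level 1",
    0x06: "Translation fault, level 2",
    0x07: "Translation fault, level 3",
    0x0D: "Permission fault, level 1",
    0x0E: "Permission fault, level 2",
    0x0F: "Permission fault, level 3",
}


def decode_data_abort_esr(esr):
    """Split an ESR_EL1 value into the fields relevant to a data abort.

    The WnR and DFSC fields are only meaningful when EC is 0x24 or 0x25.
    """
    ec = (esr >> 26) & 0x3F      # exception class, bits 31:26
    il = (esr >> 25) & 0x1       # instruction length bit, bit 25
    iss = esr & 0x1FFFFFF        # instruction specific syndrome, bits 24:0
    wnr = (iss >> 6) & 0x1       # 1 = fault caused by a write
    dfsc = iss & 0x3F            # data fault status code
    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "not in this partial table"),
        "il": il,
        "write": bool(wnr),
        "dfsc": dfsc,
        "dfsc_name": DFSC_NAMES.get(dfsc, "not in this partial table"),
    }


if __name__ == "__main__":
    fields = decode_data_abort_esr(ESR)
    for key, value in fields.items():
        print(key, value)
    # Check how the faulting address relates to the dumped registers.
    print("far == x9 + x8:", ((X9 + X8) & MASK64) == FAR)
```

For esr 96000046 this gives EC 0x25 (a data abort taken without a
change in exception level, which matches handle_el1h_sync in the
backtrace), a write access, and DFSC 0x06 (translation fault,
level 2). The last check only notes that far equals x9 + x8 in this
dump. It does not by itself say whether source code or code
generation is at fault.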

I tried media dd'd from the recent main snapshot, booting the
same system: no crash. I moved the boot media with my own build
to some other systems and tested them: crashes. I tried my boot
media built optimized for Cortex-A53 or Cortex-X1C/Cortex-A78C
instead of Cortex-A72: no crashes. (But only one system can
use the X1C/A78C code in that build.)

So variation testing only gets the crashes for my builds
that are code-optimized for Cortex-A72. The same source
tree vintage built for Cortex-A53 or Cortex-X1C/Cortex-A78C
optimization does not get the crashes. But I also
demonstrated that a build optimized for Cortex-A72 from
2023-Mar gets the crash.

The last time I ran into one of these "crashes tied to
cortex-a72 code optimization" examples it turned out to be
some missing memory-model management code in FreeBSD's USB
code. But being lucky enough to help identify a FreeBSD
source code problem again seems not that likely. It could
easily be a code generation error by clang for all I know.

So, unless at some point I produce fairly solid evidence
that the code actually running is messed up by FreeBSD
source code, this should likely be treated as "blame the
operator" and should likely be largely ignored as things
are. (Just My Problem, as I want the Cortex-A72 optimized
builds.)

Sorry for the noise.

> I see no recent changes in that module that would be likely to break initialization.
>
> a9bfd080d09a if_epair: do not transmit packets that exceed the interface MTU
> 4d846d260e2b spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD
> a6b55ee6be15 net: replace IFF_KNOWSEPOCH with IFF_NEEDSEPOCH
> c69ae8419734 if_epair: also remove vlan metadata from mbufs
> 29c9b1673305 epair: Remove unneeded includes and sort some of the rest

My kyua run examples included a Cortex-A72 optimized system build
from 2023-Mar. It also crashes. It looks like my last kyua
runs before that were back in 2022-Jan or so, associated with some
ASAN and UBSAN experiments, and so would have been on amd64, not
aarch64. Any aarch64 ones would be even older. So I have no useful
narrowing of the potential time frame for when the problem started.


===
Mark Millard
marklmi at yahoo.com



