Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 24 Jun 2023 14:27:08 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        Current FreeBSD <freebsd-current@freebsd.org>, freebsd-arm <freebsd-arm@freebsd.org>
Subject:   Re: aarch64 main-n263493-4e8d558c9d1c-dirty (so: 2023-Jun-10) Kyuafile run: "Fatal data abort" crash during vnet_register_sysinit
Message-ID:  <71F8A824-24DE-46F0-B41C-94AACD4E1D6A@yahoo.com>
In-Reply-To: <2CACE963-7846-475D-B139-D11B551E4A3F@yahoo.com>
References:  <3FD359F8-CFCC-400F-B6DE-B635B747DE7F@yahoo.com> <FAF014A1-88B5-4CAE-8A5C-2C2065528003@yahoo.com> <8E9937A8-1563-49C2-A1B1-150864C09AA0@yahoo.com> <2CACE963-7846-475D-B139-D11B551E4A3F@yahoo.com>

next in thread | previous in thread | raw e-mail | index | archive | help
On Jun 24, 2023, at 13:48, Mark Millard <marklmi@yahoo.com> wrote:

> On Jun 24, 2023, at 12:16, Mark Millard <marklmi@yahoo.com> wrote:
>=20
>> On Jun 24, 2023, at 10:49, Mark Millard <marklmi@yahoo.com> wrote:
>>=20
>>> On Jun 24, 2023, at 10:00, Mark Millard <marklmi@yahoo.com> wrote:
>>>=20
>>>> The running system build is a non-debug build (but
>>>> with symbols not stripped).
>>>>=20
>>>> The HoneyComb's console log shows:
>>>>=20
>>>> . . .
>>>> GEOM_STRIPE: Device stripe.IMfBZr destroyed.
>>>> GEOM_NOP: Device md0.nop created.
>>>> g_vfs_done():md0.nop[READ(offset=3D5885952, length=3D8192)]error =3D =
5
>>>> GEOM_NOP: Device md0.nop removed.
>>>> GEOM_NOP: Device md0.nop created.
>>>> g_vfs_done():md0.nop[READ(offset=3D5935104, length=3D4096)]error =3D =
5
>>>> g_vfs_done():md0.nop[READ(offset=3D5935104, length=3D4096)]error =3D =
5
>>>> GEOM_NOP: Device md0.nop removed.
>>>> GEOM_NOP: Device md0.nop created.
>>>> GEOM_NOP: Device md0.nop removed.
>>>> Fatal data abort:
>>>> x0: ffffa02506e64400
>>>> x1: ffff0001ea401880 (g_raid3_post_sync + 3a145f8)
>>>> x2:               4b
>>>> x3: a343932b0b22fb30
>>>> x4:                0
>>>> x5:  3310b0d062d0e1d
>>>> x6: 1d0e2d060d0b3103
>>>> x7:                0
>>>> x8:         ea325df8
>>>> x9: ffff0001eec946d0 ($d.6 + 0)
>>>> x10: ffff0001ea401880 (g_raid3_post_sync + 3a145f8)
>>>> x11:                0
>>>> x12:                0
>>>> x13: ffff000000cd8960 (lock_class_mtx_sleep + 0)
>>>> x14:                0
>>>> x15: ffffa02506e64405
>>>> x16: ffff0001eec94860 (_DYNAMIC + 160)
>>>> x17: ffff00000063a450 (ifc_attach_cloner + 0)
>>>> x18: ffff0001eb290400 (g_raid3_post_sync + 48a3178)
>>>> x19: ffff0001eec94600 (vnet_epair_init_vnet_init + 0)
>>>> x20: ffff000000fa5b68 (vnet_sysinit_sxlock + 18)
>>>> x21: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>>> x22: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>>> x23: ffffa0000042e500
>>>> x24: ffffa0000042e500
>>>> x25: ffff000000ce0788 (linker_lookup_set_desc + 0)
>>>> x26: ffffa0203cdef780
>>>> x27: ffff0001eec94698 =
(__set_sysinit_set_sym_if_epairmodule_sys_init + 0)
>>>> x28: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>>> x29: ffff0001eb290430 (g_raid3_post_sync + 48a31a8)
>>>> sp: ffff0001eb290400
>>>> lr: ffff0001eec82a4c ($x.1 + 3c)
>>>> elr: ffff0001eec82a60 ($x.1 + 50)
>>>> spsr:         60000045
>>>> far: ffff0002d8fba4c8
>>>> esr:         96000046
>>>> panic: vm_fault failed: ffff0001eec82a60 error 1
>>>> cpuid =3D 14
>>>> time =3D 1687625470
>>>> KDB: stack backtrace:
>>>> db_trace_self() at db_trace_self
>>>> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
>>>> vpanic() at vpanic+0x13c
>>>> panic() at panic+0x44
>>>> data_abort() at data_abort+0x2fc
>>>> handle_el1h_sync() at handle_el1h_sync+0x14
>>>> --- exception, esr 0x96000046
>>>> $x.1() at $x.1+0x50
>>>> vnet_register_sysinit() at vnet_register_sysinit+0x114
>>>> linker_load_module() at linker_load_module+0xae4
>>>> kern_kldload() at kern_kldload+0xfc
>>>> sys_kldload() at sys_kldload+0x60
>>>> do_el0_sync() at do_el0_sync+0x608
>>>> handle_el0_sync() at handle_el0_sync+0x44
>>>> --- exception, esr 0x56000000
>>>> KDB: enter: panic
>>>> [ thread pid 70419 tid 101003 ]
>>>> Stopped at      kdb_enter+0x44: str     xzr, [x19, #3200]
>>>> db>=20
>>>>=20
>>>> I'll see if a re-run is repeatable.
>>>>=20
>>>=20
>>> It repeats:
>>>=20
>>> GEOM_STRIPE: Device stripe/stripe.VkbPk1 deactivated.
>>> GEOM_STRIPE: Disk md1 removed from stripe.VkbPk1.
>>> GEOM_STRIPE: Disk md0 removed from stripe.VkbPk1.
>>> GEOM_STRIPE: Device stripe.VkbPk1 destroyed.
>>> GEOM_NOP: Device md0.nop created.
>>> g_vfs_done():md0.nop[READ(offset=3D5885952, length=3D8192)]error =3D =
5
>>> GEOM_NOP: Device md0.nop removed.
>>> GEOM_NOP: Device md0.nop created.
>>> g_vfs_done():md0.nop[READ(offset=3D5935104, length=3D4096)]error =3D =
5
>>> g_vfs_done():md0.nop[READ(offset=3D5935104, length=3D4096)]error =3D =
5
>>> GEOM_NOP: Device md0.nop removed.
>>> GEOM_NOP: Device md0.nop created.
>>> GEOM_NOP: Device md0.nop removed.
>>> Fatal data abort:
>>> x0: ffffa0003b1a9500
>>> x1: ffff00021b530260
>>> x2:               4b
>>> x3: a343932b0b22fb30
>>> x4:                0
>>> x5:  3310b0d062d0e1d
>>> x6: 1d0e2d060d0b3103
>>> x7:                0
>>> x8:         ea325df8
>>> x9: ffff00021d6946d0 ($d.6 + 0)
>>> x10: ffff00021b530260
>>> x11:                0
>>> x12:                0
>>> x13: ffff000000cd8960 (lock_class_mtx_sleep + 0)
>>> x14:                0
>>> x15: ffffa0003b1a9505
>>> x16: ffff00021d694860 (_DYNAMIC + 160)
>>> x17: ffff00000063a450 (ifc_attach_cloner + 0)
>>> x18: ffff00021a6ea400
>>> x19: ffff00021d694600 (vnet_epair_init_vnet_init + 0)
>>> x20: ffff000000fa5b68 (vnet_sysinit_sxlock + 18)
>>> x21: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>> x22: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>> x23: ffffa00000431500
>>> x24: ffffa00000431500
>>> x25: ffff000000ce0788 (linker_lookup_set_desc + 0)
>>> x26: ffffa02e1ab6d180
>>> x27: ffff00021d694698 (__set_sysinit_set_sym_if_epairmodule_sys_init =
+ 0)
>>> x28: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>> x29: ffff00021a6ea430
>>> sp: ffff00021a6ea400
>>> lr: ffff00021d682a4c ($x.1 + 3c)
>>> elr: ffff00021d682a60 ($x.1 + 50)
>>> spsr:         60000045
>>> far: ffff0003079ba4c8
>>> esr:         96000046
>>> panic: vm_fault failed: ffff00021d682a60 error 1
>>> cpuid =3D 1
>>> time =3D 1687628622
>>> KDB: stack backtrace:
>>> db_trace_self() at db_trace_self
>>> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
>>> vpanic() at vpanic+0x13c
>>> panic() at panic+0x44
>>> data_abort() at data_abort+0x2fc
>>> handle_el1h_sync() at handle_el1h_sync+0x14
>>> --- exception, esr 0x96000046
>>> $x.1() at $x.1+0x50
>>> vnet_register_sysinit() at vnet_register_sysinit+0x114
>>> linker_load_module() at linker_load_module+0xae4
>>> kern_kldload() at kern_kldload+0xfc
>>> sys_kldload() at sys_kldload+0x60
>>> do_el0_sync() at do_el0_sync+0x608
>>> handle_el0_sync() at handle_el0_sync+0x44
>>> --- exception, esr 0x56000000
>>> KDB: enter: panic
>>> [ thread pid 36377 tid 100985 ]
>>> Stopped at      kdb_enter+0x44: str     xzr, [x19, #3200]
>>> db>=20
>>>=20
>>>=20
>>> For reference, the output of the run in the ssh
>>> session ends with:
>>>=20
>>> . . .
>>> sys/kqueue/libkqueue/kqueue_test:main  ->  passed  [48.258s]
>>> sys/mac/bsdextended/ugidfw_test:main  ->  skipped: mac_bsdextended =
not loaded  [0.006s]
>>> sys/mac/portacl/nobody_test:main  ->  skipped: MAC_PORTACL is =
unavailable.  [0.010s]
>>> sys/mac/portacl/root_test:main  ->  skipped: MAC_PORTACL is =
unavailable.  [0.010s]
>>> sys/mqueue/mqueue_test:mqtest1  ->  passed  [0.025s]
>>> sys/mqueue/mqueue_test:mqtest2  ->  passed  [0.025s]
>>> sys/mqueue/mqueue_test:mqtest5  ->  passed  [0.025s]
>>> sys/net/if_ovpn/if_ovpn_c:tcp  ->  skipped: if_ovpn not loaded  =
[0.006s]
>>> sys/netinet/arp:arp_add_success  -> =20
>>>=20
>>> That should give some extra information about the context
>>> of failure.
>>=20
>> So I installed, booted, and tried my debug build. It failed
>> the same way in the same place, with no extra console
>> reporting for the crash by the debug code: no assertion
>> failures or WITNESS reports or the like first.
>=20
> I tried doing just:
>=20
> # kyua test -k /usr/tests/Kyuafile sys/netinet/arp
>=20
> and it crashed the same way at the same place. The prior
> kyua activity in other tests does not need to be involved
> to get the crash.
>=20
> For now I've touched the /usr/tests/sys/netinet/Kyuafile
> to comment out the arp test line. There does not seem to
> be a supported way to indicate to skip just one or a few
> specific tests. So I'm touching a do-not-touch file
> instead.
>=20
> We will see how far it gets when skipping sys/netinet/arp .


Same sort of crash:

sys/netinet/fibs:fibs_ifroutes1_success  -> =20

got:

GEOM_NOP: Device md0.nop created.
g_vfs_done():md0.nop[READ(offset=3D5885952, length=3D8192)]error =3D 5
GEOM_NOP: Device md0.nop removed.
GEOM_NOP: Device md0.nop created.
g_vfs_done():md0.nop[READ(offset=3D5935104, length=3D4096)]error =3D 5
g_vfs_done():md0.nop[READ(offset=3D5935104, length=3D4096)]error =3D 5
GEOM_NOP: Device md0.nop removed.
GEOM_NOP: Device md0.nop created.
GEOM_NOP: Device md0.nop removed.
Fatal data abort:
  x0: ffffa025d2340300
  x1: ffff0001daf54000
  x2: ffff00000096349b (cam_status_table + 1cf4b)
  x3:              113
  x4:                0
  x5: 3a1d1c0900301419
  x6: 19143000091c1d3a
  x7: 656874655f676e00
  x8:         3f8b6760
  x9: ffff0002224a5590 ($d.6 + 0)
 x10:                0
 x11: ffff000000f41d48 (w_locklistdata + 2ca00)
 x12:                3
 x13:                2
 x14:            10000
 x15:                1
 x16:            10000
 x17: ffff0000005ba750 (ifc_attach_cloner + 0)
 x18: ffff0001d8fd73e0
 x19: ffff0002224a54c0 (vnet_epair_init_vnet_init + 0)
 x20: ffffa000003ea600
 x21: ffffa000003ea600
 x22:         5e4a6f28
 x23: ffffa0001c510f00
 x24: ffff000000a08fba (notify.prefix + 18b88)
 x25: ffff000000c53908 (linker_lookup_set_desc + 0)
 x26: ffffa02df9378c00
 x27: ffff000000d02000 (sdta_vfs_vop_vop_spare1_entry0 + 18)
 x28: ffff0001d8fd7788
 x29: ffff0001d8fd7410
  sp: ffff0001d8fd73e0
  lr: ffff000222493b8c ($x.1 + 3c)
 elr: ffff000222493ba0 ($x.1 + 50)
spsr:         20000045
 far: ffff000261d5bcf0 (next_index + 3f8b5c10)
 esr:         96000046
panic: vm_fault failed: ffff000222493ba0 error 1
cpuid =3D 15
time =3D 1687640644
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x13c
panic() at panic+0x44
data_abort() at data_abort+0x354
handle_el1h_sync() at handle_el1h_sync+0x14
--- exception, esr 0x96000046
$x.1() at $x.1+0x50
vnet_register_sysinit() at vnet_register_sysinit+0x114
linker_load_module() at linker_load_module+0xa9c
kern_kldload() at kern_kldload+0x18c
sys_kldload() at sys_kldload+0x60
do_el0_sync() at do_el0_sync+0x590
handle_el0_sync() at handle_el0_sync+0x44
--- exception, esr 0x56000000
KDB: enter: panic
[ thread pid 37224 tid 100647 ]
Stopped at      kdb_enter+0x44: str     xzr, [x19, #3328]
db>=20

Disabling fibs and trying gets the same type of crash for:

lpm:lpm_test1_success  -> =20

(I'll not quote the backtrace related text.)

Disabling lpm in turn allowed sys/netinet/ to finish
without a system crash.

So I now have arp, fibs, and lpm commented out in
/usr/tests/sys/netinet/Kyuafile :

-- atf_test_program{name=3D"arp", }
. . .
-- atf_test_program{name=3D"fibs", }
. . .
-- atf_test_program{name=3D"lpm", }

We will see how far after sys/netinet/ /usr/tests/Kyuafile
gets with that in place.

=3D=3D=3D
Mark Millard
marklmi at yahoo.com




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?71F8A824-24DE-46F0-B41C-94AACD4E1D6A>