Date:      Sun, 15 Dec 2024 22:01:19 +0000
From:      bugzilla-noreply@freebsd.org
To:        bugs@FreeBSD.org
Subject:   [Bug 267028] kernel panics when booting with both (zfs.ko or vboxnetflt.ko or acpi_wmi.ko) and amdgpu.ko
Message-ID:  <bug-267028-227-6araGZRz9I@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-267028-227@https.bugs.freebsd.org/bugzilla/>
References:  <bug-267028-227@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=267028

--- Comment #235 from Mark Millard <marklmi26-fbsd@yahoo.com> ---
For the 3-node sequence (the last one partially good, and
then just junk):

$208 = {link = {tqe_next = 0xfffff80004607a00, tqe_prev = 0xfffff8000465bc80},
container = 0xfffff80003868c00, name = 0xffffffff82e1e000
<xgpu_fiji_mgcg_cgcg_init+368> "amdgpu_raven_mec_bin_fw",
  version = 1}
$209 = {link = {tqe_next = 0xfffff80000000007, tqe_prev = 0xfffff8000465bbc0},
container = 0xfffff80004b29600, name = 0xffffffff82e62026 <se_mask+242>
"amdgpu_raven_mec2_bin_fw", version = 1}
$210 = {link = {tqe_next = 0xeef3f000e2c3f0, tqe_prev = 0xff54f000eef3f0},
container = 0x322ff0003287f0, name = 0xe987f000fea5f0 <error: Cannot access
memory at address 0xe987f000fea5f0>,
  version = 15660016}

it looks like:

$209 = {link = {tqe_next = 0xfffff80000000007,

is the earliest evidence of corruption. The address lies
outside of (is a smaller address than) the kernel
start:

Local exec file:
        `/usr/home/root/failing-kernel-files/boot/kernel/kernel', file type
elf64-x86-64-freebsd.
        Entry point: 0xffffffff8038e000
        0xffffffff802002a8 - 0xffffffff802002b5 is .interp

The trailing 0000000007 also looks odd.

However, the rest of that node:

tqe_prev = 0xfffff8000465bbc0}, container = 0xfffff80004b29600, name =
0xffffffff82e62026 <se_mask+242> "amdgpu_raven_mec2_bin_fw", version = 1}

does not appear to have any obvious problems with its content. The
container's contents are shown as:

$214 = {ops = 0xfffff80003164000, refs = 1, userrefs = 0, flags = 1, link =
{tqe_next = 0xfffff8000469ed80, tqe_prev = 0xfffff80003868c18}, filename =
0xfffff80004b22120 "amdgpu_raven_mec2_bin.ko",
  pathname = 0xfffff80004607a40 "/boot/modules/amdgpu_raven_mec2_bin.ko", id =
20, address = 0xffffffff82e61000 <link_enc_regs+1520> "\203\376\001tL\270\026",
size = 276456, ctors_addr = 0x0,
  ctors_size = 0, dtors_addr = 0x0, dtors_size = 0, ndeps = 3, deps =
0xfffff80004b220e0, common = {stqh_first = 0x0, stqh_last =
0xfffff80004b29680}, modules = {tqh_first = 0xfffff80004b1ff00,
    tqh_last = 0xfffff80004b1ff10}, loaded = {tqe_next = 0x0, tqe_prev = 0x0},
loadcnt = 20, nenabled = 0, fbt_nentries = 0}

which also seems free of obvious problems.

This type of vmcore.* does not provide threads, stack content, or
backtrace information. Nor is there any indication of exactly when
tqe_next = 0xfffff80000000007 came to be.

It is also not obvious whether the list was longer before the
0xfffff80000000007 appeared.

There does not seem to be a way to tell whether the corrupted value stems
from "raven"-specific code or from more general code. It would be
interesting to know whether an alternate card type has the problem or not.

As for the raven context, getting vmcore.* captures from failures at a
different stage, such as the failure that mentioned acpi_wmi but did
not produce a vmcore.* , would help indicate whether the point of
corruption in the list moves around (relative to other content).

-- 
You are receiving this mail because:
You are the assignee for the bug.


