Date:      Tue, 04 Jan 2022 22:52:02 +0000
From:      bugzilla-noreply@freebsd.org
To:        bugs@FreeBSD.org
Subject:   [Bug 253461] [AMD/ATI] RV730 PRO [Radeon HD 4650] panic kernel
Message-ID:  <bug-253461-227-ti6GyUovpe@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-253461-227@https.bugs.freebsd.org/bugzilla/>
References:  <bug-253461-227@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=253461

Bill Paul <noisetube@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |noisetube@gmail.com

--- Comment #3 from Bill Paul <noisetube@gmail.com> ---
I believe I have a fix for this bug. It is a problem with the linuxkpi code in
the FreeBSDDesktop-kms-drm-4.16.g20201016-8843e1fc5_GH0.tar.gz distribution.

Notes:

- This problem has been there for some time. I've had it happen in FreeBSD
12.2-RELEASE and FreeBSD 12.3-RELEASE.

- It's not confined to a single Radeon card. I've observed the problem with
the following hardware on different machines:

vgapci0@pci0:1:0:0:     class=0x030000 card=0x21261028 chip=0x68f91002 rev=0x00 hdr=0x00
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Cedar [Radeon HD 5000/6000/7350/8350 Series]'
    class      = display
    subclass   = VGA

vgapci0@pci0:0:1:0:     class=0x030000 card=0x168b103c chip=0x96481002 rev=0x00 hdr=0x00
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Sumo [Radeon HD 6480G]'
    class      = display
    subclass   = VGA

vgapci1@pci0:131:0:0:   class=0x030000 card=0x90b8103c chip=0x67711002 rev=0x00 hdr=0x00
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Caicos XTX [Radeon HD 8490 / R5 235X OEM]'
    class      = display
    subclass   = VGA

(Note that the Sumo device is built into a laptop, an HP ProBook 4535S.)

- This problem has been reported by others. PR 237544 is a duplicate. The
panics I experienced had the same stack traces as shown in both PRs.

- PR 237544 provides an important hint that this crash did _not_ happen with
the drm-fbsd11.2-kmod port/package. Although it has been deprecated, I was
able to build and install the drm-fbsd11.2-kmod code on my FreeBSD
12.3-RELEASE system (the laptop) and the crashes went away.

- In my case, the panics were more likely to occur when the system was under
load. The laptop seemed to trigger it more frequently (which actually made it
easier to track it down).

I tried to track the problem down by comparing the drm-fbsd11.2-kmod and
drm-fbsd12.0-kmod code and swapping bits of the 11.2 code into the 12.0 tree
to see what effect that would have. Eventually I traced the problem to the
linuxkpi code, and then to the dma-fence code, and then finally, to this
function in linuxkpi/gplv2/include/linux/dma-fence.h:

static inline void
dma_fence_signal_locked_sub(struct dma_fence *fence)
{
        struct dma_fence_cb *cur;

        while ((cur = list_first_entry_or_null(&fence->cb_list,
                    struct dma_fence_cb, node)) != NULL) {
                list_del_init(&cur->node);
                spin_unlock(fence->lock);   /* <-- No! */
                cur->func(fence, cur);
                spin_lock(fence->lock);     /* <-- No! */
        }
}

Note the two lines marked "No!" above.

The dma_fence_signal_locked_sub() routine is shared by both dma_fence_signal()
and dma_fence_signal_locked(). The latter function is intended to be used when
the caller is already holding the fence spinlock. The former takes the
spinlock itself.
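
The locking contract described above can be sketched in user space as
follows. This is only an illustrative model, not the linuxkpi source: the
model_* names and the boolean that stands in for the spinlock are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a fence; not the linuxkpi struct dma_fence. */
struct model_fence {
        bool lock_held;         /* models the fence spinlock */
        int helper_calls;       /* how often the shared helper ran */
};

/* Shared helper: both entry points call it with the lock held. */
static void
model_signal_locked_sub(struct model_fence *f)
{
        assert(f->lock_held);
        f->helper_calls++;
}

/* For callers that already hold the fence lock. */
static void
model_signal_locked(struct model_fence *f)
{
        assert(f->lock_held);
        model_signal_locked_sub(f);
}

/* For callers that do not hold the lock: take it, signal, release it. */
static void
model_signal(struct model_fence *f)
{
        assert(!f->lock_held);
        f->lock_held = true;
        model_signal_locked_sub(f);
        f->lock_held = false;
}
```

In this model the helper can assume the lock is held on entry regardless
of which entry point was used, which is the property the rest of this
comment relies on.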

The problem is that the above code causes the spinlock to be dropped in the
case where dma_fence_signal() is called. This is not the same behavior as the
older 11.2 code: in that case, the lock is held while the callouts are
invoked. (I *think* this is also the case in the later code in FreeBSD 13.) I
believe that dropping the lock before calling the callouts opens a race
condition window and this is what leads to the crash. It's difficult to
ascertain that this is what's happening from the crash stack traces, but in
my analysis I found that at least sometimes the problem was that something
was trying to dereference a NULL DMA fence pointer.

I patched my copy of the code to remove the spin_unlock() and spin_lock()
calls shown above, and that seemed to fix the problem. The laptop has not
crashed since I did this. I also made the same change to the 12.2-RELEASE
system with the "Cedar" card and exercised it a bit, and that one seemed to
run ok too. I have just patched the "Caicos" machine today and so far it's
running stably as well (this is my work machine and this is my first day back
at the office for the new year).
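
The effect of removing the two calls can be illustrated with a small
user-space model. This is a sketch under stated assumptions, not the
attached patch: the model_* names, the singly linked callback list and the
boolean lock flag are all hypothetical. With drop_lock set to true the loop
mimics the unpatched helper; with false it mimics the patched one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_fence;

/* Hypothetical callback node; stands in for struct dma_fence_cb. */
struct model_cb {
        struct model_cb *next;
        void (*func)(struct model_fence *, struct model_cb *);
};

/* Hypothetical fence; the bool stands in for the fence spinlock. */
struct model_fence {
        bool lock_held;
        struct model_cb *cb_list;
        int ran_locked;         /* callbacks that ran with the lock held */
        int ran_unlocked;       /* callbacks that ran without it */
};

/* Example callback: records whether the lock was held when it ran. */
static void
model_count_cb(struct model_fence *f, struct model_cb *cb)
{
        (void)cb;
        if (f->lock_held)
                f->ran_locked++;
        else
                f->ran_unlocked++;
}

/*
 * Callback loop. drop_lock == true releases the lock around each
 * callback (unpatched behavior); false keeps it held (patched behavior).
 */
static void
model_signal_locked_sub(struct model_fence *f, bool drop_lock)
{
        struct model_cb *cur;

        assert(f->lock_held);
        while ((cur = f->cb_list) != NULL) {
                f->cb_list = cur->next; /* unlink before invoking */
                cur->next = NULL;
                if (drop_lock)
                        f->lock_held = false;
                cur->func(f, cur);
                if (drop_lock)
                        f->lock_held = true;
        }
}
```

In the patched mode every callback observes the lock as held, so nothing
else that takes the same lock can modify the fence while a callback is
running. The model is single-threaded and does not reproduce the race
itself.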

I created a version of the drm-fbsd12.0-kmod port with this change included
as a patch, which can be downloaded from here:

http://people.freebsd.org/~wpaul/radeon/drm-fbsd12.0-kmod.tar.gz

I will also attach the patch to this PR.

Can someone please test this to see if it fixes the problem for them too?

Note: I happen to have about 3 or 4 extra Radeon cards as spares (I rescued
these from the e-waste bin) and would be happy to send one to a developer if
that would help (assuming they have a machine with a slot that can
accommodate it).

-- 
You are receiving this mail because:
You are the assignee for the bug.


