Date: Tue, 04 Jan 2022 22:52:02 +0000
From: bugzilla-noreply@freebsd.org
To: bugs@FreeBSD.org
Subject: [Bug 253461] [AMD/ATI] RV730 PRO [Radeon HD 4650] panic kernel
Message-ID: <bug-253461-227-ti6GyUovpe@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-253461-227@https.bugs.freebsd.org/bugzilla/>
References: <bug-253461-227@https.bugs.freebsd.org/bugzilla/>
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=253461

Bill Paul <noisetube@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |noisetube@gmail.com

--- Comment #3 from Bill Paul <noisetube@gmail.com> ---
I believe I have a fix for this bug. It is a problem with the linuxkpi
code in the FreeBSDDesktop-kms-drm-4.16.g20201016-8843e1fc5_GH0.tar.gz
distribution.

Notes:

- This problem has been there for some time. I've had it happen in
  FreeBSD 12.2-RELEASE and FreeBSD 12.3-RELEASE.

- It's not confined to a single Radeon card. I've observed the problem
  with the following hardware on different machines:

vgapci0@pci0:1:0:0:     class=0x030000 card=0x21261028 chip=0x68f91002 rev=0x00 hdr=0x00
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Cedar [Radeon HD 5000/6000/7350/8350 Series]'
    class      = display
    subclass   = VGA

vgapci0@pci0:0:1:0:     class=0x030000 card=0x168b103c chip=0x96481002 rev=0x00 hdr=0x00
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Sumo [Radeon HD 6480G]'
    class      = display
    subclass   = VGA

vgapci1@pci0:131:0:0:   class=0x030000 card=0x90b8103c chip=0x67711002 rev=0x00 hdr=0x00
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Caicos XTX [Radeon HD 8490 / R5 235X OEM]'
    class      = display
    subclass   = VGA

  (Note that the Sumo device is built into a laptop, an HP ProBook
  4535S.)

- This problem has been reported by others. PR 237544 is a duplicate.
  The panics I experienced had the same stack traces as shown in both
  PRs.

- PR 237544 provides an important hint that this crash did _not_ happen
  with the drm-fbsd11.2-kmod port/package. Although it has been
  deprecated, I was able to build and install the drm-fbsd11.2-kmod code
  on my FreeBSD 12.3-RELEASE system (the laptop) and the crashes went
  away.

- In my case, the panics were more likely to occur when the system was
  under load.
  The laptop seemed to trigger it more frequently (which actually made
  it easier to track down).

I tried to track the problem down by comparing the drm-fbsd11.2-kmod and
drm-fbsd12.0-kmod code and swapping bits of the 11.2 code into the 12.0
tree to see what effect that would have. Eventually I traced the problem
to the linuxkpi code, then to the dma-fence code, and finally to this
function in linuxkpi/gplv2/include/linux/dma-fence.h:

static inline void
dma_fence_signal_locked_sub(struct dma_fence *fence)
{
        struct dma_fence_cb *cur;

        while ((cur = list_first_entry_or_null(&fence->cb_list,
            struct dma_fence_cb, node)) != NULL) {
                list_del_init(&cur->node);
                spin_unlock(fence->lock);       /* <-- No! */
                cur->func(fence, cur);
                spin_lock(fence->lock);         /* <-- No! */
        }
}

Note the two lines highlighted above.

The dma_fence_signal_locked_sub() routine is shared by both
dma_fence_signal() and dma_fence_signal_locked(). The latter function is
intended to be used when the caller is already holding the fence
spinlock. The former takes the spinlock itself.

The problem is that the above code causes the spinlock to be dropped in
the case where dma_fence_signal() is called. This is not the same
behavior as the older 11.2 code: in that case, the lock is held while
the callouts are invoked. (I *think* this is also the case in the later
code in FreeBSD 13.)

I believe that dropping the lock before calling the callouts opens a
race condition window and this is what leads to the crash. It's
difficult to ascertain from the crash stack traces that this is what's
happening, but in my analysis I found that at least sometimes the
problem was that something was trying to dereference a NULL DMA fence
pointer.

I patched my copy of the code to remove the spin_unlock() and
spin_lock() calls shown above, and that seemed to fix the problem. The
laptop has not crashed since I did this.
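For readers without the kernel tree at hand, the locking shape described
above can be modeled in userspace. This is only a sketch under stated
assumptions: every mock_* name is hypothetical, the "lock" is a plain flag
rather than a real spinlock, the callback list is a simple singly linked
list instead of the linuxkpi list_head, and it is not the patch attached
to this PR. It shows callbacks being invoked with the lock still held,
which is the behavior the change is meant to restore.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the fence spinlock: a flag, not a real lock. */
struct mock_lock { bool held; };

static void mock_lock_acquire(struct mock_lock *l) { assert(!l->held); l->held = true; }
static void mock_lock_release(struct mock_lock *l) { assert(l->held); l->held = false; }

struct mock_fence;
struct mock_fence_cb;
typedef void (*mock_fence_func_t)(struct mock_fence *, struct mock_fence_cb *);

struct mock_fence_cb {
        struct mock_fence_cb *next;     /* NULL-terminated callback list */
        mock_fence_func_t func;
};

struct mock_fence {
        struct mock_lock *lock;
        struct mock_fence_cb *cb_list;
        int calls_total;
        int calls_with_lock_held;
};

/* Caller holds fence->lock; it is never dropped around the callback. */
static void mock_fence_signal_locked_sub(struct mock_fence *fence)
{
        struct mock_fence_cb *cur;

        while ((cur = fence->cb_list) != NULL) {
                fence->cb_list = cur->next;     /* unlink before calling */
                cur->next = NULL;
                cur->func(fence, cur);          /* lock still held here */
        }
}

/* Unlocked entry point: takes the lock itself, then shares the helper. */
static void mock_fence_signal(struct mock_fence *fence)
{
        mock_lock_acquire(fence->lock);
        mock_fence_signal_locked_sub(fence);
        mock_lock_release(fence->lock);
}

/* Example callback: records whether the lock was held when it ran. */
static void mock_count_cb(struct mock_fence *fence, struct mock_fence_cb *cb)
{
        (void)cb;
        fence->calls_total++;
        if (fence->lock->held)
                fence->calls_with_lock_held++;
}

/*
 * Signals a fence with two callbacks. Returns how many callbacks saw the
 * lock held, or -1 if the list was not drained or the lock was left held.
 */
static int mock_demo(void)
{
        struct mock_lock lock = { false };
        struct mock_fence_cb b = { NULL, mock_count_cb };
        struct mock_fence_cb a = { &b, mock_count_cb };
        struct mock_fence fence = { &lock, &a, 0, 0 };

        mock_fence_signal(&fence);
        if (fence.cb_list != NULL || lock.held || fence.calls_total != 2)
                return -1;
        return fence.calls_with_lock_held;
}
```

With an unlock/relock pair wrapped around cur->func(), as in the function
quoted earlier, the callback in this model would observe the lock as not
held; that is the window the analysis above points at.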
I also made the same change to the 12.2-RELEASE system with the "Cedar"
card and exercised it a bit, and that one seemed to run ok too. I have
just patched the "Caicos" machine today and so far it's running stable
as well (this is my work machine and this is my first day back at the
office for the new year).

I created a version of the drm-fbsd12.0-kmod port with this change
included as a patch, which can be downloaded from here:

http://people.freebsd.org/~wpaul/radeon/drm-fbsd12.0-kmod.tar.gz

I will also attach the patch to this PR.

Can someone please test this to see if it fixes the problem for them
too?

Note: I happen to have about 3 or 4 extra Radeon cards as spares (I
rescued these from the e-waste bin) and would be happy to send one to a
developer if that would help (assuming they have a machine with a slot
that can accommodate it).

-- 
You are receiving this mail because:
You are the assignee for the bug.