Date: Thu, 24 Jun 2021 19:30:52 -0700
From: Craig Leres <leres@freebsd.org>
To: freebsd-hackers@freebsd.org
Subject: nvidia_drv.so/Xorg crashes
Message-ID: <ee9b36a1-1e50-7190-8be6-7cc1f13cec42@freebsd.org>

I have four (12.2-RELEASE) systems between the office and home that are full or part time FreeBSD desktops. All have pny nvidia quadro 410s. These have been mostly working well for about 6 years.

For months I've been seeing screen corruption when using chrome or kicad; firefox and thunderbird are always ok. But just starting eeschema always damages the root window a little. And it's common when running chrome/kicad to see lines in the console xterm window jump up and down two lines.

But for the last week or two Xorg has been crashing:

    [ 74574.029] (EE) Backtrace:
    [ 74574.032] (EE) 0: /usr/local/bin/Xorg (?+0x0) [0x41c98a]
    [ 74574.033] (EE) unw_get_proc_name failed: no unwind info found [-10]
    [ 74574.033] (EE) 1: /lib/libthr.so.3 (?+0x0) [0x800929b7e]
    [ 74574.035] (EE) unw_get_proc_name failed: no unwind info found [-10]
    [ 74574.035] (EE) 2: /lib/libthr.so.3 (?+0x0) [0x80092913f]
    [ 74574.037] (EE) 3: ? (?+0x0) [0x7ffffffff003]
    [ 74574.038] (EE) 4: /usr/local/lib/xorg/modules/drivers/nvidia_drv.so (?+0x0) [0x801cc8c20]
    [ 74574.038] (EE)
    [ 74574.038] (EE) Segmentation fault at address 0x50
    [ 74574.038] (EE) Fatal server error:
    [ 74574.038] (EE) Caught signal 11 (Segmentation fault). Server aborting

The crashes are always preceded by at least one nvidia "Xid" kernel message:

    Jun 23 ... kernel: : NVRM: Xid (PCI:0000:05:00): 69, pid=6327, Class Error: ChId 0009, Class 0000902d, Offset 000008b4, Data fffffffb, ErrorCode 00000004
    Jun 23 ... kernel: : NVRM: Xid (PCI:0000:05:00): 69, pid=6327, Class Error: ChId 0009, Class 0000902d, Offset 000008b4, Data fffffffb, ErrorCode 00000004
    Jun 23 ... kernel: : NVRM: Xid (PCI:0000:05:00): 69, pid=6327, Class Error: ChId 0009, Class 0000902d, Offset 000008b4, Data ffffffb9, ErrorCode 00000004
    Jun 23 ... kernel: : pid 6327 (Xorg), jid 0, uid 0: exited on signal 6

Worth noting is that it was not unusual to see many Xid ErrorCode 4 kernel messages without crashes. (And it's the only ErrorCode I've ever seen.)
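Since Xid messages also show up without crashes, it might help to tally them by Offset/Data/ErrorCode and see whether any particular combination lines up with the crashes. Here is a minimal sketch, assuming the kernel lines keep exactly the format shown above; the names XID_RE, parse_xid and summarize are made up for illustration and are not part of any nvidia or FreeBSD tool.

```python
import re
from collections import Counter

# Assumption: every Xid line looks like the samples quoted above, e.g.
# "NVRM: Xid (PCI:0000:05:00): 69, pid=6327, Class Error: ChId 0009,
#  Class 0000902d, Offset 000008b4, Data fffffffb, ErrorCode 00000004"
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>\d+), .*?"
    r"Offset (?P<offset>[0-9a-fA-F]+), "
    r"Data (?P<data>[0-9a-fA-F]+), "
    r"ErrorCode (?P<code>[0-9a-fA-F]+)"
)


def parse_xid(line):
    """Return the Xid fields of one log line as a dict, or None if no match."""
    match = XID_RE.search(line)
    return match.groupdict() if match else None


def summarize(lines):
    """Count Xid messages per (xid, offset, data, errorcode) combination."""
    counts = Counter()
    for line in lines:
        rec = parse_xid(line)
        if rec is not None:
            counts[(rec["xid"], rec["offset"], rec["data"], rec["code"])] += 1
    return counts
```

Feeding it the system log, for example summarize(open("/var/log/messages")) if that is where the kernel messages end up on your system, gives a Counter keyed by the four fields; lines that are not Xid messages are skipped.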
My first thought was a bad nvidia-driver version. But after working my way back, one version at a time, to 460.39 (circa February 2021 -- months before the first crashes) I gave up on that theory. My next guess was bad hardware, but I swapped quadros between two systems and the crashes persisted.

Yesterday Xorg crashed often enough for me to zero in on the trigger; it's the use of tvtwm's f.forcemove action (which is like f.move but allows moving a window off the screen) when I move a window slightly off the bottom of the screen. Here's the .twmrc binding I use:

    Button2 = m s : window : f.forcemove

The crash doesn't happen 100% of the time but it's pretty easy to trigger with half a dozen windows open. Just grab a window and randomly dip part of it past the bottom of the screen.

So my new theory is that a frame buffer operation in one of the libraries in the path between Xorg and the nvidia driver has regressed and is asking the nvidia driver to do something that causes it to do something bad.

I run a custom version of tvtwm but was able to easily crash Xorg using x11-wm/twm on a spare quadro 410 workstation; the key is f.forcemove.

Does anybody know what this issue is? Which recently changed port libraries are likely candidates that I could try downgrading? Should I try opening a ticket with nvidia? Should I try even older 460.XX drivers? What else can I try?

(Thanks for reading this far!)

Craig