Date:      Thu, 4 Jul 2024 09:10:07 +0900
From:      Tomoaki AOKI <junchoon@dec.sakura.ne.jp>
To:        stable@freebsd.org
Subject:   Re: x11/nvidia-driver fails on 14-STABLE/amd64
Message-ID:  <20240704091007.5dc5f7a41bf12f8f764a896d@dec.sakura.ne.jp>
In-Reply-To: <20240703082414.572553dabee65d0f6dd129a1@dec.sakura.ne.jp>
References:  <2458ffc88ffac503076c06cccafa0dc0@chen.org.nz> <20240703082414.572553dabee65d0f6dd129a1@dec.sakura.ne.jp>

On Wed, 3 Jul 2024 08:24:14 +0900
Tomoaki AOKI <junchoon@dec.sakura.ne.jp> wrote:

> On Tue, 02 Jul 2024 22:11:45 +0000
> jonc@chen.org.nz wrote:
> 
> > Hi,
> > 
> > I updated my STABLE-14/amd64 to 1a0314d6e30554fc2b07caa5121b00956f416cc4 (ctladm: Fix a race....), and it appears that the latest kernel update breaks x11/nvidia-driver. The system panics when X starts up. Just to be sure, I have rebuilt and reinstalled x11/nvidia-driver with the updated /usr/src present. /var/log/messages has the following errors:
> > 
> > Jul  3 09:50:29 stormbringer kernel: ACPI Warning: \_SB.PC00.PEG1.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20221020/nsarguments-212)
> > Jul  3 09:50:29 stormbringer kernel: Firmware Error (ACPI): Failure creating named object [\_SB.PC00.PEG1.PEGP._DSM.USRG], AE_ALREADY_EXISTS (20221020/dsfield-352)
> > Jul  3 09:50:29 stormbringer kernel: ACPI Error: AE_ALREADY_EXISTS, CreateBufferField failure (20221020/dswload2-639)
> > Jul  3 09:50:29 stormbringer kernel: ACPI Error: Aborting method \_SB.PC00.PEG1.PEGP._DSM due to previous error (AE_ALREADY_EXISTS) (20221020/psparse-689)
> > Jul  3 09:51:52 stormbringer syslogd: kernel boot file is /boot/kernel/kernel
> > Jul  3 09:51:52 stormbringer kernel: NVRM: GPU at PCI:0000:01:00: GPU-db6a2e9b-ba08-3668-c104-d55596af9efb
> > Jul  3 09:51:52 stormbringer kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
> > Jul  3 09:51:52 stormbringer kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
> > Jul  3 09:51:52 stormbringer kernel: 
> > Jul  3 09:51:52 stormbringer syslogd: last message repeated 1 times
> > Jul  3 09:51:52 stormbringer kernel: Fatal trap 12: page fault while in kernel mode
> > Jul  3 09:51:52 stormbringer kernel: cpuid = 14; apic id = 38
> > Jul  3 09:51:52 stormbringer kernel: fault virtual address  = 0x0
> > Jul  3 09:51:52 stormbringer kernel: fault code     = supervisor read data, page not present
> > Jul  3 09:51:52 stormbringer kernel: instruction pointer    = 0x20:0xffffffff85bae56c
> > Jul  3 09:51:52 stormbringer kernel: stack pointer          = 0x28:0xfffffe01a894e5e0
> > Jul  3 09:51:52 stormbringer kernel: frame pointer          = 0x28:0xfffffe01adc85ce0
> > Jul  3 09:51:52 stormbringer kernel: code segment       = base 0x0, limit 0xfffff, type 0x1b
> > Jul  3 09:51:52 stormbringer kernel:            = DPL 0, pres 1, long 1, def32 0, gran 1
> > Jul  3 09:51:52 stormbringer kernel: processor eflags   = interrupt enabled, resume, IOPL = 0
> > Jul  3 09:51:52 stormbringer kernel: current process        = 1954 (Xorg)
> > Jul  3 09:51:52 stormbringer kernel: rdi: fffffe01a951f000 rsi: fffffe01ae26f000 rdx: 0000000000000001
> > Jul  3 09:51:52 stormbringer kernel: rcx: 0000000000000000  r8: 00000000000000c0  r9: fffffe01adc858f0
> > Jul  3 09:51:52 stormbringer kernel: rax: 0000000000000000 rbx: fffffe01ae26f000 rbp: fffffe01adc85ce0
> > Jul  3 09:51:52 stormbringer kernel: r10: 000000005237a738 r11: 0000000066847626 r12: 0000000000000000
> > Jul  3 09:51:52 stormbringer kernel: r13: fffffe01a951f000 r14: 0000000000000001 r15: fffffe01ade09008
> > Jul  3 09:51:52 stormbringer kernel: trap number        = 12
> > Jul  3 09:51:52 stormbringer kernel: panic: page fault
> > Jul  3 09:51:52 stormbringer kernel: cpuid = 14
> > Jul  3 09:51:52 stormbringer kernel: time = 1719957030
> > Jul  3 09:51:52 stormbringer kernel: KDB: stack backtrace:
> > Jul  3 09:51:52 stormbringer kernel: #0 0xffffffff80b8002d at kdb_backtrace+0x5d
> > Jul  3 09:51:52 stormbringer kernel: #1 0xffffffff80b32c51 at vpanic+0x131
> > Jul  3 09:51:52 stormbringer kernel: #2 0xffffffff80b32b13 at panic+0x43
> > Jul  3 09:51:52 stormbringer kernel: #3 0xffffffff8100194b at trap_fatal+0x40b
> > Jul  3 09:51:52 stormbringer kernel: #4 0xffffffff81001996 at trap_pfault+0x46
> > Jul  3 09:51:52 stormbringer kernel: #5 0xffffffff80fd8458 at calltrap+0x8
> > 
> > When I reverted to my previous kernel, X started up without any issues.
> > 
> > Cheers
> > --
> > Jonathan Chen <jonc@chen.org.nz>
> 
> Did you try rebuilding x11/nvidia-driver from ports AFTER INSTALLING
> THE NEW KERNEL AND WORLD?
> 
> If yes, one of the commits AFTER commit
> 620a6a54bb7bb6e1c5607092b6ec49e353e0925f [1] must have broken something.
> (I'm on that commit, and x11/nvidia-driver 555.58, built with
> DISTVERSION overridden and NO_CHECKSUM=YES set to try this new
> feature branch of the driver, is working fine.)
> In that case, if your old build is older than this and you want the fix
> for FreeBSD-SA-24:04.openssh, the above-mentioned commit is worth trying.
> 
> Additional note:
> If you are using the graphics/nvidia-drm-[515|61]-kmod port,
> you need to apply the patch attached to Bug 279539 [2] to build it.
> 
> And if you want to test the 555 series of the nvidia-drm*-kmod driver,
> you need to apply the diff at Differential revision D45400 on
> Phabricator [3], too.
> 
> [1]
> https://cgit.freebsd.org/src/commit/?h=stable/14&id=620a6a54bb7bb6e1c5607092b6ec49e353e0925f
> 
> [2] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=279539
> 
> [3] https://reviews.freebsd.org/D45400
> 
> -- 
> Tomoaki AOKI    <junchoon@dec.sakura.ne.jp>

I updated stable/14 to commit 342053a66c161c12f6887efac913c80040959ae8,
which is the commit right after the one you reported.

X starts as usual.
So none of the commits between
 620a6a54bb7bb6e1c5607092b6ec49e353e0925f
and
 342053a66c161c12f6887efac913c80040959ae8
seems to matter, at least for 555.58 from the new feature branch of
x11/nvidia-driver.

Note that I'm running on an NVIDIA discrete GPU, with the Intel iGPU
in my CPU disabled via firmware configuration.

The same version of the driver also works on the main branch of base at
commit 59c21ed6e811c753f7806766ba45a5bfa71ae2ed.

As my main branch install is just a test bed environment, it's not yet
updated to the commit fixing openssh. My daily driver is stable/14.

BTW, how do you start X? And which commit was your working kernel
built from?

If you auto-start X on boot via something like xdm, that could mask at
least one possible cause of the panic.
If you are loading nvidia[-modeset].ko via /boot/loader.conf, don't.
Remove or comment out the line in /boot/loader.conf and add the module
to the kld_list variable in /etc/rc.conf[.local].
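
As a minimal sketch of that change, assuming nvidia-modeset is the
module you load and that kld_list is not already set to something else
in your rc.conf (if it is, append to the existing value instead):

```sh
# /boot/loader.conf: remove or comment out a line like this one
# (shown commented out; the exact line in your file may differ)
# nvidia-modeset_load="YES"

# /etc/rc.conf or /etc/rc.conf.local: have rc(8) load the module
# after the kernel has booted, instead of the boot loader
kld_list="nvidia-modeset"
```

With this in place the module is loaded during rc startup rather than
by the loader.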

If you start X from the command line with startx and the cause was as
above, you should see the panic when the module is loaded.

If your previous working kernel was old enough that the update counts
as a "giant step", i.e. from an older stable branch like stable/12,
things could be much more complicated.

-- 
Tomoaki AOKI    <junchoon@dec.sakura.ne.jp>


