Date: Tue, 08 Feb 2022 05:42:18 +0000
From: bugzilla-noreply@freebsd.org
To: bugs@FreeBSD.org
Subject: [Bug 261690] NFSv4 mount on Linux client hangs during complex access patterns (gcc bootstrapping on client)
Message-ID: <bug-261690-227-aaomNH3FqU@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-261690-227@https.bugs.freebsd.org/bugzilla/>
References: <bug-261690-227@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=261690

--- Comment #2 from Joshua Kinard <kumba@gentoo.org> ---
This is a curiously apt find. I've got two old MIPS systems, same machine types (SGI Octanes), running a Linux-5.4 LTS kernel w/ custom patches and a Gentoo userland, which mount the Gentoo "portage" tree (similar to Ports) and several other common folders over NFS4.2. The NFS server is my NAS running FreeBSD 13.0-RELEASE-p7 and the exports are on a ZFS file system.

The FreeBSD kernel on the NAS box is a custom configuration and has these patches from Phab applied on top:
- https://reviews.freebsd.org/D18985
- https://reviews.freebsd.org/D29088
- https://reviews.freebsd.org/D29315
- https://reviews.freebsd.org/D29772
- https://reviews.freebsd.org/D29838
- https://reviews.freebsd.org/D30318
- https://reviews.freebsd.org/D32724

As well as these patches listed from Bugzilla PRs (some cherry-picked before their bugs were fixed and closed):
- Bug #254560
- Bug #254590
- Bug #256280
- Bug #260293
- Bug #260375
- Bug #260884

When either of those two MIPS machines builds gcc-11.2.1 snapshots, or even sometimes recent glibc, there's been a reasonable probability that it will crash with an "Oops" (a type of Linux kernel panic). The cause of the oops is tangential to the issue as far as I can tell, because what really happens is cc1plus will hit an invalid memory access attempt while compiling, which triggers the Linux/mips page faulting code on the first CPU, and while that goes on, the second CPU tries running an interrupt handler to update process ticks, which causes the kernel to attempt to dereference a NULL and thus oops the machine.

The machine isn't totally dead, though, which is pretty weird, 'cause an oops usually kills Linux. The machine will still respond to pings (intermittently) and SSH sessions will remain connected, just not respond to commands.

For the last few weeks, I have been scratching my head at the oops data, and nothing makes sense about it. This bug report, however, does. Or, at least, it's the best find I've come across so far. Many of the characteristics described in the original report match my setup (FreeBSD 13 on the NFS server, Linux 5.4.x on the client, NFS4.x mounts, compiling gcc/cc1plus, dead-ending in the Linux kernel scheduler path, etc).

I am currently trying to port the MIPS kernel for these machines up to the next Linux LTS release (5.10) to see if that changes anything.

I tried running the Perl script on the MIPS machine, and on the first run, it triggered a page fault and threw a SIGSEGV due to memory exhaustion (2GB RAM in each machine). But in multiple subsequent runs, the Perl script finished (I think), when it stopped at 226 threads before claiming it was out of memory. Could not get the machine to oops in the same way gcc/cc1plus does.

Thing is, I've been running this kind of a setup for well over a year. The two MIPS machines have been on a 5.4 kernel for at least the last six months, and up until about three weeks ago, all seemed fine. Which kinda suggests the fault may really be on the Linux-side of things.

I'll also add that unlike the original report, both MIPS machines run the actual gcc compile on a folder on the local disk via a bind mount. During the compile, there shouldn't be a whole lot of NFS chatter because the way Portage works, all of the needed package data gets saved to the build directory on the local disk. But I can't rule out that something is still slightly wacky with periodic NFS commands between the client and the server causing an issue while the machine is under stress compiling gcc.

I will have to go back through recent 5.4 stable releases and look for any recent commits for Linux NFS4.x client code to see if that could explain things. But I figured I'd describe my scenario here as well in case it offers any clues.

-- 
You are receiving this mail because:
You are the assignee for the bug.