Date: Mon, 21 Oct 2024 10:54:42 +0000
From: bugzilla-noreply@freebsd.org
To: fs@FreeBSD.org
Subject: [Bug 282169] zfs rename deadlock with mountd, df & fstat (and possibly others)
Message-ID: <bug-282169-3630-cRAMp5CO6y@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-282169-3630@https.bugs.freebsd.org/bugzilla/>
References: <bug-282169-3630@https.bugs.freebsd.org/bugzilla/>
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=282169

--- Comment #3 from Peter Eriksson <pen@lysator.liu.se> ---
I'll see if I can provoke the same deadlock. Perhaps not on a production server with many users next time though...

I've never been able to get a good kernel dump when I've tried before, but I'll see if I can get it working... The machines have around 512-640 GB of RAM and normally no swap space configured, and we're using an all-ZFS setup, so I need to set up some special disks for the dump.

I've been looking through the procstat output in order to try to identify some suspicious processes that might have taken some lock, but no obvious candidates pop up for me. Perhaps "df" or "procstat" itself. procstat seems to be inside some function called sysctl_root_handler_locked.

At the time of the deadlock, besides me doing a lot of "zfs rename" operations, there was a backup running (using rsync) that might possibly have been accessing some of the filesystems I was renaming.

Also I have the system monitoring script that runs every minute doing stuff like "procstat -kk -a", "fstat", "zfs-stats" (and more). It is protected with a lock file so I won't end up running a gazillion copies in case something takes a very long time, and it definitely ran a number of times while all this was going on (that script runs 24/7 and has for many years now). The last output from that script, at 00:27 (blocked on "fstat"), indicates (from the "ps auxwww" output) that the following was happening at the time:

1. "nzfs clean -y -P10 -L500 -e -E :ttl -T 8h -r -v -V1 DATA/students" ("nzfs" is a special local version of the "zfs" command that implements a "clean" option to more efficiently handle snapshot deletion), but that was cleaning up stuff under "DATA/students"; the archiving I was doing was under "DATA/staff".
2. 00:23 a root-owned <defunct> process started (zpool iostat).
3. 00:20 "fstat" was started and blocked.
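The lock-file guard on the monitoring script described above can be sketched roughly as follows. This is an illustrative assumption, not the actual script: the helper name `run_exclusively` and the lock path are hypothetical, and it uses a non-blocking `flock(2)` so that a new run simply skips when a previous run (for example one stuck in "fstat") still holds the lock.

```python
import fcntl
import os

# Hypothetical lock path; the real script's lock file location is not known.
LOCK_PATH = "/var/run/sysmon.lock"


def run_exclusively(lock_path, job):
    """Run job() only if no other copy holds the lock.

    Returns True if job() ran, False if another run still holds the lock.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            # Non-blocking exclusive lock: fail immediately instead of queueing.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            # A previous run is still active (e.g. hung in a deadlocked syscall).
            return False
        job()
        return True
    finally:
        # Closing the descriptor releases the flock.
        os.close(fd)
```

One consequence of this design is that a run hung in the kernel keeps holding the lock, so later runs skip rather than pile up, which is why the script's output can simply stop at the point where something blocked.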
Not many active users at that time though (around midnight :-) but some clients were connected.

Looking at the saved /var/log/messages output from that time, mountd complained at 00:18 about a number of student filesystems with wrong sharenfs attributes (triggered by a zfs rename operation), and then from 00:19 to 00:26 there were some rsync errors about change_dir to staff/<user> failing (since those were being archived at that time).

Ah well...

--
You are receiving this mail because:
You are the assignee for the bug.