Date:      Mon, 21 Oct 2024 10:54:42 +0000
From:      bugzilla-noreply@freebsd.org
To:        fs@FreeBSD.org
Subject:   [Bug 282169] zfs rename deadlock with mountd, df & fstat (and possibly others)
Message-ID:  <bug-282169-3630-cRAMp5CO6y@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-282169-3630@https.bugs.freebsd.org/bugzilla/>
References:  <bug-282169-3630@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=282169

--- Comment #3 from Peter Eriksson <pen@lysator.liu.se> ---
I'll see if I can provoke the same deadlock. Perhaps not on a production
server with many users next time though...

I've never been able to get a good kernel dump when I've tried before,
but I'll see if I can get it working... The machines have something like
512-640GB of RAM and normally no swap space configured, and we're using an
all-ZFS setup, so I need to set up some special disks for the dump.


I've been looking through the procstat output to try to identify some
suspicious processes that might have taken some lock, but no obvious
candidates pop up for me. Perhaps "df" or "procstat" itself.

procstat seems to be inside some function called sysctl_root_handler_locked.

At the time of the deadlock, besides me doing a lot of "zfs rename"
operations, there was a backup running (using rsync) that might have been
accessing some of the filesystems I was renaming.

Also, I have a system monitoring script that runs every minute doing stuff
(protected with a lock file so I won't end up running a gazillion copies in
case something takes a very long time) like "procstat -kk -a", "fstat",
"zfs-stats" (and more stuff), and it definitely ran a number of times
during that period (that script runs 24/7 and has for many years now). The
last output from that script, at 00:27 (blocked on "fstat"), indicates (from
the "ps auxwww" output) that what was happening at the time was:

1. "nzfs clean -y -P10 -L500 -e -E :ttl -T 8h -r -v -V1 DATA/students"
 ("nzfs" is a special local version of the "zfs" command that implements a
"clean" option to more efficiently handle snapshot deletion), but that was
cleaning up stuff under "DATA/students"; the archiving I was doing was under
"DATA/staff".

2. 00:23 root-owned <defunct> process started (zpool iostat)

3. 00:20 "fstat" was started and blocked.
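
As an illustration of the lock-file protection mentioned above (this is not
the actual script, which isn't included in this report), a minimal Python
sketch could look roughly like this. The lock path and the function names are
hypothetical; the command list just reuses the commands named above.

```python
import fcntl
import subprocess

# Hypothetical defaults: the real script's lock path and full command list
# are not shown in this report.
LOCK_PATH = "/var/run/sysmon.lock"
COMMANDS = [["procstat", "-kk", "-a"], ["fstat"], ["zfs-stats"]]


def acquire_lock(path):
    """Return an open, exclusively locked file, or None if another run holds it."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the lock.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def run_monitor(commands=COMMANDS, lock_path=LOCK_PATH):
    """Run each command once; skip the whole pass if an earlier pass is still running."""
    lock = acquire_lock(lock_path)
    if lock is None:
        return None  # an earlier pass still holds the lock (e.g. blocked in fstat)
    try:
        return [
            subprocess.run(cmd, capture_output=True).returncode
            for cmd in commands
        ]
    finally:
        lock.close()  # closing the file releases the flock
```

Calling run_monitor() once a minute (from cron, say) would then give at most
one running pass at a time: if one pass hangs in "fstat", later invocations
return right away instead of piling up.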


Not many active users at that time though (around midnight :-) but some
clients were connected.

Looking at the saved /var/log/messages output from that time, mountd
complained at 00:18 about a number of students filesystems with wrong
sharenfs attributes (triggered by a zfs rename operation), and then at
00:19-00:26 there were some rsync errors about change_dir to staff/<user>
failing (since they were archived at that time).

Ah well...

-- 
You are receiving this mail because:
You are the assignee for the bug.


