Date:      Thu, 31 Aug 2023 15:05:27 -0400
From:      Garrett Wollman <wollman@bimajority.org>
To:        freebsd-stable@freebsd.org
Cc:        Mateusz Guzik <mjguzik@gmail.com>
Subject:   Re: Did something change with ZFS and vnode caching?
Message-ID:  <25840.58487.468791.344785@hergotha.csail.mit.edu>
In-Reply-To: <25831.30103.446606.733311@hergotha.csail.mit.edu>
References:  <25827.33600.611577.665054@hergotha.csail.mit.edu> <25831.30103.446606.733311@hergotha.csail.mit.edu>

<<On Thu, 24 Aug 2023 11:21:59 -0400, Garrett Wollman <wollman@bimajority.org> said:

> Any suggestions on what we should monitor or try to adjust?

To bring everyone up to speed: earlier this month we upgraded our NFS
servers from 12.4 to 13.2 and found that our backup system was
absolutely destroying NFS performance, which had not happened before.

With some pointers from mjg@ and from the thread about ZFS
performance on current@, I built a stable/13 kernel
(b5a5a06fc012d27c6937776bff8469ea465c3873) and installed it on one of
our NFS servers for testing, then removed the band-aid on our backup
system and allowed it to go as parallel as it wanted.

Unfortunately, we do not control the scheduling of backup jobs, so
it's difficult to tell whether the changes made any difference.  Each
backup job does a parallel breadth-first traversal of a given
filesystem, using as many as 150 threads per job (the backup client
auto-scales itself), and we sometimes see as many as eight jobs
running in parallel on one file server.  (There are 17, soon to be 18,
file servers.)  

When the performance of NFS's backing store goes to hell, the NFS
server is not able to put back-pressure on the clients hard enough to
stop them from writing, and eventually the server runs out of 4k jumbo
mbufs and crashes.  This at least is a known failure mode, going back
a decade.  Before it gets to this point, the NFS server also
auto-scales itself, so it's in competition with the backup client over
who can create the most threads and ultimately allocate the most
vnodes.

Last night, while I was watching, the first dozen or so backups went
fine, with no impact on NFS performance, until the backup server
decided to schedule two, and then three, parallel scans of
filesystems containing about 35 million files each.  These tend to
take an hour or four, depending on how much changed data is
identified during the scan, but most of the time the client is just
sitting in a readdir()/fstatat() loop with a shared work queue for
parallelism.
(That's my interpretation based on its activity; we do not have source
code.)
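For concreteness, here is a rough sketch of the sort of loop I
imagine each scanner thread running.  This is purely my guess: the
queue_push() and record() helpers are hypothetical, since we do not
have the client's source, but the readdir()/fstatat() shape is what
its syscall activity suggests:

    /* Hypothetical sketch of one scanner thread -- not the vendor's code. */
    #include <sys/stat.h>
    #include <dirent.h>
    #include <fcntl.h>
    #include <limits.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Shared work queue and catalog hooks (assumed to exist elsewhere). */
    extern void queue_push(const char *path);
    extern void record(const char *path, const struct stat *st);

    static void
    scan_one_dir(const char *path)
    {
            int dfd = open(path, O_RDONLY | O_DIRECTORY);
            if (dfd < 0)
                    return;
            DIR *d = fdopendir(dfd);
            if (d == NULL) {
                    close(dfd);
                    return;
            }
            struct dirent *de;
            while ((de = readdir(d)) != NULL) {
                    if (strcmp(de->d_name, ".") == 0 ||
                        strcmp(de->d_name, "..") == 0)
                            continue;
                    struct stat st;
                    /* Every fstatat() is a lookup that instantiates a vnode. */
                    if (fstatat(dfd, de->d_name, &st, AT_SYMLINK_NOFOLLOW) != 0)
                            continue;
                    record(path, &st);      /* compare against the catalog */
                    if (S_ISDIR(st.st_mode)) {
                            char sub[PATH_MAX];
                            snprintf(sub, sizeof(sub), "%s/%s",
                                path, de->d_name);
                            queue_push(sub);  /* breadth-first: defer subdirs */
                    }
            }
            closedir(d);                    /* also closes dfd */
    }

With ~150 of these threads per job pulling from one queue, every
directory entry costs a lookup and potentially a fresh vnode, which
is consistent with the vnode churn described below.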

Once these scans were underway, I observed the same symptoms as on
releng/13.2, with lots of lock contention and the vnlru process
running almost constantly (95% CPU, so most of a core on this
20-core/40-thread server).  From our monitoring, the server was
recycling about 35k vnodes per second during this period.  I wasn't
monitoring these statistics before, so I don't have historical
comparisons.  My working assumption, such as it is, is that the switch
from OpenSolaris ZFS to OpenZFS in 13.x moved some bottlenecks around,
so that the backup client, which previously got tangled up higher in
the ZFS code, can now put real pressure on the vnode allocator.
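For anyone who wants to watch the same thing: I have been sampling
the vfs.* vnode counters once a second.  Something like the quick
sketch below is enough to see the recycling rate; the counter names
are what I see on stable/13, so adjust if one of them is missing on
your release:

    /* Quick sketch: sample vnode-cache counters once a second. */
    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    static uint64_t
    get64(const char *name)
    {
            uint64_t v = 0;
            size_t len = sizeof(v);

            if (sysctlbyname(name, &v, &len, NULL, 0) != 0)
                    return (0);
            return (v);
    }

    int
    main(void)
    {
            uint64_t pcreated = get64("vfs.vnodes_created");
            uint64_t precycled = get64("vfs.recycles_free");

            for (;;) {
                    sleep(1);
                    uint64_t created = get64("vfs.vnodes_created");
                    uint64_t recycled = get64("vfs.recycles_free");
                    printf("numvnodes %ju created/s %ju recycled/s %ju\n",
                        (uintmax_t)get64("vfs.numvnodes"),
                        (uintmax_t)(created - pcreated),
                        (uintmax_t)(recycled - precycled));
                    pcreated = created;
                    precycled = recycled;
            }
    }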

During the hour that the three backup clients were running, I was able
to run mjg@'s dtrace script and generate a flame graph, which is
viewable at <https://people.csail.mit.edu/wollman/dtrace-terad.2.svg>.
This just shows what the backup clients themselves are doing, and not
what's going on in the vnlru or nfsd processes.  You can ignore all
the umtx stacks since that's just coordination between the threads in
the backup client.

On the "oncpu" side, the trace captures a lot of time spent spinning
in lock_delay(), although I don't see where the alleged call site
acquires any locks, so there must have been some inlining.  On the
"offcpu" side, it's clear that there's still a lot of time spent
sleeping on vnode_list_mtx in the vnode allocation pathway, both
directly from vn_alloc_hard() and also from vnlru_free_impl() after
the mutex is dropped and then needs to be reacquired.
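In userland terms, the pattern I think we are paying for looks
roughly like the sketch below (pthread stand-ins, schematic only, not
the actual vfs_subr.c code): the global list lock has to be dropped
around each expensive per-vnode reclaim and then re-taken, and the
re-take is where the offcpu stacks show us parked:

    /* Schematic illustration only -- not the kernel's code. */
    #include <pthread.h>

    struct vnode_like;                       /* hypothetical */
    struct vnode_like *pick_next_free(void); /* walks the global free list */
    void reclaim_one(struct vnode_like *);   /* expensive; may call into
                                                ZFS (zfs_inactive) */

    /* Stand-in for vnode_list_mtx. */
    static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

    void
    free_some_vnodes(int count)
    {
            pthread_mutex_lock(&list_lock);
            while (count-- > 0) {
                    struct vnode_like *vp = pick_next_free();

                    /* Dropped around the expensive part... */
                    pthread_mutex_unlock(&list_lock);
                    reclaim_one(vp);
                    /* ...and this reacquisition is where we sleep. */
                    pthread_mutex_lock(&list_lock);
            }
            pthread_mutex_unlock(&list_lock);
    }

Every allocating thread wants that same lock at the same time, so
each trip around the loop is another chance to lose it to somebody
else, which is consistent with how much offcpu time the flame graph
attributes to vnode_list_mtx.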

In ZFS, there's also a substantial number of waits (shown as
sx_xlock_hard stack frames), in both the easy case (a free vnode was
readily available) and the hard case where vn_alloc_hard() calls
vnlru_free_impl() and eventually zfs_inactive() to reclaim a vnode.
Looking into the implementation, I noted that ZFS uses a 64-entry hash
lock for this, and I'm wondering if there's an issue with false
sharing.  Can anyone with ZFS experience speak to that?  If I
increased ZFS_OBJ_MTX_SZ to 128 or 256, would it be likely to hurt
something else (other than memory usage)?  Do we even know that the
low-order 6 bits of ZFS object IDs are actually uniformly distributed?
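To make the false-sharing worry concrete: if the 64 mutexes sit in a
plain array, several of them share each cache line, so two threads
holding *different* objects can still bounce the same line between
CPUs.  The sketch below is generic userland code, not the ZFS source;
it just shows the shape of the table and the kind of padding I have
in mind:

    /* Generic sketch of an object-ID hash-lock table (not zfs_znode.c). */
    #include <pthread.h>
    #include <stdint.h>

    #define HOLD_LOCKS      64      /* cf. ZFS_OBJ_MTX_SZ */
    #define CACHE_LINE      64

    /* aligned() rounds the struct size up to a full cache line, so
     * neighbouring locks never share one. */
    struct padded_lock {
            pthread_mutex_t mtx;
    } __attribute__((__aligned__(CACHE_LINE)));

    static struct padded_lock hold_lock[HOLD_LOCKS];

    static void
    hold_lock_init(void)
    {
            for (int i = 0; i < HOLD_LOCKS; i++)
                    pthread_mutex_init(&hold_lock[i].mtx, NULL);
    }

    static pthread_mutex_t *
    obj_lock(uint64_t obj_id)
    {
            /* With a 64-entry table only the low-order 6 bits of the
             * object ID matter, hence the uniformity question above. */
            return (&hold_lock[obj_id & (HOLD_LOCKS - 1)].mtx);
    }

Whether padding (or simply widening the table) buys anything
obviously depends on whether the contention is on the locks
themselves or just on their cache lines, and on how those low-order
bits are actually distributed.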

-GAWollman



