Date: Thu, 31 Aug 2023 15:05:27 -0400
From: Garrett Wollman <wollman@bimajority.org>
To: freebsd-stable@freebsd.org
Cc: Mateusz Guzik <mjguzik@gmail.com>
Subject: Re: Did something change with ZFS and vnode caching?
Message-ID: <25840.58487.468791.344785@hergotha.csail.mit.edu>
In-Reply-To: <25831.30103.446606.733311@hergotha.csail.mit.edu>
References: <25827.33600.611577.665054@hergotha.csail.mit.edu> <25831.30103.446606.733311@hergotha.csail.mit.edu>
<<On Thu, 24 Aug 2023 11:21:59 -0400, Garrett Wollman <wollman@bimajority.org> said:

> Any suggestions on what we should monitor or try to adjust?

To bring everyone up to speed: earlier this month we upgraded our NFS servers from 12.4 to 13.2 and found that our backup system was absolutely destroying NFS performance, which had not happened before.

With some pointers from mjg@ and the thread relating to ZFS performance on current@, I built a stable/13 kernel (b5a5a06fc012d27c6937776bff8469ea465c3873) and installed it on one of our NFS servers for testing, then removed the band-aid on our backup system and allowed it to go as parallel as it wanted.

Unfortunately, we do not control the scheduling of backup jobs, so it's difficult to tell whether the changes made any difference. Each backup job does a parallel breadth-first traversal of a given filesystem, using as many as 150 threads per job (the backup client auto-scales itself), and we sometimes see as many as eight jobs running in parallel on one file server. (There are 17, soon to be 18, file servers.)

When the performance of NFS's backing store goes to hell, the NFS server is not able to put enough back-pressure on the clients to stop them from writing, and eventually the server runs out of 4k jumbo mbufs and crashes. This at least is a known failure mode, going back a decade. Before it gets to that point, the NFS server also auto-scales itself, so it's in competition with the backup client over who can create the most threads and ultimately allocate the most vnodes.

Last night, while I was watching, the first dozen or so backups went fine, with no impact on NFS performance, until the backup server decided to schedule two, and then three, parallel scans of filesystems containing about 35 million files each. These tend to take an hour or four, depending on how much changed data is identified during the scan, but most of the time the client is just sitting in a readdir()/fstatat() loop with a shared work queue for parallelism. (That's my interpretation based on its activity; we do not have source code. A rough sketch of what I imagine each worker thread is doing appears below, after the flame-graph notes.)

Once these scans were underway, I observed the same symptoms as on releng/13.2, with lots of lock contention and the vnlru process running almost constantly (95% CPU, so most of a core on this 20-core/40-thread server). From our monitoring, the server was recycling about 35k vnodes per second during this period. I wasn't monitoring these statistics before, so I don't have historical comparisons.

My working assumption, such as it is, is that the switch from OpenSolaris ZFS to OpenZFS in 13.x moved some bottlenecks around: the backup client previously got tangled up higher in the ZFS code, and now it can put real pressure on the vnode allocator.

During the hour that the three backup clients were running, I was able to run mjg@'s dtrace script and generate a flame graph, which is viewable at <https://people.csail.mit.edu/wollman/dtrace-terad.2.svg>. This just shows what the backup clients themselves are doing, and not what's going on in the vnlru or nfsd processes. You can ignore all the umtx stacks, since that's just coordination between the threads in the backup client. On the "oncpu" side, the trace captures a lot of time spent spinning in lock_delay(), although I don't see where the alleged call site acquires any locks, so there must have been some inlining.
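As an aside, here is roughly the loop I imagine each backup-client worker runs. This is purely my reconstruction from observed behavior (we don't have the vendor's source), and it's written single-threaded for clarity; the real client apparently runs up to ~150 of these against a shared work queue:

	/*
	 * Sketch of the presumed scan loop: breadth-first traversal
	 * that readdir()s each directory and fstatat()s every entry.
	 */
	#include <sys/stat.h>

	#include <dirent.h>
	#include <err.h>
	#include <fcntl.h>
	#include <limits.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	struct dirq {			/* trivial FIFO of directory paths */
		char	**paths;
		size_t	head, tail, cap;
	};

	static void
	enqueue(struct dirq *q, const char *path)
	{
		if (q->tail == q->cap) {
			q->cap = q->cap ? q->cap * 2 : 64;
			q->paths = realloc(q->paths, q->cap * sizeof(char *));
			if (q->paths == NULL)
				err(1, "realloc");
		}
		if ((q->paths[q->tail++] = strdup(path)) == NULL)
			err(1, "strdup");
	}

	int
	main(int argc, char **argv)
	{
		struct dirq q = { 0 };
		struct dirent *de;
		struct stat sb;
		char child[PATH_MAX];
		DIR *d;

		enqueue(&q, argc > 1 ? argv[1] : ".");

		while (q.head < q.tail) {	/* breadth-first */
			char *dir = q.paths[q.head++];

			if ((d = opendir(dir)) == NULL) {
				free(dir);
				continue;
			}
			while ((de = readdir(d)) != NULL) {
				if (strcmp(de->d_name, ".") == 0 ||
				    strcmp(de->d_name, "..") == 0)
					continue;
				/* One stat -- hence one vnode -- per entry. */
				if (fstatat(dirfd(d), de->d_name, &sb,
				    AT_SYMLINK_NOFOLLOW) == -1)
					continue;
				if (S_ISDIR(sb.st_mode)) {
					snprintf(child, sizeof(child), "%s/%s",
					    dir, de->d_name);
					enqueue(&q, child);
				}
			}
			closedir(d);
			free(dir);
		}
		return (0);
	}

Run ~150 of these threads over 35 million files and essentially every fstatat() is a cold-cache vnode instantiation, which would be consistent with the allocator pressure we're seeing.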
On the "offcpu" side, it's clear that there's still a lot of time spent sleeping on vnode_list_mtx in the vnode allocation pathway, both directly from vn_alloc_hard() and also from vnlru_free_impl() after the mutex is dropped and then needs to be reacquired. In ZFS, there's also a substantial number of waits (shown as sx_xlock_hard stack frames), in both the easy case (a free vnode was readily available) and the hard case where vn_alloc_hard() calls vnlru_free_impl() and eventually zfs_inactive() to reclaim a vnode. Looking into the implementation, I noted that ZFS uses a 64-entry hash lock for this, and I'm wondering if there's an issue with false sharing. Can anyone with ZFS experience speak to that? If I increased ZFS_OBJ_MTX_SZ to 128 or 256, would it be likely to hurt something else (other than memory usage)? Do we even know that the low-order 6 bits of ZFS object IDs are actually uniformly distributed? -GAWollman