Date: Tue, 19 Apr 2022 11:47:22 +0200
From: Mateusz Guzik <mjguzik@gmail.com>
To: Doug Ambrisko <ambrisko@ambrisko.com>
Cc: freebsd-current@freebsd.org
Subject: Re: nullfs and ZFS issues
Message-ID: <CAGudoHGP5MTaF_LKanCh88ufHM6mBdzicQg-KLdfw0xGA-AxJQ@mail.gmail.com>
In-Reply-To: <CAGudoHGBfVFcsCbhC=MCRFPzCtVRYCa1pCU7cGuuJq1fOv6ttg@mail.gmail.com>
References: <Yl31Frx6HyLVl4tE@ambrisko.com> <CAGudoHEqjs4QoAqvkvW5JdSOMZ_QNjd3XU65kULxgabsOva5Xw@mail.gmail.com> <CAGudoHGBfVFcsCbhC=MCRFPzCtVRYCa1pCU7cGuuJq1fOv6ttg@mail.gmail.com>
Try this: https://people.freebsd.org/~mjg/vnlru_free_pick.diff

This is not committable but should validate whether it works fine.

On 4/19/22, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On 4/19/22, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> On 4/19/22, Doug Ambrisko <ambrisko@ambrisko.com> wrote:
>>> I've switched my laptop to use nullfs and ZFS. Previously, I used
>>> localhost NFS mounts instead of nullfs when nullfs would complain
>>> that it couldn't mount. Since that check has been removed, I've
>>> switched to nullfs only. However, every so often my laptop would
>>> get slow and the ARC evict and prune threads would consume two
>>> cores at 100% until I rebooted. I had a 1G max ARC and have
>>> increased it to 2G now. Looking into this has uncovered some
>>> issues:
>>>   - nullfs would prevent vnlru_free_vfsops from doing anything
>>>     when called from ZFS arc_prune_task
>>>   - nullfs would hang onto a bunch of vnodes unless mounted with
>>>     nocache
>>>   - nullfs and nocache would break untar. This has been fixed now.
>>>
>>> With nullfs, nocache and setting max vnodes to a low number I can
>>> keep the ARC around the max without evict and prune consuming
>>> 100% of 2 cores. This doesn't seem like the best solution, but it
>>> is better than when the ARC starts spinning.
>>>
>>> Looking into this issue with bhyve and an md drive for testing, I
>>> create a brand new zpool mounted as /test and then nullfs mount
>>> /test to /mnt. I loop through untarring the Linux kernel into the
>>> nullfs mount, rm -rf it and repeat. I set the ARC to the smallest
>>> value I can. Untarring the Linux kernel was enough to get the ARC
>>> evict and prune to spin since they couldn't evict/prune anything.
>>>
>>> Looking at vnlru_free_vfsops called from ZFS arc_prune_task I see:
>>>
>>> static int
>>> vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode *mvp)
>>> {
>>>     ...
>>>
>>>     for (;;) {
>>>         ...
>>>         vp = TAILQ_NEXT(vp, v_vnodelist);
>>>         ...
>>>
>>>         /*
>>>          * Don't recycle if our vnode is from different type
>>>          * of mount point.  Note that mp is type-safe, the
>>>          * check does not reach unmapped address even if
>>>          * vnode is reclaimed.
>>>          */
>>>         if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
>>>             mp->mnt_op != mnt_op) {
>>>             continue;
>>>         }
>>>         ...
>>>
>>> The vp ends up being on the nullfs mount and then hits the
>>> continue even though the passed-in mvp is on ZFS. If I do a hack
>>> to comment out the continue, then I see the ARC, nullfs vnodes and
>>> ZFS vnodes grow. When the ARC calls arc_prune_task, which calls
>>> vnlru_free_vfsops, the vnode counts now go down for both nullfs
>>> and ZFS. The ARC cache usage also goes down. Then they increase
>>> again until the ARC gets full, and then they go down again. So
>>> with this hack I don't need nocache passed to nullfs and I don't
>>> need to limit the max vnodes. Doing multiple untars in parallel
>>> over and over doesn't seem to cause any issues for this test. I'm
>>> not saying commenting out the continue is the fix, but it is a
>>> simple POC test.
>>>
>>
>> I don't see an easy way to say "this is a nullfs vnode holding onto
>> a zfs vnode". Perhaps the routine can be extended with issuing a
>> nullfs callback, if the module is loaded.
>>
>> In the meantime I think a good enough(tm) fix would be to check
>> that nothing was freed and fall back to the good old regular
>> cleanup without filtering by vfsops. This would be very similar to
>> what you are doing with your hack.
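
[A minimal sketch of the fallback suggested in the quoted paragraph above, not from the thread itself; it assumes vnlru_free_impl() returns the number of vnodes it freed, and the wrapper name and call site are hypothetical:]

    /*
     * Hypothetical wrapper: if the pass filtered by mnt_op could not
     * free anything (e.g. the head of the free list is dominated by
     * nullfs vnodes stacked on ZFS), retry once without the filter so
     * vnodes from other filesystems, including the nullfs layer, can
     * be recycled too.
     */
    static int
    vnlru_free_with_fallback(int count, struct vfsops *mnt_op, struct vnode *mvp)
    {
            int freed;

            freed = vnlru_free_impl(count, mnt_op, mvp);
            if (freed == 0 && mnt_op != NULL)
                    freed = vnlru_free_impl(count, NULL, mvp);
            return (freed);
    }
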
>>
>
> Now that I wrote this, perhaps an acceptable hack would be to extend
> struct mount with a pointer to the "lower layer" mount (if any) and
> patch the vfsops check to also look there.
>
>>
>>> It appears that when ZFS is asking for cached vnodes to be
>>> freed, nullfs also needs to free some up as well so that
>>> they are freed at the VFS level. It seems that vnlru_free_impl
>>> should allow some of the related nullfs vnodes to be freed so
>>> the ZFS ones can be freed, reducing the size of the ARC.
>>>
>>> BTW, I also hacked the kernel and mount to show the vnodes used
>>> per mount, i.e. mount -v:
>>>   test on /test (zfs, NFS exported, local, nfsv4acls, fsid
>>>     2b23b2a1de21ed66, vnodes: count 13846 lazy 0)
>>>   /test on /mnt (nullfs, NFS exported, local, nfsv4acls, fsid
>>>     11ff002929000000, vnodes: count 13846 lazy 0)
>>>
>>> Now I can easily see how the vnodes are used without going into
>>> ddb. On my laptop I have various vnet jails and nullfs mount my
>>> homedir into them, so pretty much everything goes through nullfs
>>> to ZFS. I'm limping along with nullfs nocache and a small number
>>> of max vnodes, but it would be nice not to need that.
>>>
>>> Thanks,
>>>
>>> Doug A.
>>>
>>
>> --
>> Mateusz Guzik <mjguzik gmail.com>
>>
>
> --
> Mateusz Guzik <mjguzik gmail.com>
>

--
Mateusz Guzik <mjguzik gmail.com>
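
[To make the "lower layer" mount idea above concrete, a rough sketch of how the filter in vnlru_free_impl() could be relaxed, assuming struct mount gained a hypothetical mnt_lower pointer that nullfs would set to the underlying mount at mount time; neither the field nor the nullfs hook exists in the quoted code:]

    /*
     * Hypothetical variant of the type check in vnlru_free_impl():
     * also accept a vnode whose mount is stacked on top of a mount
     * of the requested type, e.g. a nullfs mount over ZFS.
     */
    if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
        mp->mnt_op != mnt_op &&
        (mp->mnt_lower == NULL || mp->mnt_lower->mnt_op != mnt_op)) {
            continue;
    }
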