Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 26 Jun 2014 16:48:47 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Bryan Drewery <bdrewery@FreeBSD.org>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: getdirentries cookies usage outside of UFS [PATCH]
Message-ID:  <1142861946.4612955.1403815727424.JavaMail.root@uoguelph.ca>
In-Reply-To: <53AC8377.4060408@FreeBSD.org>

next in thread | previous in thread | raw e-mail | index | archive | help
Bryan Drewery wrote:
> On 2014-04-14 17:27, Kirk McKusick wrote:
> >> Date: Fri, 11 Apr 2014 21:03:57 -0500
> >> From: Bryan Drewery <bdrewery@freebsd.org>
> >> To: freebsd-fs@freebsd.org
> >> Subject: getdirentries cookies usage outside of UFS
> >>
> >> Recently I was working on a compat syscall for sys_getdirentries()
> >> that
> >> converts between our dirent and the FreeBSD dirent struct. We had
> >> never
> >> tried using this on TMPFS and when we did ran into weird issues
> >> (hence
> >> my recent commits to TMPFS to clarify some of the getdirentries()
> >> code).
> >> We were not using cookies, so I referenced the Linux compat module
> >> (linux_file.c getdents_common())
> >>
> >> I ran across this code:
> >>
> >>>     /*
> >>>      * When using cookies, the vfs has the option of reading from
> >>>      * a different offset than that supplied (UFS truncates the
> >>>      * offset to a block boundary to make sure that it never
> >>>      reads
> >>>      * partway through a directory entry, even if the directory
> >>>      * has been compacted).
> >>>      */
> >>>     while (len > 0 && ncookies > 0 && *cookiep <= off) {
> >>>             bdp = (struct dirent *) inp;
> >>>             len -= bdp->d_reclen;
> >>>             inp += bdp->d_reclen;
> >>>             cookiep++;
> >>>             ncookies--;
> >>>     }=20
> >>
> >>
> >> At first it looked innocuous but then it occurred to me it was the
> >> root
> >> of the issue I was having as it was eating my cookies based on
> >> their
> >> value, despite tmpfs cookies being random hash values that have no
> >> sequential relation. So I looked at how NFS was handling the same
> >> code
> >> and found this lovely hack from r216691:
> >>
> >>> not_zfs = strcmp(vp->v_mount->mnt_vfc->vfc_name, "zfs");
> >> =2E..
> >>>  while (cpos < cend && ncookies > 0 &&
> >>>      (dp->d_fileno == 0 || dp->d_type == DT_WHT ||
> >>>       (not_zfs != 0 && ((u_quad_t)(*cookiep)) <= toff))) {
> >>>          cpos += dp->d_reclen;
> >>>          dp = (struct dirent *)cpos;
> >>>          cookiep++;
> >>>          ncookies--;
> >>>  }
> >>
> >> I ended up doing the opposite, only running the code if getting
> >> dirents
> >> from "ufs".
> >>
> >> So there's multiple issue here.
> >>
> >> 1. NFS is broken on TMPFS. I can see why it's gone so long
> >> unnoticed,
> >> why would you do that? Still probably worth fixing.
> >>
> >> 2. Linux and SVR4 getdirentries() are both broken on TMPFS/ZFS. I
> >> am
> >> surprised Linux+ZFS has not been noticed by now. I am aware the
> >> SVR4
> >> is
> >> full of other bugs too. I ran across many just reviewing the
> >> getdirentries code alone.
> >>
> >> Do any other file systems besides UFS do this offset/cookie
> >> truncation/rewind? If UFS is the only one it may be acceptable to
> >> change
> >> this zfs check to !ufs and add it to the other modules. If we
> >> don't
> >> like
> >> that, or there are potentially other file systems doing this too,
> >> how
> >> about adding a flag to somewhere to indicate the file system has
> >> monotonically increasing offsets and needs this rewind support.
> >> I'm
> >> not
> >> sure where that is best done, struct vfsconf?
> >>
> >> Regards,
> >> Bryan Drewery
> >
> > This code is specific to UFS. I concur with your fix of making
> > it conditionl on UFS. I feel guilty for putting that code in
> > unconditionally in the first place. In my defense it was 1982
> > and UFS was the only filesystem :-)
> >
> > 	Kirk McKusick
> 
> Based on this discussion I have made the following patch against NFS.
> If
> we're comfortable with this approach I will apply the same logic to
> the
> Linux and SVR4 modules.
> 
> Mirrored at
> http://people.freebsd.org/~bdrewery/patches/nfs-zfs-ufs.patch
> 
> The patch changes 'not_zfs' to 'is_ufs' in the NFS code. Some of the
> code actually is ZFS-specific in regards to snapshot handling. So I
> have
> also added a 'is_zfs' variable to compare against for those cases.
> 
> I've removed the comments about ZFS in the UFS-only cases as the
> existing comment seems to cover it fine.
> 
> (Unrelated) This code, from r259845, in
> sys/fs/nfsserver/nfs_nfsdport.c
> seems odd to me:
> 
>          /*
>           * Check to see if entries in this directory can be safely
>           acquired
>           * via VFS_VGET() or if a switch to VOP_LOOKUP() is
>           required.
>           * ZFS snapshot directories need VOP_LOOKUP(), so that any
>           * automount of the snapshot directory that is required will
>           * be done.
>           * This needs to be done here for NFSv4, since NFSv4 never
>           does
>           * a VFS_VGET() for "." or "..".
>           */
> -       if (not_zfs == 0) {
> +       if (is_zfs == 1) {
>                  r = VFS_VGET(mp, at.na_fileid, LK_SHARED, &nvp);
>                  if (r == EOPNOTSUPP) {
>                          usevget = 0;
>                          cn.cn_nameiop = LOOKUP;
>                          cn.cn_lkflags = LK_SHARED | LK_RETRY;
>                          cn.cn_cred = nd->nd_cred;
>                          cn.cn_thread = p;
>                  } else if (r == 0)
>                          vput(nvp);
>          }
> 
> This fallback is also done later (from r199715), but not limited to
> ZFS. Would it make sense to not limit this first check to ZFS as
> well? I
> see that unionfs_vget also returns EOPNOTSUPP. A nullfs mount from
> ZFS
> served over NFS may also return EOPNOTSUPP, as odd as that is.
> 
Well, the answer is similar to the not_zfs one. I don't have any way
to test all the different file systems to make sure a change doesn't
introduce a regression for any of them. In this case, someone was
able to confirm that this changed fixed a problem for ZFS, so I
put the code in as ZFS specific.

The change further down was done by pjd@ and not me. Since he did
it for all file systems in the old server, I duplicated that.
(At that point, there was no ZFS specific stuff.)

Enabling it for all file system types introduces a little overhead,
but I would agree with that if it was known that it did not cause
a regression for any file system type, then it would be nice to
avoid a file system specific chunk.

Bottom line. If you test this for all file systems and/or check the
code for all file system types and are convinced it is safe to do
for all file system types, then I have no problem with you doing
a commit to change it.

rick
ps: I'll look at the patch, but I imagine it's fine with me.
    Basically, if Kirk thinks it's UFS specific, I believe him.

> 
> --
> Regards,
> Bryan Drewery
> 
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1142861946.4612955.1403815727424.JavaMail.root>