Date:      Wed, 18 Dec 2013 19:57:03 -0500 (EST)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Jason Keltz <jas@cse.yorku.ca>
Cc:        FreeBSD Filesystems <freebsd-fs@freebsd.org>, Steve Dickson <SteveD@redhat.com>
Subject:   Re: mount ZFS snapshot on Linux system
Message-ID:  <461272120.32852470.1387414623832.JavaMail.root@uoguelph.ca>
In-Reply-To: <52A7E53D.8000002@cse.yorku.ca>

Jason Keltz wrote:
> On 10/12/2013 7:21 PM, Rick Macklem wrote:
> > Jason Keltz wrote:
> >> I'm running FreeBSD 9.2 with various ZFS datasets.
> >> I export a dataset to a Linux system (RHEL64), and mount it.  It
> >> works
> >> fine...
> >> When I try to access the ZFS snapshot directory on the Linux NFS
> >> client,
> >> things go weird.
> >>
Ok, thanks to Jason's help with testing, I've been chasing this down.
I also ran into the comments in zfs_ctldir.c, which are interesting.
They include:
 * File systems mounted ontop of the GFS nodes '.zfs/snapshot/<snapname>'
 * (ie: snapshots) are ZFS nodes and have their own unique vfs_t.
 * However, vnodes within these mounted on file systems have their v_vfsp
 * fields set to the head filesystem to make NFS happy (see
 * zfsctl_snapdir_lookup()). We VFS_HOLD the head filesystem's vfs_t
 * so that it cannot be freed until all snapshots have been unmounted.
Is this comment from upstream code or a part of the FreeBSD port?

The "make NFS happy" part seems questionable. It appears that it pretends
that the automounts of the snapshots are a part of the same file system
as .zfs/snapshot. The problem is that the i-node#s (or filenos, if you
prefer) are duplicated (the root of each snapshot is 4, for example).
This will cause a variety of problems for NFS clients, since filenos are
assumed to refer to one and only one file object within a file system.
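
To make the collision concrete, here is a minimal sketch in Python (not part
of any patch) that uses only the fileid/fsid values Jason captured with
wireshark, quoted further down. The helper name collisions() is made up for
illustration. It groups the captured objects first by fileid alone (what a
client effectively does if it is told everything lives in one file system)
and then by the (fsid, fileid) pair:

```python
from collections import defaultdict

# (path, fsid, fileid) taken from the NFSv4 capture quoted in this message.
CAPTURED = [
    ("/mnt/.zfs/snapshot/20131203", 1446349656, 4),
    ("/mnt/.zfs/snapshot/20131205", 1845998066, 4),
    ("/mnt/jas", 597946950, 144),
    ("/mnt/jas1", 597946950, 338),
]

def collisions(entries, key):
    """Group paths by key(fsid, fileid); keep only keys shared by >1 path."""
    groups = defaultdict(list)
    for path, fsid, fileid in entries:
        groups[key(fsid, fileid)].append(path)
    return {k: v for k, v in groups.items() if len(v) > 1}

# Keyed on fileid alone: the two snapshot roots are indistinguishable.
by_fileid = collisions(CAPTURED, lambda fsid, fileid: fileid)
# Keyed on (fsid, fileid): every object is distinct.
by_pair = collisions(CAPTURED, lambda fsid, fileid: (fsid, fileid))

print(by_fileid)
print(by_pair)
```

With these four entries, only the fileid-only grouping reports a clash (the
two snapshot roots, both fileid 4); the pair grouping reports none.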

I have a patch that I think correctly returns attributes for Readdir etc. to
clients, so that NFSv4 clients see the snapshots as separate file systems
(a different fsid for each snapshot, and mounted_on_fileno != fileno for the
snapshot fake mounts). The patch also expands the cases where Readdirplus in
the NFS server switches from VFS_VGET() to VOP_LOOKUP() to include Readdir of
.zfs/snapshot, so it doesn't get attributes for the fake mounted-on vnode.

The current patch is at:
http://people.freebsd.org/~rmacklem/nfsv4-zfs-snapshot.patch

Now, I have no idea what to do with NFSv3. Since NFSv3 can't cross
server mount points and expects a mount point to have the property that a
fileno represents only one file object, NFSv3 shouldn't "see" anything
in the snapshot directories when .zfs/snapshot is mounted (i.e.
.zfs/snapshot/20131209 would just be an empty dir).
To get the contents of .zfs/snapshot/20131209, the client would have to
mount .zfs/snapshot/20131209.
I'm not exactly sure what actually happens, but it isn't the above.

Any opinions on what is the correct handling of these for NFS?
(Or people willing to test the patch.)
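
For anyone testing, here is a rough client-side check, sketched in Python. It
assumes the export is mounted at /mnt as in Jason's example; the function name
duplicate_ids() is made up for illustration. It stats every entry in a
directory and reports names that share the same (st_dev, st_ino) pair. My
understanding is that a Linux client gives each file system it recognizes its
own st_dev, so snapshot roots that all show i-node 4 under the same st_dev
would suggest the client did not see a file system boundary.

```python
import os
from collections import defaultdict

def duplicate_ids(directory):
    """Return {(st_dev, st_ino): [names]} for entries sharing an identity."""
    groups = defaultdict(list)
    with os.scandir(directory) as entries:
        for entry in entries:
            # os.stat() on the path, the same information stat(1) reports.
            st = os.stat(entry.path)
            groups[(st.st_dev, st.st_ino)].append(entry.name)
    # Only identities claimed by more than one name are of interest.
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}
```

On the client this would be called as duplicate_ids("/mnt/.zfs/snapshot").
Hard links in an ordinary directory also show up as duplicates, so the check
is only meaningful for a directory of snapshot roots.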

Thanks, rick
ps: Pawel, I've added you as a cc, since you did the original
    switch from VFS_VGET()->VOP_LOOKUP() patch.

> >> With NFSv4:
> >>
> >> [jas@archive /]# cd /mnt/.zfs/snapshot
> >> [jas@archive snapshot]# ls
> >> 20131203  20131205  20131206  20131207  20131208  20131209
> >>  20131210
> >> [jas@archive snapshot]# cd 20131210
> >> 20131210: Not a directory.
> >>
> >> huh?
> >>
> >> [jas@archive snapshot]# ls -al
> >> total 77
> >> dr-xr-xr-x   9 root root   9 Dec 10 11:20 .
> >> dr-xr-xr-x   4 root root   4 Nov 28 15:42 ..
> >> drwxr-xr-x 380 root root 380 Dec  2 15:56 20131203
> >> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131205
> >> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131206
> >> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131207
> >> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131208
> >> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131209
> >> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131210
> >> [jas@archive snapshot]# stat *
> >> [jas@archive snapshot]# ls -al
> >> total 292
> >> dr-xr-xr-x 9 root      root         9 Dec 10 11:20 .
> >> dr-xr-xr-x 4 root      root         4 Nov 28 15:42 ..
> >> -rw-r--r-- 1 uax    guest   137647 Mar 17  2010 20131203
> >> -rw-r--r-- 1 uax    guest         865 Jul 31  2009 20131205
> >> -rw-r--r-- 1 uax    guest   137647 Mar 17  2010 20131206
> >> -rw-r--r-- 1 uax    guest         771 Jul 31  2009 20131207
> >> -rw-r--r-- 1 uax    guest         778 Jul 31  2009 20131208
> >> -rw-r--r-- 1 uax     guest       5281 Jul 31  2009 20131209
> >> -rw------- 1 btx      faculty      893 Jul 13 20:21 20131210
> >>
> >> But it gets even more fun..
> >>
> >> # ls -ali
> >> total 205
> >>     2 dr-xr-xr-x   9 root      root       9 Dec 10 11:20 .
> >>     1 dr-xr-xr-x   4 root      root       4 Nov 28 15:42 ..
> >> 863 -rw-r--r--   1 uax     guest 137647 Mar 17  2010 20131203
> >>     4 drwxr-xr-x 381 root      root     381 Dec  3 11:24 20131205
> >>     4 drwxr-xr-x 381 root      root     381 Dec  3 11:24 20131206
> >>     4 drwxr-xr-x 381 root      root     381 Dec  3 11:24 20131207
> >>     4 drwxr-xr-x 381 root      root     381 Dec  3 11:24 20131208
> >>     4 drwxr-xr-x 381 root      root     381 Dec  3 11:24 20131209
> >>     4 drwxr-xr-x 381 root      root     381 Dec  3 11:24 20131210
> >>
> >> This is not a user id mapping issue because all the files in /mnt
> >> have
> >> the proper owner/groups, and I can access them there fine.
> >>
> >> I also tried explicitly exporting .zfs/snapshot.  The result isn't
> >> any
> >> different.
> >>
> >> If I use nfs v3 it "works", but I'm seeing a whole lot of errors
> >> like
> >> these in syslog:
> >>
> >> Dec 10 12:32:28 jungle mountd[49579]: can't delete exports for
> >> /local/backup/home9/.zfs/snapshot/20131203: Invalid argument
> >> Dec 10 12:32:28 jungle mountd[49579]: can't delete exports for
> >> /local/backup/home9/.zfs/snapshot/20131209: Invalid argument
> >> Dec 10 12:32:28 jungle mountd[49579]: can't delete exports for
> >> /local/backup/home9/.zfs/snapshot/20131210: Invalid argument
> >> Dec 10 12:32:28 jungle mountd[49579]: can't delete exports for
> >> /local/backup/home9/.zfs/snapshot/20131207: Invalid argument
> >>
> >> It's not clear to me why this doesn't just "work".
> >>
> >> Can anyone provide any advice on debugging this?
> >>
> > As I think you already know, I know nothing about ZFS and never
> > use it.
> Yup! :)
> > Having said that, I suspect that there are filenos (i-node #s)
> > that are the same in the snapshot as in the parent file system
> > tree.
> >
> > The basic assumptions are:
> > - within a file system, all i-node# are unique (represent one file
> >    object only) and all file objects have the same fsid
> > - when the fsid changes, that indicates a file system boundary and
> >    fileno (i-node#s) can be reused in the subtree with a different
> >    fsid
> >
> > For NFSv3, the server should export single volumes only (all
> > objects
> > have the same fsid and the filenos are unique). This is indicated
> > to
> > the VFS by the use of the NOCROSSMOUNT flag on VOP_LOOKUP() and
> > friends.
> >
> > For NFSv4, the server does export multiple volumes and the boundary
> > is indicated by a change in fsid value.
> >
> > I suspect ZFS snapshots don't obey the above in some way, but that
> > is
> > just a hunch.
> >
> > Now, how to narrow this down...
> > - Do the above tests (both NFSv4 and NFSv3) and capture the
> > packets,
> >    then look at them in wireshark. In particular, look at the
> >    fileid numbers
> >    and fsid values for the various directories under .zfs.
> 
> I gave this a shot, but I haven't used wireshark to capture NFS
> traffic
> before, so if I need to provide additional details, let me know..
> 
> NFSv4:
> 
> For /mnt/.zfs/snapshot/20131203:
> fileid=4
> fsid4.major=1446349656
> fsid4.minor=222
> 
> For /mnt/.zfs/snapshot/20131205:
> fileid=4
> fsid4.major=1845998066
> fsid4.minor=222
> 
> For /mnt/jas:
> fileid=144
> fsid4.major=597946950
> fsid4.minor=222
> 
> For /mnt/jas1:
> fileid=338
> fsid4.major=597946950
> fsid4.minor=222
> 
> So fsid is the same for all the different "data" directories, which
> is
> what I would expect given what you said.  I  guess each snapshot is
> seen
> as a unique filesystem...  but then a repeating inode in different
> filesystems shouldn't be a problem...
> 
> NFSv3:
> 
> For /mnt/.zfs/snapshot/20131203:
> fileid=4
> fsid=0x0000000056358b58
> 
> For /mnt/.zfs/snapshot/20131205:
> fileid=4
> fsid=0x000000006e07b1f2
> 
> For /mnt/jas
> fileid=144
> fsid=0x0000000023a3f246
> 
> For /mnt/jas1:
> fileid=338
> fsid=0x0000000023a3f246
> 
> Here, it seems it's the same, even though it's NFSv3... hmm.
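
Converting the NFSv3 hex fsids to decimal shows they are the same numbers as
the NFSv4 fsid4.major values above, so each snapshot reports its own fsid
over both protocol versions. A quick check in Python, using only the values
quoted in this message (the dictionary names are made up for illustration):

```python
# NFSv4 fsid4.major values from the capture, keyed by the last path element.
V4_MAJOR = {
    "20131203": 1446349656,
    "20131205": 1845998066,
    "jas": 597946950,
    "jas1": 597946950,
}

# NFSv3 fsid values from the capture, as shown by wireshark in hex.
V3_FSID = {
    "20131203": 0x0000000056358B58,
    "20131205": 0x000000006E07B1F2,
    "jas": 0x0000000023A3F246,
    "jas1": 0x0000000023A3F246,
}

# True for a name when both protocol versions report the same number.
same = {name: V4_MAJOR[name] == V3_FSID[name] for name in V4_MAJOR}
print(same)
```

All four entries compare equal.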
> 
> 
> > - Try mounting the individual snapshot directory, like
> >     .zfs/snapshot/20131209 and see if that works (for both NFSv3
> >     and NFSv4).
> 
> Hmm .. I tried this:
> 
> /local/backup/home9/.zfs/snapshot/20131203  -ro
> archive-mrpriv.cs.yorku.ca
> V4: /
> 
> ... but syslog reports:
> 
> Dec 10 22:28:22 jungle mountd[85405]: can't export
> /local/backup/home9/.zfs/snapshot/20131203
> 
> ... and of course I can't mount from either v3/v4.
> 
> On the other hand, I kept it as:
> 
> /local/backup/home9 -ro archive-mrpriv.cs.yorku.ca
> V4:/
> 
> ... and was able to NFSv4 mount
> /local/backup/home9/.zfs/snapshot/20131203, and this does indeed
> work.
> 
> > - Try doing the mounts with a FreeBSD client and see if you get the
> > same
> >    behaviour?
> I found this:
> http://forums.freenas.org/threads/mounting-snapshot-directory-using-nfs-from-linux-broken.6060/
> .. implies it will work from FreeBSD/Nexenta, just not Linux.
> Found this as well:
> https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-discuss/lKyfYsjPMNM
> 
> Jason.
> 
> 


