Date: Wed, 18 Dec 2013 19:57:03 -0500 (EST)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Jason Keltz <jas@cse.yorku.ca>
Cc: FreeBSD Filesystems <freebsd-fs@freebsd.org>, Steve Dickson <SteveD@redhat.com>
Subject: Re: mount ZFS snapshot on Linux system
Message-ID: <461272120.32852470.1387414623832.JavaMail.root@uoguelph.ca>
In-Reply-To: <52A7E53D.8000002@cse.yorku.ca>
Jason Keltz wrote:
> On 10/12/2013 7:21 PM, Rick Macklem wrote:
> > Jason Keltz wrote:
> >> I'm running FreeBSD 9.2 with various ZFS datasets.
> >> I export a dataset to a Linux system (RHEL64), and mount it. It
> >> works fine...
> >> When I try to access the ZFS snapshot directory on the Linux NFS
> >> client, things go weird.
> >>

Ok, thanks to Jason's help testing, I've been chasing this down.
(I also bumped into the comments in zfs_ctldir.c which are interesting.
They include:
 * File systems mounted ontop of the GFS nodes '.zfs/snapshot/<snapname>'
 * (ie: snapshots) are ZFS nodes and have their own unique vfs_t.
 * However, vnodes within these mounted on file systems have their v_vfsp
 * fields set to the head filesystem to make NFS happy (see
 * zfsctl_snapdir_lookup()). We VFS_HOLD the head filesystem's vfs_t
 * so that it cannot be freed until all snapshots have been unmounted.

Is this comment from upstream code or a part of the FreeBSD port?
The "make NFS happy" part seems questionable. It appears that it pretends
that the automounts of the snapshots are a part of the same file system
as .zfs/snapshot. The problem is that the i-node#s (or filenos, if you
prefer) are duplicated (the root of each snapshot is 4, for example).
This will cause a variety of problems for NFS clients, since filenos are
assumed to refer to one and only one file object within a file system.

I have a patch that I think does correctly return attributes to a Readdir
etc to clients, so that NFSv4 clients see them as separate file systems
(different fsids for each snapshot, and mounted_on_fileno != fileno for
the snapshot fake mounts). The patch also expands the cases where
Readdirplus in the NFS server switches from VFS_VGET() to VOP_LOOKUP() to
include Readdir of .zfs/snapshot, so it doesn't get attributes for the
fake mounted on vnode.

The current patch is at:
  http://people.freebsd.org/~rmacklem/nfsv4-zfs-snapshot.patch

Now, I have no idea what to do with NFSv3.
Since NFSv3 can't cross server mount points and expects a mount point to
exhibit the "fileno only represents one file object" property, NFSv3
shouldn't "see" anything in the snapshot directories when .zfs/snapshot
is mounted. (ie. .zfs/snapshot/20131209 would just be an empty dir.)
To get the contents of .zfs/snapshot/20131209 it would have to mount
.zfs/snapshot/20131209.
I'm not exactly sure what actually happens, but it isn't the above.

Any opinions on what is the correct handling of these for NFS?
(Or people willing to test the patch.)

Thanks, rick
ps: Pawel, I've added you as a cc, since you did the original switch
    from VFS_VGET()->VOP_LOOKUP() patch.

> >> With NFSv4:
> >>
> >> [jas@archive /]# cd /mnt/.zfs/snapshot
> >> [jas@archive snapshot]# ls
> >> 20131203 20131205 20131206 20131207 20131208 20131209 20131210
> >> [jas@archive snapshot]# cd 20131210
> >> 20131210: Not a directory.
> >>
> >> huh?
> >>
> >> [jas@archive snapshot]# ls -al
> >> total 77
> >> dr-xr-xr-x 9 root root 9 Dec 10 11:20 .
> >> dr-xr-xr-x 4 root root 4 Nov 28 15:42 ..
> >> drwxr-xr-x 380 root root 380 Dec 2 15:56 20131203
> >> drwxr-xr-x 381 root root 381 Dec 3 11:24 20131205
> >> drwxr-xr-x 381 root root 381 Dec 3 11:24 20131206
> >> drwxr-xr-x 381 root root 381 Dec 3 11:24 20131207
> >> drwxr-xr-x 381 root root 381 Dec 3 11:24 20131208
> >> drwxr-xr-x 381 root root 381 Dec 3 11:24 20131209
> >> drwxr-xr-x 381 root root 381 Dec 3 11:24 20131210
> >> [jas@archive snapshot]# stat *
> >> [jas@archive snapshot]# ls -al
> >> total 292
> >> dr-xr-xr-x 9 root root 9 Dec 10 11:20 .
> >> dr-xr-xr-x 4 root root 4 Nov 28 15:42 ..
> >> -rw-r--r-- 1 uax guest 137647 Mar 17 2010 20131203
> >> -rw-r--r-- 1 uax guest 865 Jul 31 2009 20131205
> >> -rw-r--r-- 1 uax guest 137647 Mar 17 2010 20131206
> >> -rw-r--r-- 1 uax guest 771 Jul 31 2009 20131207
> >> -rw-r--r-- 1 uax guest 778 Jul 31 2009 20131208
> >> -rw-r--r-- 1 uax guest 5281 Jul 31 2009 20131209
> >> -rw------- 1 btx faculty 893 Jul 13 20:21 20131210
> >>
> >> But it gets even more fun..
> >>
> >> # ls -ali
> >> total 205
> >> 2 dr-xr-xr-x 9 root root 9 Dec 10 11:20 .
> >> 1 dr-xr-xr-x 4 root root 4 Nov 28 15:42 ..
> >> 863 -rw-r--r-- 1 uax guest 137647 Mar 17 2010 20131203
> >> 4 drwxr-xr-x 381 root root 381 Dec 3 11:24 20131205
> >> 4 drwxr-xr-x 381 root root 381 Dec 3 11:24 20131206
> >> 4 drwxr-xr-x 381 root root 381 Dec 3 11:24 20131207
> >> 4 drwxr-xr-x 381 root root 381 Dec 3 11:24 20131208
> >> 4 drwxr-xr-x 381 root root 381 Dec 3 11:24 20131209
> >> 4 drwxr-xr-x 381 root root 381 Dec 3 11:24 20131210
> >>
> >> This is not a user id mapping issue because all the files in /mnt
> >> have the proper owner/groups, and I can access them there fine.
> >>
> >> I also tried explicitly exporting .zfs/snapshot. The result isn't
> >> any different.
> >>
> >> If I use nfs v3 it "works", but I'm seeing a whole lot of errors
> >> like these in syslog:
> >>
> >> Dec 10 12:32:28 jungle mountd[49579]: can't delete exports for
> >> /local/backup/home9/.zfs/snapshot/20131203: Invalid argument
> >> Dec 10 12:32:28 jungle mountd[49579]: can't delete exports for
> >> /local/backup/home9/.zfs/snapshot/20131209: Invalid argument
> >> Dec 10 12:32:28 jungle mountd[49579]: can't delete exports for
> >> /local/backup/home9/.zfs/snapshot/20131210: Invalid argument
> >> Dec 10 12:32:28 jungle mountd[49579]: can't delete exports for
> >> /local/backup/home9/.zfs/snapshot/20131207: Invalid argument
> >>
> >> It's not clear to me why this doesn't just "work".
> >>
> >> Can anyone provide any advice on debugging this?
> >>
> > As I think you already know, I know nothing about ZFS and never
> > use it.
> Yup! :)
> > Having said that, I suspect that there are filenos (i-node #s)
> > that are the same in the snapshot as in the parent file system
> > tree.
> >
> > The basic assumptions are:
> > - within a file system, all i-node# are unique (represent one file
> >   object only) and all file objects have the same fsid
> > - when the fsid changes, that indicates a file system boundary and
> >   fileno (i-node#s) can be reused in the subtree with a different
> >   fsid
> >
> > For NFSv3, the server should export single volumes only (all
> > objects have the same fsid and the filenos are unique). This is
> > indicated to the VFS by the use of the NOCROSSMOUNT flag on
> > VOP_LOOKUP() and friends.
> >
> > For NFSv4, the server does export multiple volumes and the boundary
> > is indicated by a change in fsid value.
> >
> > I suspect ZFS snapshots don't obey the above in some way, but that
> > is just a hunch.
> >
> > Now, how to narrow this down...
> > - Do the above tests (both NFSv4 and NFSv3) and capture the
> >   packets, then look at them in wireshark. In particular, look at
> >   the fileid numbers and fsid values for the various directories
> >   under .zfs.
>
> I gave this a shot, but I haven't used wireshark to capture NFS
> traffic before, so if I need to provide additional details, let me
> know..
>
> NFSv4:
>
> For /mnt/.zfs/snapshot/20131203:
> fileid=4
> fsid4.major=1446349656
> fsid4.minor=222
>
> For /mnt/.zfs/snapshot/20131205:
> fileid=4
> fsid4.major=1845998066
> fsid4.minor=222
>
> For /mnt/jas:
> fileid=144
> fsid4.major=597946950
> fsid4.minor=222
>
> For /mnt/jas1:
> fileid=338
> fsid4.major=597946950
> fsid4.minor=222
>
> So fsid is the same for all the different "data" directories, which
> is what I would expect given what you said. I guess each snapshot is
> seen as a unique filesystem...
> but then a repeating inode in different filesystems shouldn't be a
> problem...
>
> NFSv3:
>
> For /mnt/.zfs/snapshot/20131203:
> fileid=4
> fsid=0x0000000056358b58
>
> For /mnt/.zfs/snapshot/20131205:
> fileid=4
> fsid=0x000000006e07b1f2
>
> For /mnt/jas
> fileid=144
> fsid=0x0000000023a3f246
>
> For /mnt/jas1:
> fileid=338
> fsid=0x0000000023a3f246
>
> Here, it seems it's the same, even though it's NFSv3... hmm.
>
> > - Try mounting the individual snapshot directory, like
> >   .zfs/snapshot/20131209 and see if that works (for both NFSv3
> >   and NFSv4).
>
> Hmm .. I tried this:
>
> /local/backup/home9/.zfs/snapshot/20131203 -ro archive-mrpriv.cs.yorku.ca
> V4: /
>
> ... but syslog reports:
>
> Dec 10 22:28:22 jungle mountd[85405]: can't export
> /local/backup/home9/.zfs/snapshot/20131203
>
> ... and of course I can't mount from either v3/v4.
>
> On the other hand, I kept it as:
>
> /local/backup/home9 -ro archive-mrpriv.cs.yorku.ca
> V4:/
>
> ... and was able to NFSv4 mount
> /local/backup/home9/.zfs/snapshot/20131203, and this does indeed
> work.
>
> > - Try doing the mounts with a FreeBSD client and see if you get the
> >   same behaviour?
> I found this:
> http://forums.freenas.org/threads/mounting-snapshot-directory-using-nfs-from-linux-broken.6060/
> .. implies it will work from FreeBSD/Nexenta, just not Linux.
> Found this as well:
> https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-discuss/lKyfYsjPMNM
>
> Jason.
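
The fileid and fsid values quoted in this thread can be checked
mechanically. What follows is a minimal Python sketch, not anything
taken from the NFS client or server code: it assumes a simplified
client that indexes file objects either by fileid alone or by the
(fsid, fileid) pair, and find_collisions is a hypothetical helper name
introduced only for illustration. The tuples are the NFSv4 values
reported from the packet capture.

```python
from collections import defaultdict

# (path, fsid4.major, fileid) as reported in the NFSv4 capture in this thread.
OBSERVED = [
    ("/mnt/.zfs/snapshot/20131203", 1446349656, 4),
    ("/mnt/.zfs/snapshot/20131205", 1845998066, 4),
    ("/mnt/jas", 597946950, 144),
    ("/mnt/jas1", 597946950, 338),
]


def find_collisions(entries, key):
    """Group paths by key(fsid, fileid); return only keys shared by 2+ paths.

    Hypothetical helper: it models a client cache that treats `key` as the
    identity of a file object.
    """
    groups = defaultdict(list)
    for path, fsid, fileid in entries:
        groups[key(fsid, fileid)].append(path)
    return {k: paths for k, paths in groups.items() if len(paths) > 1}


# Identity = fileid only (as if everything lived in one file system).
by_fileid = find_collisions(OBSERVED, lambda fsid, fileid: fileid)

# Identity = (fsid, fileid), i.e. filenos may repeat across an fsid boundary.
by_pair = find_collisions(OBSERVED, lambda fsid, fileid: (fsid, fileid))

print("collisions keyed by fileid:", by_fileid)
print("collisions keyed by (fsid, fileid):", by_pair)
```

With these values, keying by fileid alone groups the two snapshot roots
together (both report fileid 4), while keying by (fsid, fileid) keeps
them apart. That is consistent with the assumption stated in the thread
that a fileno may be reused only across an fsid boundary.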