Date:      Thu, 12 Dec 2013 15:15:33 -0500
From:      Jason Keltz <jas@cse.yorku.ca>
To:        Rick Macklem <rmacklem@uoguelph.ca>
Cc:        FreeBSD Filesystems <freebsd-fs@freebsd.org>, Steve Dickson <SteveD@redhat.com>
Subject:   Re: mount ZFS snapshot on Linux system
Message-ID:  <52AA1965.9080709@cse.yorku.ca>
In-Reply-To: <116973401.29503791.1386804115064.JavaMail.root@uoguelph.ca>
References:  <116973401.29503791.1386804115064.JavaMail.root@uoguelph.ca>

On 12/11/2013 06:21 PM, Rick Macklem wrote:
> Jason Keltz wrote:
>> On 10/12/2013 7:21 PM, Rick Macklem wrote:
>>> Jason Keltz wrote:
>>>> I'm running FreeBSD 9.2 with various ZFS datasets.
>>>> I export a dataset to a Linux system (RHEL64), and mount it.  It works
>>>> fine...
>>>> When I try to access the ZFS snapshot directory on the Linux NFS client,
>>>> things go weird.
>>>>
>>>> With NFSv4:
>>>>
>>>> [jas@archive /]# cd /mnt/.zfs/snapshot
>>>> [jas@archive snapshot]# ls
>>>> 20131203  20131205  20131206  20131207  20131208  20131209  20131210
>>>> [jas@archive snapshot]# cd 20131210
>>>> 20131210: Not a directory.
>>>>
>>>> huh?
>>>>
>>>> [jas@archive snapshot]# ls -al
>>>> total 77
>>>> dr-xr-xr-x   9 root root   9 Dec 10 11:20 .
>>>> dr-xr-xr-x   4 root root   4 Nov 28 15:42 ..
>>>> drwxr-xr-x 380 root root 380 Dec  2 15:56 20131203
>>>> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131205
>>>> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131206
>>>> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131207
>>>> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131208
>>>> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131209
>>>> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131210
>>>> [jas@archive snapshot]# stat *
>>>> [jas@archive snapshot]# ls -al
>>>> total 292
>>>> dr-xr-xr-x 9 root      root         9 Dec 10 11:20 .
>>>> dr-xr-xr-x 4 root      root         4 Nov 28 15:42 ..
>>>> -rw-r--r-- 1 uax    guest   137647 Mar 17  2010 20131203
>>>> -rw-r--r-- 1 uax    guest         865 Jul 31  2009 20131205
>>>> -rw-r--r-- 1 uax    guest   137647 Mar 17  2010 20131206
>>>> -rw-r--r-- 1 uax    guest         771 Jul 31  2009 20131207
>>>> -rw-r--r-- 1 uax    guest         778 Jul 31  2009 20131208
>>>> -rw-r--r-- 1 uax     guest       5281 Jul 31  2009 20131209
>>>> -rw------- 1 btx      faculty      893 Jul 13 20:21 20131210
>>>>
>>>> But it gets even more fun..
>>>>
>>>> # ls -ali
>>>> total 205
>>>>      2 dr-xr-xr-x   9 root      root       9 Dec 10 11:20 .
>>>>      1 dr-xr-xr-x   4 root      root       4 Nov 28 15:42 ..
>>>> 863 -rw-r--r--   1 uax     guest 137647 Mar 17  2010 20131203
>>>>      4 drwxr-xr-x 381 root      root     381 Dec  3 11:24 20131205
>>>>      4 drwxr-xr-x 381 root      root     381 Dec  3 11:24 20131206
>>>>      4 drwxr-xr-x 381 root      root     381 Dec  3 11:24 20131207
>>>>      4 drwxr-xr-x 381 root      root     381 Dec  3 11:24 20131208
>>>>      4 drwxr-xr-x 381 root      root     381 Dec  3 11:24 20131209
>>>>      4 drwxr-xr-x 381 root      root     381 Dec  3 11:24 20131210
>>>>
>>>> This is not a user id mapping issue because all the files in /mnt have
>>>> the proper owner/groups, and I can access them there fine.
>>>>
>>>> I also tried explicitly exporting .zfs/snapshot.  The result isn't any
>>>> different.
>>>>
>>>> If I use nfs v3 it "works", but I'm seeing a whole lot of errors like
>>>> these in syslog:
>>>>
>>>> Dec 10 12:32:28 jungle mountd[49579]: can't delete exports for /local/backup/home9/.zfs/snapshot/20131203: Invalid argument
>>>> Dec 10 12:32:28 jungle mountd[49579]: can't delete exports for /local/backup/home9/.zfs/snapshot/20131209: Invalid argument
>>>> Dec 10 12:32:28 jungle mountd[49579]: can't delete exports for /local/backup/home9/.zfs/snapshot/20131210: Invalid argument
>>>> Dec 10 12:32:28 jungle mountd[49579]: can't delete exports for /local/backup/home9/.zfs/snapshot/20131207: Invalid argument
>>>>
>>>> It's not clear to me why this doesn't just "work".
>>>>
>>>> Can anyone provide any advice on debugging this?
>>>>
>>> As I think you already know, I know nothing about ZFS and never
>>> use it.
>> Yup! :)
>>> Having said that, I suspect that there are filenos (i-node #s)
>>> that are the same in the snapshot as in the parent file system
>>> tree.
>>>
>>> The basic assumptions are:
>>> - within a file system, all i-node# are unique (represent one file
>>>     object only) and all file objects have the same fsid
>>> - when the fsid changes, that indicates a file system boundary and
>>>     fileno (i-node#s) can be reused in the subtree with a different
>>>     fsid
>>>
>>> For NFSv3, the server should export single volumes only (all objects
>>> have the same fsid and the filenos are unique). This is indicated to
>>> the VFS by the use of the NOCROSSMOUNT flag on VOP_LOOKUP() and friends.
>>>
>>> For NFSv4, the server does export multiple volumes and the boundary
>>> is indicated by a change in fsid value.
>>>
>>> I suspect ZFS snapshots don't obey the above in some way, but that is
>>> just a hunch.
>>>
>>> Now, how to narrow this down...
>>> - Do the above tests (both NFSv4 and NFSv3) and capture the packets,
>>>     then look at them in wireshark. In particular, look at the fileid
>>>     numbers and fsid values for the various directories under .zfs.
>> I gave this a shot, but I haven't used wireshark to capture NFS traffic
>> before, so if I need to provide additional details, let me know..
>>
>> NFSv4:
>>
>> For /mnt/.zfs/snapshot/20131203:
>> fileid=4
>> fsid4.major=1446349656
>> fsid4.minor=222
>>
>> For /mnt/.zfs/snapshot/20131205:
>> fileid=4
>> fsid4.major=1845998066
>> fsid4.minor=222
>>
>> For /mnt/jas:
>> fileid=144
>> fsid4.major=597946950
>> fsid4.minor=222
>>
>> For /mnt/jas1:
>> fileid=338
>> fsid4.major=597946950
>> fsid4.minor=222
>>
>> So fsid is the same for all the different "data" directories, which is
>> what I would expect given what you said.  I guess each snapshot is seen
>> as a unique filesystem...  but then a repeating inode in different
>> filesystems shouldn't be a problem...
>>
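The "unique within a file system" assumption can be checked against the four NFSv4 values listed above. A minimal sketch in Python, assuming those values were transcribed correctly from the capture; the helper name find_collisions is made up for illustration and is not part of any NFS tool:

```python
# (path, fsid4.major, fileid) as listed in the NFSv4 capture above.
captured = [
    ("/mnt/.zfs/snapshot/20131203", 1446349656, 4),
    ("/mnt/.zfs/snapshot/20131205", 1845998066, 4),
    ("/mnt/jas", 597946950, 144),
    ("/mnt/jas1", 597946950, 338),
]


def find_collisions(entries):
    """Return groups of paths that share the same (fsid, fileid) pair."""
    seen = {}
    for path, fsid, fileid in entries:
        seen.setdefault((fsid, fileid), []).append(path)
    return {key: paths for key, paths in seen.items() if len(paths) > 1}


# fileid 4 repeats, but never under the same fsid, so nothing collides.
print(find_collisions(captured))  # {}
```

So, on these numbers alone, the repeated fileid 4 does not break the assumption, because each snapshot reports its own fsid.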
> Yes, it appears that each snapshot is represented as a different file
> system. As such, NFSv4 should work for these, but there is an additional
> property of the "root" of each of these (20131203, ...).
> When the directory .zfs/snapshot is read, the fileno for 20131203 should
> be different than the fileno returned by VOP_GETATTR()/stat() for "20131203".
> (The old "mounted-on" vs "root-of-mounted-fs" vnodes which you get for a
>   "mount point".)
> For NFSv4, the server returns the fileno in the VOP_READDIR() dirent as a
> separate attribute called mounted_on_fileid vs the value returned by VOP_GETATTR()
> as the fileid attribute.
> If the value of these 2 attributes is the same, it is not a "mount point".
>
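The same "mounted-on" vs "root-of-mounted-fs" distinction can be looked at from the client side without a packet capture. A rough sketch in Python, assuming the client surfaces the mounted-on fileno as the inode number returned by readdir and the root fileno through stat; whether the Linux NFSv4 client does so for this mount is exactly what is in question, so treat the output as a hint only. The function names are hypothetical:

```python
import os


def looks_like_mount_point(readdir_ino, stat_ino, parent_dev, entry_dev):
    """Apply the rule quoted above: differing filenos (or a different
    device, the local analogue of a changed fsid) suggest a mount point."""
    return readdir_ino != stat_ino or parent_dev != entry_dev


def scan_directory(directory):
    """Return (name, readdir_ino, stat_ino, suspected_mount_point) rows."""
    parent_dev = os.stat(directory).st_dev
    rows = []
    with os.scandir(directory) as entries:
        for entry in entries:
            st = os.lstat(entry.path)        # attributes of the object itself
            readdir_ino = entry.inode()      # inode as reported by readdir
            flag = looks_like_mount_point(
                readdir_ino, st.st_ino, parent_dev, st.st_dev
            )
            rows.append((entry.name, readdir_ino, st.st_ino, flag))
    return sorted(rows)


if __name__ == "__main__":
    # "." is only a placeholder; on the client this would be /mnt/.zfs/snapshot.
    for row in scan_directory("."):
        print(row)
```

Running it once right after mounting and once after the directories have turned into "files" would show whether the numbers the client reports change between the two states.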
> So, maybe you could take another look at the packet capture in wireshark
> and see what the fileid and mounted_on_fileid attributes are?

Unfortunately, I didn't save the log, but it was easy enough to regenerate.

But before we go there, I've spent a lot of time experimenting with 
this, so I can say...

If I NFSv4 mount nfs-server:/local/backup/home9 to /mnt, then I:
cd /mnt/.zfs/snapshot/20131203
... it works great!  I can change into any user directory, list files, etc.
If I then:
cd /mnt/.zfs/snapshot/20131205
.. it also works great!
But... if I cd into /mnt/.zfs/snapshot, the free ride is over...
all the snapshot directories appear as files and the problem is there.

... unless I unmount and remount, in which case I can repeat.

I also found that a change of kernel from 2.6.32-358.14.1.el6  (the 
kernel I was running with RHEL6.4) to 2.6.32-431.el6 (the kernel that 
comes with RHEL6.5) does actually change something important....

If I mount nfs-server:/local/backup/home9 and try to change into 
"/mnt/.zfs/snapshot" with the new kernel, I still have the problem.
Likewise, if I try to mount nfs-server:/local/backup/home9/.zfs, and 
change into "/mnt/snapshot",  I also have the problem.
If I mount nfs-server:/local/backup/home9/.zfs/snapshot and change into 
"/mnt" with the RHEL 6.4 kernel in place, I still have the problem.
However, if I do the same mount with the newer kernel, it now works.  I 
can "ls" and see the snapshot directories.  I can change into any of 
them, then "cd .." and change into another one.
I tested this on two systems: one where I installed the entire 6.5 
upgrade, and another where I installed only the 6.5 kernel on the 
6.4 system, so it seems related to the kernel.
It's still not clear why I can't just mount 
nfs-server:/local/backup/home9 on RHEL6.5, and the NFSv4 server figures 
it out.  I did try from another FreeBSD client, and I can mount the tree 
at any point, and the NFS server is happy.  This makes me believe it's 
probably a RHEL NFSv4 bug.

Here are the numbers:

NFSv4:

So, if I try to access the snapshot path directly, on the way ...

.zfs:
V4 LOOKUP
fsid.major: 597946950
fileid: 1
fattr owner/group are root - correct

snapshot:
V4 LOOKUP
fsid.major: 597946950
fileid: 2
fattr owner/group are root - correct

If I access /.zfs/snapshot/20131203 directly...:

20131203:
V4 LOOKUP
fsid.major: 1446349656
fileid: 4
fattr owner/group are root - correct

V4 READDIR snapshot, 20131203 entry:
fsid.major: 597946950 <-- ????
fattr4_fileid: 863
fattr4_owner/group refers to a group on our system (the one displayed in 
ls sometimes)..
FATTR4_MOUNTED_ON_FILEID: 0x000000000000035f

But if I ls /mnt/.zfs/snapshot:

V4 LOOKUP:
20131203:
fsid.major: 597946950
fileid: 4

V4 READDIR:
fsid4.major: 597946950
fattr4_fileid: 863
fattr4_mounted_on_fileid: 0x000000000000035f
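
To make the comparison explicit, here is the "same value means not a mount point" rule from earlier in this message applied to the numbers above. This is only arithmetic on the captured values, not a diagnosis; the function name is made up for illustration:

```python
def is_mount_point(fileid, mounted_on_fileid):
    """Rule quoted earlier: equal values mean it is not a mount point."""
    return fileid != mounted_on_fileid


readdir_fileid = 863
readdir_mounted_on = int("0x000000000000035f", 16)
lookup_fileid = 4  # fileid returned by the LOOKUP of 20131203

print(readdir_mounted_on)                                  # 863
print(is_mount_point(readdir_fileid, readdir_mounted_on))  # False
print(is_mount_point(lookup_fileid, readdir_mounted_on))   # True
```

In other words, the READDIR reply carries fileid == mounted_on_fileid (863 both times), so by that rule the entry does not look like a mount point, while the fileid from LOOKUP (4) differs from it.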

>> NFSv3:
>>
>> For /mnt/.zfs/snapshot/20131203:
>> fileid=4
>> fsid=0x0000000056358b58
>>
>> For /mnt/.zfs/snapshot/20131205:
>> fileid=4
>> fsid=0x000000006e07b1f2
>>
>> For /mnt/jas
>> fileid=144
>> fsid=0x0000000023a3f246
>>
>> For /mnt/jas1:
>> fileid=338
>> fsid=0x0000000023a3f246
>>
>> Here, it seems it's the same, even though it's NFSv3... hmm.
>>
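For what it's worth, the NFSv3 fsid values (shown in hex) can be compared with the NFSv4 fsid4.major values (shown in decimal) listed earlier in this message. A small sketch in Python, assuming both sets were copied correctly from wireshark:

```python
# NFSv3 fsid (hex) and NFSv4 fsid4.major (decimal), keyed by path.
nfsv3_fsid = {
    "/mnt/.zfs/snapshot/20131203": "0x0000000056358b58",
    "/mnt/.zfs/snapshot/20131205": "0x000000006e07b1f2",
    "/mnt/jas": "0x0000000023a3f246",
    "/mnt/jas1": "0x0000000023a3f246",
}
nfsv4_major = {
    "/mnt/.zfs/snapshot/20131203": 1446349656,
    "/mnt/.zfs/snapshot/20131205": 1845998066,
    "/mnt/jas": 597946950,
    "/mnt/jas1": 597946950,
}

# True where the two protocol versions report the same number.
matches = {
    path: int(hex_value, 16) == nfsv4_major[path]
    for path, hex_value in nfsv3_fsid.items()
}
for path, same in matches.items():
    print(path, same)
```

All four compare equal, so both protocol versions are reporting the same fsid for each directory, and each snapshot has an fsid different from the parent dataset in both cases.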
>>
>>> - Try mounting the individual snapshot directory, like
>>>      .zfs/snapshot/20131209 and see if that works (for both NFSv3
>>>      and NFSv4).
>> Hmm .. I tried this:
>>
>> /local/backup/home9/.zfs/snapshot/20131203  -ro archive-mrpriv.cs.yorku.ca
>> V4: /
>>
>> ... but syslog reports:
>>
>> Dec 10 22:28:22 jungle mountd[85405]: can't export /local/backup/home9/.zfs/snapshot/20131203
>>
> mountd will do a VFS_CHECKEXP(), which seems to fail for
> these (which also explains the error messages). To be honest,
> with these failing, remote access should fail.
>
> Also, since NFSv3 exported volumes should not cross
> "mount points" (anywhere the fsid changes), all a mount
> above .zfs/snapshot/20131203 should get are a bunch of
> empty directories called 20131203,...
I tried again just in case I missed something...
nfs-server:/local/backup/home9 on /mnt type nfs (ro,vers=3,addr=172.16.2.26)
I can change into /mnt/.zfs/snapshot/20131203/jas and list the 
directory, or less a file.

> For example, if in the UFS world with a separate
> file systems /sub1 and /sub1/sub2 with both exported:
> - an NFSv3 mount of /sub1 on /mnt would see an empty
>    directory "sub2" when looking in /mnt. (Actually it
>    isn't necessarily empty. It might have whatever is in
>    the directory when /sub1/sub2 is not mounted.)
>
> This seems pretty obviously broken for ZFS, but I think
> it needs to be fixed in ZFS and I have no idea how to do
> that, since I don't know if snapshots are real mount points, etc.
>
>> ... and of course I can't mount from either v3/v4.
>>
>> On the other hand, I kept it as:
>>
>> /local/backup/home9 -ro archive-mrpriv.cs.yorku.ca
>> V4:/
>>
>> ... and was able to NFSv4 mount /local/backup/home9/.zfs/snapshot/20131203,
>> and this does indeed work.
>>
> Yes, although technically it should not work unless 20131203 is
> exported.
Hmm..  I thought that this line in the exports man page meant that it 
was okay:

"Because NFSv4 does not use the mount protocol, the ``administrative 
controls'' are not applied.  Thus, all the above export line(s) should 
be considered to have the -alldirs flag, even if the line is specified 
without it."

> However, it is probably the easiest work around until this is fixed
> someday.
> So, just to make sure I am clear on this...
> A NFSv4 mount of the snapshot works ok, even for a Linux client mount.
Yes.
Although with the new kernel,  I can mount 
nfs-server:/local/backup/home9/.zfs/snapshot now as well... which is 
neat because it solves the problem I was trying to solve..
I wanted users to be able to view their own snapshots, but not the 
snapshots of other users...
Now, on the archive server, I can mount the snapshot dir via NFSv4. 
Then, through autofs, I run a shell script that bind mounts each 
user's own snapshot directories from the NFSv4 mount into one 
directory.  I then provide chrooted sftp access to that directory 
so users can get at their files.  A user now sees "20131203 20131204..." 
when they sftp in.
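
The bind-mount step could look roughly like the sketch below. It is written in Python rather than shell so the planning logic can be tried without root, and it is not the actual script: the directory layout (one directory per user inside each snapshot), the target path, and the function names are all assumptions made for illustration. It only builds the "mount --bind" command lines and does not run them.

```python
import os
import tempfile


def plan_bind_mounts(snapshot_root, user, target_root):
    """Return (source, target) pairs, one per snapshot that contains
    a directory for the given user."""
    plan = []
    for snap in sorted(os.listdir(snapshot_root)):
        source = os.path.join(snapshot_root, snap, user)
        if os.path.isdir(source):
            plan.append((source, os.path.join(target_root, user, snap)))
    return plan


def as_commands(plan):
    """Turn the plan into argument lists for 'mount --bind' (not executed)."""
    return [["mount", "--bind", src, dst] for src, dst in plan]


if __name__ == "__main__":
    # Stand-in for the NFSv4-mounted snapshot directory, built in a temp dir.
    with tempfile.TemporaryDirectory() as root:
        for snap in ("20131203", "20131205"):
            os.makedirs(os.path.join(root, snap, "jas"))
        # "/snapshots" is a hypothetical chroot base for the sftp users.
        for cmd in as_commands(plan_bind_mounts(root, "jas", "/snapshots")):
            print(" ".join(cmd))
```

Snapshots that do not contain the user's directory are skipped, so each user only ends up with the dated directories that actually hold their files.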

>>> - Try doing the mounts with a FreeBSD client and see if you get the same
>>>     behaviour?
>> I found this:
>> http://forums.freenas.org/threads/mounting-snapshot-directory-using-nfs-from-linux-broken.6060/
>> .. implies it will work from FreeBSD/Nexenta, just not Linux.
> I suspect this might be the mounted_on_fileid vs fileid issue.
> (ie, The Linux client needs this to be done correctly, but the other
>   clients figure it out.)
>
> One case that might break for FreeBSD would be to cd into a snapshot
> and then do a pwd with the debug.disablecwd sysctl set to 1.
>
> Hopefully the ZFS wizards are reading this, rick
Me too!

Jason.



