Date: Thu, 12 Dec 2013 17:34:11 -0500 (EST)
From: Rick Macklem
To: Jason Keltz
Cc: FreeBSD Filesystems, Steve Dickson
Subject: Re: mount ZFS snapshot on Linux system
Message-ID: <1227422149.30131966.1386887651028.JavaMail.root@uoguelph.ca>
In-Reply-To: <52AA1965.9080709@cse.yorku.ca>

Jason Keltz wrote:
> On 12/11/2013 06:21 PM, Rick Macklem wrote:
> > Jason Keltz wrote:
> >> On 10/12/2013 7:21 PM, Rick Macklem wrote:
> >>> Jason Keltz wrote:
> >>>> I'm running FreeBSD 9.2 with various ZFS datasets.
> >>>> I export a dataset to a Linux system (RHEL64), and mount it. It works
> >>>> fine...
> >>>> When I try to access the ZFS snapshot directory on the Linux NFS
> >>>> client, things go weird.
> >>>>
> >>>> With NFSv4:
> >>>>
> >>>> [jas@archive /]# cd /mnt/.zfs/snapshot
> >>>> [jas@archive snapshot]# ls
> >>>> 20131203 20131205 20131206 20131207 20131208 20131209 20131210
> >>>> [jas@archive snapshot]# cd 20131210
> >>>> 20131210: Not a directory.
> >>>>
> >>>> huh?
> >>>>
> >>>> [jas@archive snapshot]# ls -al
> >>>> total 77
> >>>> dr-xr-xr-x   9 root root   9 Dec 10 11:20 .
> >>>> dr-xr-xr-x   4 root root   4 Nov 28 15:42 ..
> >>>> drwxr-xr-x 380 root root 380 Dec  2 15:56 20131203
> >>>> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131205
> >>>> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131206
> >>>> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131207
> >>>> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131208
> >>>> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131209
> >>>> drwxr-xr-x 381 root root 381 Dec  3 11:24 20131210
> >>>> [jas@archive snapshot]# stat *
> >>>> [jas@archive snapshot]# ls -al
> >>>> total 292
> >>>> dr-xr-xr-x 9 root root         9 Dec 10 11:20 .
> >>>> dr-xr-xr-x 4 root root         4 Nov 28 15:42 ..
> >>>> -rw-r--r-- 1 uax  guest   137647 Mar 17  2010 20131203
> >>>> -rw-r--r-- 1 uax  guest      865 Jul 31  2009 20131205
> >>>> -rw-r--r-- 1 uax  guest   137647 Mar 17  2010 20131206
> >>>> -rw-r--r-- 1 uax  guest      771 Jul 31  2009 20131207
> >>>> -rw-r--r-- 1 uax  guest      778 Jul 31  2009 20131208
> >>>> -rw-r--r-- 1 uax  guest     5281 Jul 31  2009 20131209
> >>>> -rw------- 1 btx  faculty    893 Jul 13 20:21 20131210
> >>>>
> >>>> But it gets even more fun..
> >>>>
> >>>> # ls -ali
> >>>> total 205
> >>>>   2 dr-xr-xr-x   9 root root       9 Dec 10 11:20 .
> >>>>   1 dr-xr-xr-x   4 root root       4 Nov 28 15:42 ..
> >>>> 863 -rw-r--r--   1 uax  guest 137647 Mar 17  2010 20131203
> >>>>   4 drwxr-xr-x 381 root root     381 Dec  3 11:24 20131205
> >>>>   4 drwxr-xr-x 381 root root     381 Dec  3 11:24 20131206
> >>>>   4 drwxr-xr-x 381 root root     381 Dec  3 11:24 20131207
> >>>>   4 drwxr-xr-x 381 root root     381 Dec  3 11:24 20131208
> >>>>   4 drwxr-xr-x 381 root root     381 Dec  3 11:24 20131209
> >>>>   4 drwxr-xr-x 381 root root     381 Dec  3 11:24 20131210
> >>>>
> >>>> This is not a user id mapping issue because all the files in /mnt have
> >>>> the proper owner/groups, and I can access them there fine.
> >>>>
> >>>> I also tried explicitly exporting .zfs/snapshot. The result isn't any
> >>>> different.
> >>>>
> >>>> If I use nfs v3 it "works", but I'm seeing a whole lot of errors like
> >>>> these in syslog:
> >>>>
> >>>> Dec 10 12:32:28 jungle mountd[49579]: can't delete exports for
> >>>> /local/backup/home9/.zfs/snapshot/20131203: Invalid argument
> >>>> Dec 10 12:32:28 jungle mountd[49579]: can't delete exports for
> >>>> /local/backup/home9/.zfs/snapshot/20131209: Invalid argument
> >>>> Dec 10 12:32:28 jungle mountd[49579]: can't delete exports for
> >>>> /local/backup/home9/.zfs/snapshot/20131210: Invalid argument
> >>>> Dec 10 12:32:28 jungle mountd[49579]: can't delete exports for
> >>>> /local/backup/home9/.zfs/snapshot/20131207: Invalid argument
> >>>>
> >>>> It's not clear to me why this doesn't just "work".
> >>>>
> >>>> Can anyone provide any advice on debugging this?
> >>>>
> >>> As I think you already know, I know nothing about ZFS and never use it.
> >> Yup! :)
> >>> Having said that, I suspect that there are filenos (i-node #s)
> >>> that are the same in the snapshot as in the parent file system tree.
> >>>
> >>> The basic assumptions are:
> >>> - within a file system, all i-node#s are unique (represent one file
> >>>   object only) and all file objects have the same fsid
> >>> - when the fsid changes, that indicates a file system boundary and
> >>>   fileno (i-node#s) can be reused in the subtree with a different fsid
> >>>
> >>> For NFSv3, the server should export single volumes only (all objects
> >>> have the same fsid and the filenos are unique). This is indicated to
> >>> the VFS by the use of the NOCROSSMOUNT flag on VOP_LOOKUP() and
> >>> friends.
> >>>
> >>> For NFSv4, the server does export multiple volumes and the boundary
> >>> is indicated by a change in fsid value.
> >>>
> >>> I suspect ZFS snapshots don't obey the above in some way, but that is
> >>> just a hunch.
> >>>
> >>> Now, how to narrow this down...
> >>> - Do the above tests (both NFSv4 and NFSv3) and capture the packets,
> >>>   then look at them in wireshark. In particular, look at the
> >>>   fileid numbers and fsid values for the various directories under
> >>>   .zfs.
> >> I gave this a shot, but I haven't used wireshark to capture NFS traffic
> >> before, so if I need to provide additional details, let me know..
> >>
> >> NFSv4:
> >>
> >> For /mnt/.zfs/snapshot/20131203:
> >> fileid=4
> >> fsid4.major=1446349656
> >> fsid4.minor=222
> >>
> >> For /mnt/.zfs/snapshot/20131205:
> >> fileid=4
> >> fsid4.major=1845998066
> >> fsid4.minor=222
> >>
> >> For /mnt/jas:
> >> fileid=144
> >> fsid4.major=597946950
> >> fsid4.minor=222
> >>
> >> For /mnt/jas1:
> >> fileid=338
> >> fsid4.major=597946950
> >> fsid4.minor=222
> >>
> >> So fsid is the same for all the different "data" directories, which is
> >> what I would expect given what you said. I guess each snapshot is seen
> >> as a unique filesystem... but then a repeating inode in different
> >> filesystems shouldn't be a problem...
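
As a rough illustration of the two assumptions listed above, here is a small Python sketch that runs them against the NFSv4 values from the capture. The names CAPTURE, filesystem_count and fileid_collisions are made up for this example and are not part of any NFS tool. It only checks that no fileid is used by two different paths under the same fsid. It is not a full consistency check, and hard links could legitimately show up as collisions.

```python
from collections import defaultdict

# (path, fsid4.major, fileid) as reported in the NFSv4 capture quoted above.
CAPTURE = [
    ("/mnt/.zfs/snapshot/20131203", 1446349656, 4),
    ("/mnt/.zfs/snapshot/20131205", 1845998066, 4),
    ("/mnt/jas", 597946950, 144),
    ("/mnt/jas1", 597946950, 338),
]


def filesystem_count(entries):
    """Count distinct fsid values; each one marks a separate file system."""
    return len({fsid for _, fsid, _ in entries})


def fileid_collisions(entries):
    """Return {fsid: {fileid: [paths]}} for fileids shared by several paths
    within one fsid. Reuse across different fsids is allowed and ignored."""
    by_fsid = defaultdict(lambda: defaultdict(list))
    for path, fsid, fileid in entries:
        by_fsid[fsid][fileid].append(path)
    collisions = {}
    for fsid, ids in by_fsid.items():
        shared = {fid: paths for fid, paths in ids.items() if len(paths) > 1}
        if shared:
            collisions[fsid] = shared
    return collisions


print(filesystem_count(CAPTURE))   # 3
print(fileid_collisions(CAPTURE))  # {}
```

With these numbers the two snapshots reuse fileid 4 but under different fsids, so nothing is flagged, which matches Jason's reading that the repeated inode number should not be a problem by itself.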
> >>
> > Yes, it appears that each snapshot is represented as a different file
> > system. As such, NFSv4 should work for these, but there is an additional
> > property of the "root" of each of these (20131203, ...).
> > When the directory .zfs/snapshot is read, the fileno for 20131203 should
> > be different than the fileno returned by VOP_GETATTR()/stat() for
> > "20131203".
> > (The old "mounted-on" vs "root-of-mounted-fs" vnodes which you get for a
> > "mount point".)
> > For NFSv4, the server returns the fileno in the VOP_READDIR() dirent as a
> > separate attribute called mounted_on_fileid vs the value returned by
> > VOP_GETATTR() as the fileid attribute.
> > If the value of these 2 attributes is the same, it is not a "mount
> > point".
> >
> > So, maybe you could take another look at the packet capture in wireshark
> > and see what the fileid and mounted_on_fileid attributes are?
>
> Unfortunately, I didn't save the log, but it was easy enough to
> regenerate.
>
> But before we go there, I've spent a lot of time experimenting with
> this, so I can say...
>
> If I NFSv4 mount nfs-server:/local/backup/home9 to /mnt, then I:
> cd /mnt/.zfs/snapshot/20131203
> ... it works great! I can change into any user directory, list
> files, etc.
> If I then:
> cd /mnt/.zfs/snapshot/20131205
> .. it also works great!
> But... if I cd into /mnt/.zfs/snapshot, the free ride is over...
> all the snapshot directories appear as files and the problem is
> there.
>
> ... unless I unmount and remount, in which case I can repeat.
>
> I also found that a change of kernel from 2.6.32-358.14.1.el6 (the
> kernel I was running with RHEL6.4) to 2.6.32-431.el6 (the kernel that
> comes with RHEL6.5) does actually change something important....
>
> If I mount nfs-server:/local/backup/home9 and try to change into
> "/mnt/.zfs/snapshot" with the new kernel, I still have the problem.
> Likewise, if I try to mount nfs-server:/local/backup/home9/.zfs, and
> change into "/mnt/snapshot", I also have the problem.
> If I mount nfs-server:/local/backup/home9/.zfs/snapshot and change into
> "/mnt", I still have the older problem, but with the RH 6.4 kernel in
> place.
> However, if I do the same mount with the newer kernel, it now works. I
> can "ls" and see the snapshot directories. I can change into any of
> them, then "cd .." and change into another one.
> I tested this on two systems - one where I just installed the entire 6.5
> upgrade, and the other where I just installed the kernel from 6.5 on the
> 6.4 system so it seems related to the kernel.
> It's still not clear why I can't just mount
> nfs-server:/local/backup/home9 on RHEL6.5, and the NFSv4 server figures
> it out. I did try from another FreeBSD client, and I can mount the tree
> at any point, and the NFS server is happy. This makes me believe it's
> probably a RHEL NFSv4 bug.
>
> Here's the numbers..
>
> NFSv4:
>
> So, if I try to access the snapshot path directly, on the way ...
>
> .zfs:
> V4 LOOKUP
> fsid.major: 597946950
> fileid: 1
> fattr owner/group are root - correct
>
> snapshot:
> V4 LOOKUP
> fsid.major: 597946950
> fileid: 2
> fattr owner/group are root - correct
>
> If I access /.zfs/snapshot/20131203 directly...:
>
> 20131203:
> V4 LOOKUP
> fsid.major: 1446349656
> fileid: 4
> fattr owner/group are root - correct
>
> V4 READDIR snapshot, 20121203 entry:
> fsid.major: 597946950 <-- ????
> fattr4_fileid: 863
> fattr4_owner/group refers to a group on our system (the one displayed in
> ls sometimes)..
> FATTR4_MOUNTED_ON_FILEID: 0x000000000000035f
>
> But if I ls /mnt/.zfs/snapshot:
>
> V4 LOOKUP:
> 201203:
> fsid.major: 597946950
> fileid: 4
>
> V4 READDIR:
> fsid4.major: 597946950
> fattr4_fileid: 863
> fattr4_mounted_on_fileid: 0x000000000000035f
>
I'll admit I'm not sure what you are looking at, but the above does seem
incorrect.
Could you email me the raw packet capture, by any chance?
(In particular, I need the packet capture for a readdir of .zfs/snapshot,
so I can look at the attributes of all the entries.)
Assuming the snapshots are represented as separate file systems, when you
do a readdir of .zfs/snapshot, the fileid and mounted_on_fileid attributes
should be different. (0x35f == 863)
Also, the fsid shouldn't be the same as .zfs/snapshot, which is 597946950
it seems.

> >> NFSv3:
> >>
> >> For /mnt/.zfs/snapshot/20131203:
> >> fileid=4
> >> fsid=0x0000000056358b58
> >>
> >> For /mnt/.zfs/snapshot/20131205:
> >> fileid=4
> >> fsid=0x000000006e07b1f2
> >>
> >> For /mnt/jas:
> >> fileid=144
> >> fsid=0x0000000023a3f246
> >>
> >> For /mnt/jas1:
> >> fileid=338
> >> fsid=0x0000000023a3f246
> >>
> >> Here, it seems it's the same, even though it's NFSv3... hmm.
> >>
> >>
> >>> - Try mounting the individual snapshot directory, like
> >>>   .zfs/snapshot/20131209 and see if that works (for both NFSv3
> >>>   and NFSv4).
> >> Hmm .. I tried this:
> >>
> >> /local/backup/home9/.zfs/snapshot/20131203 -ro archive-mrpriv.cs.yorku.ca
> >> V4: /
> >>
> >> ... but syslog reports:
> >>
> >> Dec 10 22:28:22 jungle mountd[85405]: can't export
> >> /local/backup/home9/.zfs/snapshot/20131203
> >>
> > mountd will do a VFS_CHECKEXP(), which seems to fail for
> > these (which also explains the error messages). To be honest,
> > with these failing, remote access should fail.
> >
> > Also, since NFSv3 exported volumes should not cross
> > "mount points" (anywhere the fsid changes), all a mount
> > above .zfs/snapshot/20131203 should get are a bunch of
> > empty directories called 20131203,...
> I tried again just in case I missed something...
> nfs-server:/local/backup/home9 on /mnt type nfs
> (ro,vers=3,addr=172.16.2.26)
> I can change into /mnt/.zfs/snapshot/20131203/jas and list the
> directory, or less a file.
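
A side note on reading the numbers: wireshark prints the NFSv3 fsid in hex and the NFSv4 fsid4.major in decimal, so it is easy to miss that they are the same values. The Python sketch below converts the hex fsids quoted above and compares them with the fsid4.major values from the NFSv4 capture. The dictionary and function names are invented for this example. It only shows that the numbers agree for these four paths and says nothing about how the server builds them.

```python
# NFSv3 fsid values (hex) and NFSv4 fsid4.major values (decimal), both
# copied from the captures quoted in this thread.
V3_FSID = {
    "/mnt/.zfs/snapshot/20131203": "0x0000000056358b58",
    "/mnt/.zfs/snapshot/20131205": "0x000000006e07b1f2",
    "/mnt/jas": "0x0000000023a3f246",
    "/mnt/jas1": "0x0000000023a3f246",
}
V4_FSID_MAJOR = {
    "/mnt/.zfs/snapshot/20131203": 1446349656,
    "/mnt/.zfs/snapshot/20131205": 1845998066,
    "/mnt/jas": 597946950,
    "/mnt/jas1": 597946950,
}


def compare_fsids(v3_hex, v4_major):
    """Map each path to True when its v3 fsid equals its v4 fsid4.major."""
    return {path: int(v3_hex[path], 16) == v4_major[path] for path in v3_hex}


for path, same in compare_fsids(V3_FSID, V4_FSID_MAJOR).items():
    print(f"{path}: v3 fsid {int(V3_FSID[path], 16)}, matches v4 major: {same}")

# The mounted_on_fileid from the READDIR reply, converted the same way.
print(int("0x000000000000035f", 16))  # 863
```

So under NFSv3 each snapshot also reports its own fsid, distinct from the one shared by /mnt/jas and /mnt/jas1, and the mounted_on_fileid of 0x35f is the same 863 that shows up as fattr4_fileid.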
>
> > For example, if in the UFS world with separate
> > file systems /sub1 and /sub1/sub2 with both exported:
> > - an NFSv3 mount of /sub1 on /mnt would see an empty
> >   directory "sub2" when looking in /mnt. (Actually it
> >   isn't necessarily empty. It might have whatever is in
> >   the directory when /sub1/sub2 is not mounted.)
> >
> > This seems pretty obviously broken for ZFS, but I think
> > it needs to be fixed in ZFS and I have no idea how to do
> > that, since I don't know if snapshots are real mount points, etc.
> >
> >> ... and of course I can't mount from either v3/v4.
> >>
> >> On the other hand, I kept it as:
> >>
> >> /local/backup/home9 -ro archive-mrpriv.cs.yorku.ca
> >> V4:/
> >>
> >> ... and was able to NFSv4 mount
> >> /local/backup/home9/.zfs/snapshot/20131203, and this does indeed
> >> work.
> >>
> > Yes, although technically it should not work unless 20131203 is
> > exported.
> Hmm.. I thought that this line in the exports man page meant that it
> was okay:
>
> "Because NFSv4 does not use the mount protocol, the ``administrative
> controls'' are not applied. Thus, all the above export line(s) should
> be considered to have the -alldirs flag, even if the line is specified
> without it."
>
This means that all directories within a file system are exported.
Since .zfs/snapshot/20131203 is a separate file system, it should need
a separate export entry. ZFS likes to do its own thing w.r.t. exports,
so I am not sure what it is actually doing w.r.t. snapshots.

> > However, it is probably the easiest work around until this is fixed
> > someday.
> > So, just to make sure I am clear on this...
> > A NFSv4 mount of the snapshot works ok, even for a Linux client
> > mount.
> Yes.
> Although with the new kernel, I can mount
> nfs-server:/local/backup/home9/.zfs/snapshot now as well... which is
> neat because it solves the problem I was trying to solve..
> I wanted users to be able to view their own snapshots, but not the
> snapshots of other users...
> Now, on the archive server, I can mount the snapshot dir via NFSv4,
> then, through autofs I am able to run a shell script that bind mounts
> the users' own individual snapshot directories from the NFSv4 mount into
> one directory. I then provide chrooted sftp access to that directory
> for users to get at their files. A user now sees "20131203 20131204..."
> when they sftp in..
>
I can't be sure, but since you mentioned above that it is "fixed" by
a dismount/remount, that would suggest it depends on what is cached in
the client and might break at different times, depending on what is
cached?
If you can do a mount like:
# mount -t nfs4 nfs-server:/local/backup/home9/.zfs/snapshot/20131203 /mnt
that might work reliably, although it may not be what you want.

> >>> - Try doing the mounts with a FreeBSD client and see if you get the
> >>>   same behaviour?
> >> I found this:
> >> http://forums.freenas.org/threads/mounting-snapshot-directory-using-nfs-from-linux-broken.6060/
> >> .. implies it will work from FreeBSD/Nexenta, just not Linux.
> > I suspect this might be the mounted_on_fileid vs fileid issue.
> > (ie, The Linux client needs this to be done correctly, but the other
> > clients figure it out.)
> >
> > One case that might break for FreeBSD would be to cd into a snapshot
> > and then do a pwd with the debug.disablecwd sysctl set to 1.
> >
> > Hopefully the ZFS wizards are reading this, rick
> Me too!
>
> Jason.
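
For anyone following along, the mounted_on_fileid vs fileid check discussed in this thread can be sketched in a few lines of Python. This is only an illustration of the rule as stated above (a file system root seen through READDIR should report a fileid different from its mounted_on_fileid, and an fsid different from the parent directory's). It is not how the Linux or FreeBSD client is implemented. DirEntryAttrs and mount_point_signals are names invented here, and the "expected" entry is hypothetical, built from the LOOKUP values in the capture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirEntryAttrs:
    """Attributes of one READDIR entry, as decoded by wireshark."""
    name: str
    fsid_major: int
    fileid: int
    mounted_on_fileid: int


def mount_point_signals(entry, parent_fsid_major):
    """Return (fileid_differs, fsid_differs) for one READDIR entry.

    Per the thread, the root of a separate file system should give
    True for both; (False, False) looks like an ordinary entry.
    """
    return (
        entry.fileid != entry.mounted_on_fileid,
        entry.fsid_major != parent_fsid_major,
    )


SNAPSHOT_DIR_FSID = 597946950  # fsid.major of .zfs/snapshot in the capture

# Values copied from the READDIR reply quoted in the thread.
observed = DirEntryAttrs("20131203", 597946950, 863, 0x35F)

# HYPOTHETICAL: what the entry might look like if READDIR reported the
# fsid and fileid that LOOKUP returned for the snapshot root.
expected = DirEntryAttrs("20131203", 1446349656, 4, 863)

print(mount_point_signals(observed, SNAPSHOT_DIR_FSID))  # (False, False)
print(mount_point_signals(expected, SNAPSHOT_DIR_FSID))  # (True, True)
```

The observed READDIR entry gives neither signal, which is consistent with Rick's suspicion that a client relying on these attributes would not see 20131203 as the root of a separate file system.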