From owner-freebsd-fs@FreeBSD.ORG  Sat Jun 25 13:53:23 2011
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 4CD84106564A;
	Sat, 25 Jun 2011 13:53:23 +0000 (UTC)
	(envelope-from rmacklem@uoguelph.ca)
Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca
	[131.104.91.44])
	by mx1.freebsd.org (Postfix) with ESMTP id B0F968FC08;
	Sat, 25 Jun 2011 13:53:22 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AugAALznBU6DaFvO/2dsb2JhbABShEmTU5AjukOQMYErg3mBDASSA5A3
X-IronPort-AV: E=Sophos;i="4.65,424,1304308800"; d="scan'208";a="129007130"
Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca)
	([131.104.91.206])
	by esa-jnhn-pri.mail.uoguelph.ca with ESMTP; 25 Jun 2011 09:53:21 -0400
Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1])
	by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id CFA7CB3F07;
	Sat, 25 Jun 2011 09:53:21 -0400 (EDT)
Date: Sat, 25 Jun 2011 09:53:21 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: John Baldwin <jhb@freebsd.org>
Message-ID: <1714423172.1062587.1309010001836.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <201106250758.23935.jhb@freebsd.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Originating-IP: [172.17.91.201]
X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - SAF3 (Mac)/6.0.10_GA_2692)
Cc: freebsd-fs@freebsd.org, shadow@gmail.com,
	Robert Watson <rwatson@freebsd.org>, Garance A Drosehn <gad@freebsd.org>
Subject: Re: [rfc] 64-bit inode numbers
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 25 Jun 2011 13:53:23 -0000

John Baldwin wrote:
> On Friday, June 24, 2011 11:38:35 pm Benjamin Kaduk wrote:
> > > point. fts(3) and friends will assume that it is a mount point
> > > crossing when st_dev changes. It will then expect that the funny
> > > rule that the d_ino in dirent will not be the same as st_ino.
> > >
> > > What I do for NFSv4 is synthesize the mnt_stat.f_fsid value and
> > > return that as st_dev for the mounted volume until I see the fsid
> > > returned by the server change. Below that point, I return the fsid
> > > from the server as st_dev so long as it isn't the same as the
> >
> > I think I'm confused. You're ... walking a directory hierarchy, and
> > return a fake st_dev value but hold onto the fsid value from the
> > server, then when the fsid from the server changes (due to a ...
> > different NFS mount?), start reporting that new fsid and throw away
> > the fake st_dev value? Can you point me at the code that is doing
> > this?
> 
> I think he's saying that VOP_GETATTR() for different vnodes in a single
> NFSv4 "mount" (as in 'struct mount *') can return different st_dev
> values to userland where the st_dev value for a given vnode depends on
> the remote fsid of the file on the NFSv4 server. That is, for NFSv4 it
> seems that all files on a mount do not use the same value of st_dev (as
> they would for a local filesystem), but instead only files from the
> logical volume on the server share an st_dev. That is, st_dev is
> per-vnode rather than just copied from the mount. This is done by
> storing va_fsid in the NFS attribute cache for each vnode:
> 
> int
> nfscl_loadattrcache(struct vnode **vpp, struct nfsvattr *nap, void *nvaper,
>     void *stuff, int writeattr, int dontshrink)
> {
> 	...
> 	/*
> 	 * For NFSv4, if the node's fsid is not equal to the mount point's
> 	 * fsid, return the low order 32bits of the node's fsid. This
> 	 * allows getcwd(3) to work. There is a chance that the fsid might
> 	 * be the same as a local fs, but since this is in an NFS mount
> 	 * point, I don't think that will cause any problems?
> 	 */
> 	if (NFSHASNFSV4(nmp) && NFSHASHASSETFSID(nmp) &&
> 	    (nmp->nm_fsid[0] != np->n_vattr.na_filesid[0] ||
> 	     nmp->nm_fsid[1] != np->n_vattr.na_filesid[1])) {
> 		/*
> 		 * va_fsid needs to be set to some value derived from
> 		 * np->n_vattr.na_filesid that is not equal
> 		 * vp->v_mount->mnt_stat.f_fsid[0], so that it changes
> 		 * from the value used for the top level server volume
> 		 * in the mounted subtree.
> 		 */
> 		if (vp->v_mount->mnt_stat.f_fsid.val[0] !=
> 		    (uint32_t)np->n_vattr.na_filesid[0])
> 			vap->va_fsid = (uint32_t)np->n_vattr.na_filesid[0];
> 		else
> 			vap->va_fsid = (uint32_t)hash32_buf(
> 			    np->n_vattr.na_filesid, 2 * sizeof(uint64_t), 0);
> 	} else
> 		vap->va_fsid = vp->v_mount->mnt_stat.f_fsid.val[0];
> 	...
> }
> 
> Then for VOP_GETATTR() it returns the va_fsid from the attribute cache
> saved in 'vap' as the vnode's va_fsid which is used to compute st_dev in
> vn_stat().
> 
> I think the effect here is that 'mount' still only shows a single
> mountpoint for NFSv4, but applications that check for 'st_dev' changing
> to see if they are crossing a mountpoint (e.g. find -x) will treat the
> volumes as different mountpoints.
> 
Yes, John. You said it way better than I did. :-)

This is necessary for NFSv4 because an NFSv4 server crosses its own
mount points (NFSv3 servers do not) and, as such, st_ino by itself is
not unique within one client NFSv4 mount (struct mount *).

Without a distinct st_dev for each server volume, things like "ls -R"
will complain about cycles when the same <st_dev, st_ino> tuple is seen
again.
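
To make the cycle check concrete, here is a minimal userland sketch of
the kind of test a tree walker applies. It is not the actual fts(3) or
ls(1) code; struct file_id and seen_before() are names made up for this
example, and a real walker would use something better than a linear
scan.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Identity of a visited directory, as a tree walker might record it. */
struct file_id {
	dev_t dev;	/* st_dev reported by stat(2) */
	ino_t ino;	/* st_ino reported by stat(2) */
};

/*
 * Hypothetical helper: return true if the <dev, ino> tuple already
 * appears in seen[0..n).  A walker treats that as a cycle.
 */
static bool
seen_before(const struct file_id *seen, size_t n, dev_t dev, ino_t ino)
{
	for (size_t i = 0; i < n; i++)
		if (seen[i].dev == dev && seen[i].ino == ino)
			return (true);
	return (false);
}
```

If two directories on different server volumes happen to have the same
st_ino and the client also reported the same st_dev for both, the second
one would match here and be flagged as a cycle. With a different st_dev
per server volume, the tuples differ and the check passes.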

rick