Date: Tue, 10 Jan 2012 14:02:34 +0000 From: "Robert N. M. Watson" <rwatson@FreeBSD.org> To: Mikolaj Golub <trociny@freebsd.org> Cc: arch@freebsd.org, Kostik Belousov <kib@FreeBSD.org> Subject: Re: unix domain sockets on nullfs(5) Message-ID: <D1B8F00C-1E0D-4916-BD4B-FBCAE28E6F22@FreeBSD.org> In-Reply-To: <86sjjobzmn.fsf@kopusha.home.net> References: <86sjjobzmn.fsf@kopusha.home.net>
next in thread | previous in thread | raw e-mail | index | archive | help
On 9 Jan 2012, at 16:37, Mikolaj Golub wrote: > kib@ suggested that the issue could be fixed if one added new VOP_* > operations for setting and accessing vnode's v_socket field. I like the philosophy of the proposed approach signifiantly better than = the previous discussed approaches. Some thoughts: (1) I don't think the new behaviour should be optional -- it was always = the intent that nullfs pass through all behaviours to the underlying = layer, it's just that certain edge cases didn't appear in the original = implementation. Memory mapping was fixed a few years ago using similar = techniques. This will significantly reduce the complexity of your patch, = and also avoid user confusion since it will now behave "as expected". = Certainly, mention in future release notes would be appropriate, = however. (2) I'd like to think (as John also mentioned?) that we could use a = shared vnode lock when doing read-only access (i.e., connect). (3) With this patch, an rwlock is held over vnode operations -- required = due to the interlocked synchronisation of vnode and unix domain sockets. = We often try to avoid doing this for reasons of lock order (and = principle). It appears that it is likely fine in this case but it makes = me slightly nervous. (4) I'm slightly puzzled by the bind(2) case and interactions with = layering -- possibly there is a bug here. If I issue bind(2) against the = top layer, it looks like vp->v_socket will be set in the bottom layer, = but unp->unp_vnode will be assigned to the top-layer vnode? My = assumption was that you would want unp_vnode to always point to the = bottom (real) vnode, which suggest to me that the VOPs shouldn't just = assign v_socket, but should also assign unp_vnode. This has implications = elsewhere in uipc_usrreq.c as well. Could you clarify whether you think = this could be an issue? It may also be worth KASSERTing that the = top-level vnode never points at anything but NULL to catch bugs like = this. This may mean the VOPs have to have a bit of "test-and-set" to = them to get atomicity properties right when it comes to bind(2). In general, I think this is the right thing to do, and I'm very pleased = you're doing it -- but the patch requires some further work. Robert > The attached patch implements this. It also can be found here: >=20 > http://people.freebsd.org/~trociny/nullfs.VOP_UNP.4.patch >=20 > It adds three VOP_* operations: VOP_UNPBIND, VOP_UNPCONNECT and > VOP_UNPDETACH. Their purpose can be understood from the modifications > in uipc_usrreq.c: >=20 > - vp->v_socket =3D unp->unp_socket; > + VOP_UNPBIND(vp, unp->unp_socket); >=20 > - so2 =3D vp->v_socket; > + VOP_UNPCONNECT(vp, &so2); >=20 > - unp->unp_vnode->v_socket =3D NULL; > + VOP_UNPDETACH(unp->unp_vnode); >=20 > The default functions just do these simple operations, while > filesystems like nullfs can do more complicated things. >=20 > The patch also implements functions for nullfs. By default the old > behavior is preserved. To get the new behaviour the filesystem should > be (re)mounted with sobypass option. Then the socket operations are > bypassed to a lower vnode, which makes the socket be accessible from > both layers. >=20 > I am very interested to hear other people opinion on this. >=20 > --=20 > Mikolaj Golub >=20 > Index: sys/sys/vnode.h > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > --- sys/sys/vnode.h (revision 229701) > +++ sys/sys/vnode.h (working copy) > @@ -695,6 +695,9 @@ int vop_stdpathconf(struct vop_pathconf_args = *); > int vop_stdpoll(struct vop_poll_args *); > int vop_stdvptocnp(struct vop_vptocnp_args *ap); > int vop_stdvptofh(struct vop_vptofh_args *ap); > +int vop_stdunpbind(struct vop_unpbind_args *ap); > +int vop_stdunpconnect(struct vop_unpconnect_args *ap); > +int vop_stdunpdetach(struct vop_unpdetach_args *ap); > int vop_eopnotsupp(struct vop_generic_args *ap); > int vop_ebadf(struct vop_generic_args *ap); > int vop_einval(struct vop_generic_args *ap); > Index: sys/kern/uipc_usrreq.c > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > --- sys/kern/uipc_usrreq.c (revision 229701) > +++ sys/kern/uipc_usrreq.c (working copy) > @@ -542,7 +542,7 @@ restart: >=20 > UNP_LINK_WLOCK(); > UNP_PCB_LOCK(unp); > - vp->v_socket =3D unp->unp_socket; > + VOP_UNPBIND(vp, unp->unp_socket); > unp->unp_vnode =3D vp; > unp->unp_addr =3D soun; > unp->unp_flags &=3D ~UNP_BINDING; > @@ -638,7 +638,7 @@ uipc_detach(struct socket *so) > * XXXRW: Should assert vp->v_socket =3D=3D so. > */ > if ((vp =3D unp->unp_vnode) !=3D NULL) { > - unp->unp_vnode->v_socket =3D NULL; > + VOP_UNPDETACH(vp); > unp->unp_vnode =3D NULL; > } > unp2 =3D unp->unp_conn; > @@ -1308,7 +1308,7 @@ unp_connect(struct socket *so, struct sockaddr = *na > * and to protect simultaneous locking of multiple pcbs. > */ > UNP_LINK_WLOCK(); > - so2 =3D vp->v_socket; > + VOP_UNPCONNECT(vp, &so2); > if (so2 =3D=3D NULL) { > error =3D ECONNREFUSED; > goto bad2; > Index: sys/kern/vfs_default.c > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > --- sys/kern/vfs_default.c (revision 229701) > +++ sys/kern/vfs_default.c (working copy) > @@ -123,6 +123,9 @@ struct vop_vector default_vnodeops =3D { > .vop_unlock =3D vop_stdunlock, > .vop_vptocnp =3D vop_stdvptocnp, > .vop_vptofh =3D vop_stdvptofh, > + .vop_unpbind =3D vop_stdunpbind, > + .vop_unpconnect =3D vop_stdunpconnect, > + .vop_unpdetach =3D vop_stdunpdetach, > }; >=20 > /* > @@ -1037,6 +1040,39 @@ vop_stdadvise(struct vop_advise_args *ap) > return (error); > } >=20 > +int > +vop_stdunpbind(struct vop_unpbind_args *ap) > +{ > + struct vnode *vp; > + > + vp =3D ap->a_vp; > + > + vp->v_socket =3D ap->a_socket; > + return (0); > +} > + > +int > +vop_stdunpconnect(struct vop_unpconnect_args *ap) > +{ > + struct vnode *vp; > + > + vp =3D ap->a_vp; > + > + *ap->a_socket =3D vp->v_socket; > + return (0); > +} > + > +int > +vop_stdunpdetach(struct vop_unpdetach_args *ap) > +{ > + struct vnode *vp; > + > + vp =3D ap->a_vp; > + > + vp->v_socket =3D NULL; > + return (0); > +} > + > /* > * vfs default ops > * used to fill the vfs function table to get reasonable default = return values. > Index: sys/kern/vnode_if.src > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > --- sys/kern/vnode_if.src (revision 229701) > +++ sys/kern/vnode_if.src (working copy) > @@ -639,3 +639,23 @@ vop_advise { > IN off_t end; > IN int advice; > }; > + > +%% unpbind vp E E E > + > +vop_unpbind { > + IN struct vnode *vp; > + IN struct socket *socket; > +}; > + > +%% unpconnect vp E E E > + > +vop_unpconnect { > + IN struct vnode *vp; > + OUT struct socket **socket; > +}; > + > +%% unpdetach vp E E E > + > +vop_unpdetach { > + IN struct vnode *vp; > +}; > Index: sys/fs/nullfs/null.h > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > --- sys/fs/nullfs/null.h (revision 229701) > +++ sys/fs/nullfs/null.h (working copy) > @@ -37,8 +37,15 @@ > struct null_mount { > struct mount *nullm_vfs; > struct vnode *nullm_rootvp; /* Reference to root null_node = */ > + uint64_t nullm_flags; /* nullfs options specific for = mount */ > }; >=20 > +/* > + * Flags stored in nullm_flags. > + */ > +#define NULLMNT_SOBYPASS 0x00000001 /* Bypass unix = socket operations > + to lower vnode */ > + > #ifdef _KERNEL > /* > * A cache of vnode references > @@ -47,8 +54,16 @@ struct null_node { > LIST_ENTRY(null_node) null_hash; /* Hash list */ > struct vnode *null_lowervp; /* VREFed once */ > struct vnode *null_vnode; /* Back pointer */ > + u_int null_flags; /* Flags */ > }; >=20 > +/* > + * Flags stored in null_flags. > + */ > + > +#define NULL_SOBYPASS 0x00000001 /* Bypass unix socket = operations > + to lower vnode */ > + > #define MOUNTTONULLMOUNT(mp) ((struct null_mount = *)((mp)->mnt_data)) > #define VTONULL(vp) ((struct null_node *)(vp)->v_data) > #define NULLTOV(xp) ((xp)->null_vnode) > Index: sys/fs/nullfs/null_vnops.c > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > --- sys/fs/nullfs/null_vnops.c (revision 229701) > +++ sys/fs/nullfs/null_vnops.c (working copy) > @@ -812,6 +812,52 @@ null_vptocnp(struct vop_vptocnp_args *ap) > return (error); > } >=20 > +static int > +null_unpbind(struct vop_unpbind_args *ap) > +{ > + struct vnode *vp; > + struct null_node *xp; > + struct null_mount *xmp; > + > + vp =3D ap->a_vp; > + xp =3D VTONULL(vp); > + xmp =3D MOUNTTONULLMOUNT(vp->v_mount); > + if (xmp->nullm_flags & NULLMNT_SOBYPASS) { > + xp->null_flags |=3D NULL_SOBYPASS; > + return (null_bypass((struct vop_generic_args *)ap)); > + } else { > + return (vop_stdunpbind(ap)); > + } > +} > + > +static int > +null_unpconnect(struct vop_unpconnect_args *ap) > +{ > + struct vnode *vp; > + struct null_mount *xmp; > + > + vp =3D ap->a_vp; > + xmp =3D MOUNTTONULLMOUNT(vp->v_mount); > + if (xmp->nullm_flags & NULLMNT_SOBYPASS) > + return (null_bypass((struct vop_generic_args *)ap)); > + else > + return (vop_stdunpconnect(ap)); > +} > + > +static int > +null_unpdetach(struct vop_unpdetach_args *ap) > +{ > + struct vnode *vp; > + struct null_node *xp; > + > + vp =3D ap->a_vp; > + xp =3D VTONULL(vp); > + if (xp->null_flags & NULL_SOBYPASS) > + return (null_bypass((struct vop_generic_args *)ap)); > + else > + return (vop_stdunpdetach(ap)); > +} > + > /* > * Global vfs data structures > */ > @@ -837,4 +883,7 @@ struct vop_vector null_vnodeops =3D { > .vop_unlock =3D null_unlock, > .vop_vptocnp =3D null_vptocnp, > .vop_vptofh =3D null_vptofh, > + .vop_unpbind =3D null_unpbind, > + .vop_unpconnect =3D null_unpconnect, > + .vop_unpdetach =3D null_unpdetach, > }; > Index: sys/fs/nullfs/null_subr.c > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > --- sys/fs/nullfs/null_subr.c (revision 229701) > +++ sys/fs/nullfs/null_subr.c (working copy) > @@ -235,6 +235,7 @@ null_nodeget(mp, lowervp, vpp) >=20 > xp->null_vnode =3D vp; > xp->null_lowervp =3D lowervp; > + xp->null_flags =3D 0; > vp->v_type =3D lowervp->v_type; > vp->v_data =3D xp; > vp->v_vnlock =3D lowervp->v_vnlock; > Index: sys/fs/nullfs/null_vfsops.c > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > --- sys/fs/nullfs/null_vfsops.c (revision 229701) > +++ sys/fs/nullfs/null_vfsops.c (working copy) > @@ -84,16 +84,26 @@ nullfs_mount(struct mount *mp) > if (mp->mnt_flag & MNT_ROOTFS) > return (EOPNOTSUPP); > /* > - * Update is a no-op > + * Update is supported only for some options. > */ > if (mp->mnt_flag & MNT_UPDATE) { > - /* > - * Only support update mounts for NFS export. > - */ > + error =3D EOPNOTSUPP; > + xmp =3D MOUNTTONULLMOUNT(mp); > + if (vfs_flagopt(mp->mnt_optnew, "sobypass", NULL, 0)) { > + MNT_ILOCK(mp); > + xmp->nullm_flags |=3D NULLMNT_SOBYPASS; > + MNT_IUNLOCK(mp); > + error =3D 0; > + } > + if (vfs_flagopt(mp->mnt_optnew, "nosobypass", NULL, 0)) = { > + MNT_ILOCK(mp); > + xmp->nullm_flags &=3D ~NULLMNT_SOBYPASS; > + MNT_IUNLOCK(mp); > + error =3D 0; > + } > if (vfs_flagopt(mp->mnt_optnew, "export", NULL, 0)) > - return (0); > - else > - return (EOPNOTSUPP); > + error =3D 0; > + return (error); > } >=20 > /* > @@ -182,6 +192,11 @@ nullfs_mount(struct mount *mp) > MNT_ILOCK(mp); > mp->mnt_kern_flag |=3D lowerrootvp->v_mount->mnt_kern_flag & = MNTK_MPSAFE; > MNT_IUNLOCK(mp); > + > + xmp->nullm_flags =3D 0; > + vfs_flagopt(mp->mnt_optnew, "sobypass", &xmp->nullm_flags, > + NULLMNT_SOBYPASS); > + > mp->mnt_data =3D xmp; > vfs_getnewfsid(mp); >=20 > Index: sbin/mount_nullfs/mount_nullfs.c > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > --- sbin/mount_nullfs/mount_nullfs.c (revision 229701) > +++ sbin/mount_nullfs/mount_nullfs.c (working copy) > @@ -57,27 +57,36 @@ static const char rcsid[] =3D >=20 > #include "mntopts.h" >=20 > +#define NULLOPT_SOBYPASS 0x00000001 > +#define NULLOPT_MASK (NULLOPT_SOBYPASS) > + > static struct mntopt mopts[] =3D { > MOPT_STDOPTS, > + MOPT_UPDATE, > + {"sobypass", 0, NULLOPT_SOBYPASS, 1}, > MOPT_END > }; >=20 > +static char fstype[] =3D "nullfs"; > + > int subdir(const char *, const char *); > static void usage(void) __dead2; >=20 > int > main(int argc, char *argv[]) > { > - struct iovec iov[6]; > - int ch, mntflags; > + struct iovec *iov; > + int ch, iovlen, mntflags, nullflags, negflags; > char source[MAXPATHLEN]; > char target[MAXPATHLEN]; >=20 > - mntflags =3D 0; > + mntflags =3D nullflags =3D 0; > + negflags =3D NULLOPT_MASK; > while ((ch =3D getopt(argc, argv, "o:")) !=3D -1) > switch(ch) { > case 'o': > - getmntopts(optarg, mopts, &mntflags, 0); > + getmntopts(optarg, mopts, &mntflags, = &nullflags); > + getmntopts(optarg, mopts, &mntflags, &negflags); > break; > case '?': > default: > @@ -97,20 +106,18 @@ main(int argc, char *argv[]) > errx(EX_USAGE, "%s (%s) and %s are not distinct paths", > argv[0], target, argv[1]); >=20 > - iov[0].iov_base =3D strdup("fstype"); > - iov[0].iov_len =3D sizeof("fstype"); > - iov[1].iov_base =3D strdup("nullfs"); > - iov[1].iov_len =3D strlen(iov[1].iov_base) + 1; > - iov[2].iov_base =3D strdup("fspath"); > - iov[2].iov_len =3D sizeof("fspath"); > - iov[3].iov_base =3D source; > - iov[3].iov_len =3D strlen(source) + 1; > - iov[4].iov_base =3D strdup("target"); > - iov[4].iov_len =3D sizeof("target"); > - iov[5].iov_base =3D target; > - iov[5].iov_len =3D strlen(target) + 1; > - > - if (nmount(iov, 6, mntflags)) > + iov =3D NULL; > + iovlen =3D 0; > + build_iovec(&iov, &iovlen, "fstype", fstype, (size_t)-1); > + build_iovec(&iov, &iovlen, "fspath", source, (size_t)-1); > + build_iovec(&iov, &iovlen, "target", target, (size_t)-1); > + if ((nullflags & NULLOPT_SOBYPASS) !=3D 0) > + build_iovec(&iov, &iovlen, "sobypass", NULL, 0); > + if ((mntflags & MNT_UPDATE) !=3D 0) { > + if ((negflags & NULLOPT_SOBYPASS) =3D=3D 0) > + build_iovec(&iov, &iovlen, "nosobypass", NULL, = 0); > + } > + if (nmount(iov, iovlen, mntflags)) > err(1, NULL); > exit(0); > } > Index: sbin/mount_nullfs/mount_nullfs.8 > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > --- sbin/mount_nullfs/mount_nullfs.8 (revision 229701) > +++ sbin/mount_nullfs/mount_nullfs.8 (working copy) > @@ -79,8 +79,14 @@ Options are specified with a > flag followed by a comma separated string of options. > See the > .Xr mount 8 > -man page for possible options and their meanings. > +man page for standard options and their meanings. > +Options specific for > +.Nm : > +.Bl -tag -width sobypass > +.It Cm sobypass > +Bypass unix socket operations to the lower layer. > .El > +.El > .Pp > The null layer has two purposes. > First, it serves as a demonstration of layering by providing a layer
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?D1B8F00C-1E0D-4916-BD4B-FBCAE28E6F22>