From owner-freebsd-arch@FreeBSD.ORG  Tue Jan 10 13:19:21 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 13111106566B;
	Tue, 10 Jan 2012 13:19:21 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42])
	by mx1.freebsd.org (Postfix) with ESMTP id C5F2F8FC0A;
	Tue, 10 Jan 2012 13:19:20 +0000 (UTC)
Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [96.47.65.170])
	by cyrus.watson.org (Postfix) with ESMTPSA id 69CC046B2C;
	Tue, 10 Jan 2012 08:19:20 -0500 (EST)
Received: from jhbbsd.localnet (unknown [209.249.190.124])
	by bigwig.baldwin.cx (Postfix) with ESMTPSA id E5432B91A;
	Tue, 10 Jan 2012 08:19:19 -0500 (EST)
From: John Baldwin <jhb@freebsd.org>
To: freebsd-arch@freebsd.org
Date: Tue, 10 Jan 2012 08:19:18 -0500
User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p10; KDE/4.5.5; amd64; ; )
References: <86sjjobzmn.fsf@kopusha.home.net>
In-Reply-To: <86sjjobzmn.fsf@kopusha.home.net>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-15"
Content-Transfer-Encoding: 7bit
Message-Id: <201201100819.18892.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7
	(bigwig.baldwin.cx); Tue, 10 Jan 2012 08:19:20 -0500 (EST)
Cc: Mikolaj Golub <trociny@freebsd.org>, arch@freebsd.org,
	Robert Watson <rwatson@freebsd.org>, Kostik Belousov <kib@freebsd.org>
Subject: Re: unix domain sockets on nullfs(5)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 10 Jan 2012 13:19:21 -0000

On Monday, January 09, 2012 11:37:52 am Mikolaj Golub wrote:
> Hi,
> 
> There is a longstanding problem with nullfs(5): unix sockets do not
> work between the lower and upper layers.
> 
> See, e.g. kern/51583, kern/159663.
> 
> When a unix socket is bound, the created socket is referenced in the
> vnode's v_socket field. This field is used on connect (from the vnode
> returned by lookup). Unix socket functions like unp_bind/connect
> set/access this field directly.
> 
> This is a problem for nullfs, which uses a two-layer vnode approach:
> when binding to the upper layer, the socket reference is stored in the
> upper vnode; when binding to the lower fs, the socket reference is
> stored in the lower vnode and is not seen from the upper layer.
> 
> E.g. with /mnt/lower nullfs-mounted on /mnt/upper:
> 
> 1) if we bind to /mnt/lower/test.sock we can connect only to
> /mnt/lower/test.sock.
> 
> 2) if we bind to /mnt/upper/test.sock we can connect only to
> /mnt/upper/test.sock.
> 
> The desired behavior is that one can connect via both the lower and
> the upper paths regardless of whether we bind to /mnt/lower/test.sock
> or /mnt/upper/test.sock.
> 
> In kern/159663 two approaches were discussed:
> 
> 1) copy the socket pointer from the lower vnode to the upper vnode when
> the upper vnode is obtained (this fixes the case when one binds to the
> lower fs and wants to connect via the upper, but does not fix the case
> when one binds to the upper and wants to connect via the lower fs);
> 
> 2) make null_lookup/create return lower vnode for VSOCK vnodes.
> 
> Both approaches have issues and look rather hackish.
> 
> kib@ suggested that the issue could be fixed if one added new VOP_*
> operations for setting and accessing vnode's v_socket field.
> 
> The attached patch implements this. It also can be found here:
> 
> http://people.freebsd.org/~trociny/nullfs.VOP_UNP.4.patch
> 
> It adds three VOP_* operations: VOP_UNPBIND, VOP_UNPCONNECT and
> VOP_UNPDETACH. Their purpose can be understood from the modifications
> in uipc_usrreq.c:
> 
> -	vp->v_socket = unp->unp_socket;
> +	VOP_UNPBIND(vp, unp->unp_socket);
> 
> -	so2 = vp->v_socket;
> +	VOP_UNPCONNECT(vp, &so2);
> 
> -	unp->unp_vnode->v_socket = NULL;
> +	VOP_UNPDETACH(unp->unp_vnode);
> 
> The default functions just do these simple operations, while
> filesystems like nullfs can do more complicated things.
> 
> The patch also implements these functions for nullfs. By default the
> old behavior is preserved. To get the new behavior the filesystem
> should be (re)mounted with the sobypass option. The socket operations
> are then bypassed to the lower vnode, which makes the socket
> accessible from both layers.
> 
> I am very interested to hear other people's opinions on this.

I think this is a decent solution.  Why not make the locking notes for
VOP_UNPCONNECT() "L" instead of "E"?  A read (shared) lock should be
sufficient to fetch the socket.  In fact, I suspect that unp_connect()
could use a shared lock on the vnode by adding LOCKSHARED to the flags
passed to namei() via NDINIT().
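
To make the bypass idea concrete, here is a small userland model of the
three operations.  It is only a sketch, not the patch: struct socket,
struct vnode, the v_lower field, the vop_vector layout, the function names
and the VOP_* macros below are simplified stand-ins invented for the
illustration, and vnode locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (assumptions). */
struct socket { int so_id; };

struct vnode;

struct vop_vector {
	void (*vop_unpbind)(struct vnode *vp, struct socket *so);
	void (*vop_unpconnect)(struct vnode *vp, struct socket **sop);
	void (*vop_unpdetach)(struct vnode *vp);
};

struct vnode {
	const struct vop_vector *v_op;
	struct socket *v_socket;
	struct vnode *v_lower;	/* hypothetical: set only on a nullfs-like upper vnode */
};

/* Model-only dispatch macros; the real ones are generated by the kernel build. */
#define	VOP_UNPBIND(vp, so)	((vp)->v_op->vop_unpbind((vp), (so)))
#define	VOP_UNPCONNECT(vp, sop)	((vp)->v_op->vop_unpconnect((vp), (sop)))
#define	VOP_UNPDETACH(vp)	((vp)->v_op->vop_unpdetach((vp)))

/* Default operations: the plain v_socket accesses from uipc_usrreq.c. */
static void
std_unpbind(struct vnode *vp, struct socket *so)
{
	vp->v_socket = so;
}

static void
std_unpconnect(struct vnode *vp, struct socket **sop)
{
	*sop = vp->v_socket;
}

static void
std_unpdetach(struct vnode *vp)
{
	vp->v_socket = NULL;
}

static const struct vop_vector default_vnodeops = {
	std_unpbind, std_unpconnect, std_unpdetach
};

/* "sobypass" operations: forward every call to the lower vnode. */
static void
null_unpbind(struct vnode *vp, struct socket *so)
{
	VOP_UNPBIND(vp->v_lower, so);
}

static void
null_unpconnect(struct vnode *vp, struct socket **sop)
{
	VOP_UNPCONNECT(vp->v_lower, sop);
}

static void
null_unpdetach(struct vnode *vp)
{
	VOP_UNPDETACH(vp->v_lower);
}

static const struct vop_vector null_sobypass_vnodeops = {
	null_unpbind, null_unpconnect, null_unpdetach
};

/* Bind via the upper vnode, connect via the lower one; returns 1 if found. */
int
sobypass_demo(void)
{
	struct socket so = { 1 };
	struct vnode lower = { &default_vnodeops, NULL, NULL };
	struct vnode upper = { &null_sobypass_vnodeops, NULL, &lower };
	struct socket *so2 = NULL;

	VOP_UNPBIND(&upper, &so);
	assert(upper.v_socket == NULL);	/* nothing is stored on the upper vnode */
	VOP_UNPCONNECT(&lower, &so2);
	return (so2 == &so);
}
```

In this model the socket pointer only ever lives in the lower vnode, so it
does not matter which layer the bind or the connect goes through.  Giving
the upper vnode the default operations instead models the old behavior,
where a bind via the upper path is invisible from the lower one.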

-- 
John Baldwin