From: John Baldwin
To: freebsd-arch@freebsd.org
Cc: Mikolaj Golub, arch@freebsd.org, Robert Watson, Kostik Belousov
Date: Tue, 10 Jan 2012 08:19:18 -0500
Subject: Re: unix domain sockets on nullfs(5)
Message-Id: <201201100819.18892.jhb@freebsd.org>
In-Reply-To: <86sjjobzmn.fsf@kopusha.home.net>

On Monday, January 09, 2012 11:37:52 am Mikolaj Golub wrote:
> Hi,
>
> There is a longstanding problem with nullfs(5) that is unix sockets do
> not work between lower and upper layers.
>
> See, e.g. kern/51583, kern/159663.
>
> On a unix socket binding the created socket is referenced in the vnode
> field v_socket. This field is used on connect (from the vnode returned
> by lookup). Unix socket functions like unp_bind/connect set/access
> this field directly.
>
> This is the issue for nullfs, which uses two-layer vnode approach:
> binding to the upper layer, the socket reference is stored in the
> upper vnode; binding to the lower fs, the socket reference is stored
> in the lower vnode and is not seen from the upper layer.
>
> E.g. having /mnt/upper nullfs mounted on /mnt/lower:
>
> 1) if we bind to /mnt/lower/test.sock we can connect only to
> /mnt/lower/test.sock.
>
> 2) if we bind to /mnt/upper/test.sock we can connect only to
> /mnt/upper/test.sock.
>
> The desired behavior is one can connect to both the lower and the
> upper paths regardless if we bind to /mnt/lower/test.sock or
> /mnt/upper/test.sock.
>
> In kern/159663 two approaches were discussed:
>
> 1) copy the socket pointer from lower vnode to upper vnode on the
> upper vnode get (fix the case when one binds to the lower fs and wants
> to connect via the upper, but does not fix the case when one binds to
> the upper and wants to connect via the lower fs);
>
> 2) make null_lookup/create return lower vnode for VSOCK vnodes.
>
> Both approaches have issues and look rather hackish.
>
> kib@ suggested that the issue could be fixed if one added new VOP_*
> operations for setting and accessing vnode's v_socket field.
>
> The attached patch implements this. It also can be found here:
>
> http://people.freebsd.org/~trociny/nullfs.VOP_UNP.4.patch
>
> It adds three VOP_* operations: VOP_UNPBIND, VOP_UNPCONNECT and
> VOP_UNPDETACH.
> Their purpose can be understood from the modifications
> in uipc_usrreq.c:
>
> - vp->v_socket = unp->unp_socket;
> + VOP_UNPBIND(vp, unp->unp_socket);
>
> - so2 = vp->v_socket;
> + VOP_UNPCONNECT(vp, &so2);
>
> - unp->unp_vnode->v_socket = NULL;
> + VOP_UNPDETACH(unp->unp_vnode);
>
> The default functions just do these simple operations, while
> filesystems like nullfs can do more complicated things.
>
> The patch also implements functions for nullfs. By default the old
> behavior is preserved. To get the new behaviour the filesystem should
> be (re)mounted with sobypass option. Then the socket operations are
> bypassed to a lower vnode, which makes the socket be accessible from
> both layers.
>
> I am very interested to hear other people's opinions on this.

I think this is a decent solution. Why not make the locking notes for
VOP_UNPCONNECT() be "L" instead of "E"? A read lock should be sufficient
to fetch the socket? In fact, I suspect that unp_connect() could actually
use a shared lock on the vnode by adding 'LOCKSHARED' to the flags passed
to namei() via NDINIT().

-- 
John Baldwin
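
For readers following the thread, the dispatch idea described in the quoted
patch summary can be modelled in user space. This is only a rough sketch and
not the patch itself: the struct layouts, the v_lower field, the lower-case
operation slots and the def_*/bypass_* function names are assumptions made
for illustration. Only the three operation names and the v_socket field come
from the message above, and no kernel locking is modelled.

```c
#include <stddef.h>

/* Toy stand-ins for the kernel types; NOT the real definitions. */
struct socket { int so_id; };
struct vnode;

/* Hypothetical per-filesystem operation table with the three new slots. */
struct vnode_ops {
	void (*unpbind)(struct vnode *vp, struct socket *so);
	void (*unpconnect)(struct vnode *vp, struct socket **sop);
	void (*unpdetach)(struct vnode *vp);
};

struct vnode {
	const struct vnode_ops *v_op;
	struct socket *v_socket;	/* touched only by the default ops */
	struct vnode *v_lower;		/* assumption: set for the nullfs-like vnode */
};

/* Default ops: the same plain assignments the old code did inline. */
static void def_unpbind(struct vnode *vp, struct socket *so) { vp->v_socket = so; }
static void def_unpconnect(struct vnode *vp, struct socket **sop) { *sop = vp->v_socket; }
static void def_unpdetach(struct vnode *vp) { vp->v_socket = NULL; }

const struct vnode_ops default_ops = {
	def_unpbind, def_unpconnect, def_unpdetach
};

/* Bypass ops: forward every request to the lower vnode's own ops. */
static void bypass_unpbind(struct vnode *vp, struct socket *so)
{
	vp->v_lower->v_op->unpbind(vp->v_lower, so);
}

static void bypass_unpconnect(struct vnode *vp, struct socket **sop)
{
	vp->v_lower->v_op->unpconnect(vp->v_lower, sop);
}

static void bypass_unpdetach(struct vnode *vp)
{
	vp->v_lower->v_op->unpdetach(vp->v_lower);
}

const struct vnode_ops bypass_ops = {
	bypass_unpbind, bypass_unpconnect, bypass_unpdetach
};
```

In this model, a vnode set up with bypass_ops and v_lower pointing at a
vnode that uses default_ops stores the socket only in the lower vnode, so a
bind through either vnode is visible to a connect through the other.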