From owner-freebsd-current@FreeBSD.ORG  Mon Nov 10 09:40:11 2014
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 6DAA25DD;
 Mon, 10 Nov 2014 09:40:11 +0000 (UTC)
Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238])
 by mx1.freebsd.org (Postfix) with ESMTP id E7E3C8D5;
 Mon, 10 Nov 2014 09:40:10 +0000 (UTC)
Received: by onelab2.iet.unipi.it (Postfix, from userid 275)
 id BCD3A7300A; Mon, 10 Nov 2014 10:44:12 +0100 (CET)
Date: Mon, 10 Nov 2014 10:44:12 +0100
From: Luigi Rizzo <rizzo@iet.unipi.it>
To: Konstantin Belousov <kostikbel@gmail.com>
Subject: Re: dev_lock() contention for fdesc syscalls -- possible fix
Message-ID: <20141110094412.GA25189@onelab2.iet.unipi.it>
References: <20141110014939.GA21626@onelab2.iet.unipi.it>
 <20141110083457.GD53947@kib.kiev.ua>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20141110083457.GD53947@kib.kiev.ua>
User-Agent: Mutt/1.5.20 (2009-06-14)
Cc: jmg@freebsd.org, gnn@freebsd.org, adrian@freebsd.org, current@freebsd.org
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
 <freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-current>, 
 <mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current/>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
 <mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 10 Nov 2014 09:40:11 -0000

On Mon, Nov 10, 2014 at 10:34:57AM +0200, Konstantin Belousov wrote:
> On Mon, Nov 10, 2014 at 02:49:39AM +0100, Luigi Rizzo wrote:
> > It was noticed that there is huge dev_lock() contention when multiple
> > processes do a poll() even on independent file descriptors.
> > 
> > Turns out that not just poll but most syscalls on file descriptors
> > (as opposed to sockets) in sys/fs/devfs/devfs_vnops.c including
> > devfs_poll_f(), devfs_ioctl_f() and read/write share the problem
> > as they use the following pattern
> > 
> >     devfs_poll_f() {
> >         ...
> >         devfs_fp_check(fp, ...) -->
> >             kern/kern_conf.c :: devvn_refthread(fp->f_vnode, ...) -->
> >                 dev_lock();
> >                 dev = vp->v_rdev; // lock on vp ?
> >                 ... check that dev != NULL
> >                 atomic_add_long(&dev->si_threadcount, 1);
> >                 dev_unlock();
> >         dsw->d_poll()
> >         dev_relthread() -->
> >             atomic_subtract_rel_long(&dev->si_threadcount, 1);
> >    }
> > 
> > 
> > I believe that dev_lock() in devvn_refthread() is protecting
> >         dev = vp->v_rdev
> > (the cdev entry 'dev' cannot go away for the reasons stated below).
> > 
> > However looking at places where vp->v_rdev is modified, turns out
> > that it only happens when holding VI_LOCK(vp) -- the places are
> > devfs_allocv() and devfs_reclaim().
> > There is one place in zfs where the vnode is created and v_rdev
> > is set without holding a lock, so nobody can dereference it.
> > 
> > As a consequence i believe that if in devfs_fp_check() we replace
> > dev_lock() / dev_unlock() with VI_LOCK(vp) / VI_UNLOCK(vp),
> > we make sure that we can safely dereference vp->v_rdev, and the
> > cdev cannot go away because the vnode has a reference to it.
> > The counter uses an atomic instruction (so the release is lockless)
> Vnode reference, as well as cdev reference, which is obtained by
> dev_ref(), do not provide any protection there.  v_rdev is only
> coincidentally protected by the vnode interlock.
> 
> If you look at larger part of the code, you would note that dev mutex
> is held even after v_rdev is dereferenced.  The real protection it
> provides is against race with destroy_dev(l)(), which could invalidate
> dev->so_devsw at any moment when either device thread reference is
> not held, or dev mutex is not held.  So your patch breaks the
> device destruction.

I see. Thanks for the clarification.

Would it help to rewrite the part of devvn_refthread as follows:

    devvn_refthread() {
	// protect vp->v_rdev dereference and dev disappearing
	VI_LOCK(vp);
	dev = vi->v_rdev;
	.. check that dev != NULL
	// protect the race on dev->si_devsw
	atomic_add_long(&dev->si_threadcount, 1);
	VI_UNLOCK(vp);
	... appropriate memory barrier
	csw = dev->si_devsw;
	if (csw != NULL) {
		// we won the race, even if destroy_devl() runs
		// it will keep the cdevsw alive until si_threadcount goes to 0
		*devp = dev;
		*ref = 1;
	} else {
		// we lost
		*ref = 0;
	}
	return csw;
    }

i.e.  tentatively increment si_threadcount before dereferencing si_devsw
and then restoring it if we lose the race ?
It might be necessary to add a barrier in destroy_devl() between clearing
si_devsw and reading si_threadcount.

> > 
> > This should be enough to remove the contention.
> If you never calls destroy_dev() for the devfs node, you could use
> MAKEDEV_ETERNAL flag for make_dev_credf(), which indicates that there
> is no risk of race with destroy_dev() for the created node.
> 
> Usually, code which could be compiled in kernel statically but also
> loaded as module, use MAKEDEV_ETERNAL_KLD flag, which takes care of
> module needed to call destroy_dev().

that I suppose would mean that the module cannot be safely unloaded ?

cheers
luigi