From owner-freebsd-current@FreeBSD.ORG  Mon Nov 10 10:34:19 2014
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id CD764194;
 Mon, 10 Nov 2014 10:34:19 +0000 (UTC)
Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 70D88E2C;
 Mon, 10 Nov 2014 10:34:19 +0000 (UTC)
Received: from tom.home (kostik@localhost [127.0.0.1])
 by kib.kiev.ua (8.14.9/8.14.9) with ESMTP id sAAAYEMo001958
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
 Mon, 10 Nov 2014 12:34:14 +0200 (EET)
 (envelope-from kostikbel@gmail.com)
DKIM-Filter: OpenDKIM Filter v2.9.2 kib.kiev.ua sAAAYEMo001958
Received: (from kostik@localhost)
 by tom.home (8.14.9/8.14.9/Submit) id sAAAYDOI001957;
 Mon, 10 Nov 2014 12:34:13 +0200 (EET)
 (envelope-from kostikbel@gmail.com)
X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com
 using -f
Date: Mon, 10 Nov 2014 12:34:13 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Luigi Rizzo <rizzo@iet.unipi.it>
Subject: Re: dev_lock() contention for fdesc syscalls -- possible fix
Message-ID: <20141110103412.GG53947@kib.kiev.ua>
References: <20141110014939.GA21626@onelab2.iet.unipi.it>
 <20141110083457.GD53947@kib.kiev.ua>
 <20141110094412.GA25189@onelab2.iet.unipi.it>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20141110094412.GA25189@onelab2.iet.unipi.it>
User-Agent: Mutt/1.5.23 (2014-03-12)
X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00,
 DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no
 autolearn_force=no version=3.4.0
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on tom.home
Cc: jmg@freebsd.org, gnn@freebsd.org, adrian@freebsd.org, current@freebsd.org
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
 <freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-current>, 
 <mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current/>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
 <mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 10 Nov 2014 10:34:19 -0000

On Mon, Nov 10, 2014 at 10:44:12AM +0100, Luigi Rizzo wrote:
> On Mon, Nov 10, 2014 at 10:34:57AM +0200, Konstantin Belousov wrote:
> > On Mon, Nov 10, 2014 at 02:49:39AM +0100, Luigi Rizzo wrote:
> > > It was noticed that there is huge dev_lock() contention when multiple
> > > processes do a poll() even on independent file descriptors.
> > > 
> > > Turns out that not just poll but most syscalls on file descriptors
> > > (as opposed to sockets) in sys/fs/devfs/devfs_vnops.c including
> > > devfs_poll_f(), devfs_ioctl_f() and read/write share the problem
> > > as they use the following pattern
> > > 
> > >     devfs_poll_f() {
> > >         ...
> > >         devfs_fp_check(fp, ...) -->
> > >             kern/kern_conf.c :: devvn_refthread(fp->f_vnode, ...) -->
> > >                 dev_lock();
> > >                 dev = vp->v_rdev; // lock on vp ?
> > >                 ... check that dev != NULL
> > >                 atomic_add_long(&dev->si_threadcount, 1);
> > >                 dev_unlock();
> > >         dsw->d_poll()
> > >         dev_relthread() -->
> > >             atomic_subtract_rel_long(&dev->si_threadcount, 1);
> > >    }
> > > 
> > > 
> > > I believe that dev_lock() in devvn_refthread() is protecting
> > >         dev = vp->v_rdev
> > > (the cdev entry 'dev' cannot go away for the reasons stated below).
> > > 
> > > However looking at places where vp->v_rdev is modified, turns out
> > > that it only happens when holding VI_LOCK(vp) -- the places are
> > > devfs_allocv() and devfs_reclaim().
> > > There is one place in zfs where the vnode is created and v_rdev
> > > is set without holding a lock, so nobody can dereference it.
> > > 
> > > As a consequence i believe that if in devfs_fp_check() we replace
> > > dev_lock() / dev_unlock() with VI_LOCK(vp) / VI_UNLOCK(vp),
> > > we make sure that we can safely dereference vp->v_rdev, and the
> > > cdev cannot go away because the vnode has a reference to it.
> > > The counter uses an atomic instruction (so the release is lockless)
> > Vnode reference, as well as cdev reference, which is obtained by
> > dev_ref(), do not provide any protection there.  v_rdev is only
> > coincidentally protected by the vnode interlock.
> > 
> > If you look at larger part of the code, you would note that dev mutex
> > is held even after v_rdev is dereferenced.  The real protection it
> > provides is against race with destroy_dev(l)(), which could invalidate
> > dev->so_devsw at any moment when either device thread reference is
> > not held, or dev mutex is not held.  So your patch breaks the
> > device destruction.
> 
> I see. Thanks for the clarification.
> 
> Would it help to rewrite the part of devvn_refthread as follows:
> 
>     devvn_refthread() {
> 	// protect vp->v_rdev dereference and dev disappearing
> 	VI_LOCK(vp);
> 	dev = vi->v_rdev;
> 	.. check that dev != NULL
> 	// protect the race on dev->si_devsw
> 	atomic_add_long(&dev->si_threadcount, 1);
> 	VI_UNLOCK(vp);
> 	... appropriate memory barrier
> 	csw = dev->si_devsw;
> 	if (csw != NULL) {
> 		// we won the race, even if destroy_devl() runs
> 		// it will keep the cdevsw alive until si_threadcount goes to 0
> 		*devp = dev;
> 		*ref = 1;
> 	} else {
> 		// we lost
> 		*ref = 0;
> 	}
> 	return csw;
>     }
> 
> i.e.  tentatively increment si_threadcount before dereferencing si_devsw
> and then restoring it if we lose the race ?
> It might be necessary to add a barrier in destroy_devl() between clearing
> si_devsw and reading si_threadcount.
Sure it is neccessary to add a barrier in destroy_devl() then.

>From the first look, this might work, but how would you handle the
possibility that cdev memory is already freed when you do the increment ?
The vnode interlock does not protect against it; if you mean that
vnode reclamation already happen and vp->v_rdev must be cleared, this
might need some additional argumentation.

Note that the real patch is more involved, since both dev_refthread()
and devvn_refthread() should be handled. Also, there is some ugly hack
in sound/clone.c.

> 
> > > 
> > > This should be enough to remove the contention.
> > If you never calls destroy_dev() for the devfs node, you could use
> > MAKEDEV_ETERNAL flag for make_dev_credf(), which indicates that there
> > is no risk of race with destroy_dev() for the created node.
> > 
> > Usually, code which could be compiled in kernel statically but also
> > loaded as module, use MAKEDEV_ETERNAL_KLD flag, which takes care of
> > module needed to call destroy_dev().
> 
> that I suppose would mean that the module cannot be safely unloaded ?

Well, you are not supposed to use MAKEDEV_ETERNAL for loadable modules
at all.  This is what MAKEDEV_ETERNAL_KLD ensures.