From owner-freebsd-current@FreeBSD.ORG Mon Nov 10 09:40:11 2014 Return-Path: Delivered-To: current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 6DAA25DD; Mon, 10 Nov 2014 09:40:11 +0000 (UTC) Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id E7E3C8D5; Mon, 10 Nov 2014 09:40:10 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id BCD3A7300A; Mon, 10 Nov 2014 10:44:12 +0100 (CET) Date: Mon, 10 Nov 2014 10:44:12 +0100 From: Luigi Rizzo To: Konstantin Belousov Subject: Re: dev_lock() contention for fdesc syscalls -- possible fix Message-ID: <20141110094412.GA25189@onelab2.iet.unipi.it> References: <20141110014939.GA21626@onelab2.iet.unipi.it> <20141110083457.GD53947@kib.kiev.ua> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20141110083457.GD53947@kib.kiev.ua> User-Agent: Mutt/1.5.20 (2009-06-14) Cc: jmg@freebsd.org, gnn@freebsd.org, adrian@freebsd.org, current@freebsd.org X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 10 Nov 2014 09:40:11 -0000 On Mon, Nov 10, 2014 at 10:34:57AM +0200, Konstantin Belousov wrote: > On Mon, Nov 10, 2014 at 02:49:39AM +0100, Luigi Rizzo wrote: > > It was noticed that there is huge dev_lock() contention when multiple > > processes do a poll() even on independent file descriptors. > > > > Turns out that not just poll but most syscalls on file descriptors > > (as opposed to sockets) in sys/fs/devfs/devfs_vnops.c including > > devfs_poll_f(), devfs_ioctl_f() and read/write share the problem > > as they use the following pattern > > > > devfs_poll_f() { > > ... > > devfs_fp_check(fp, ...) --> > > kern/kern_conf.c :: devvn_refthread(fp->f_vnode, ...) --> > > dev_lock(); > > dev = vp->v_rdev; // lock on vp ? > > ... check that dev != NULL > > atomic_add_long(&dev->si_threadcount, 1); > > dev_unlock(); > > dsw->d_poll() > > dev_relthread() --> > > atomic_subtract_rel_long(&dev->si_threadcount, 1); > > } > > > > > > I believe that dev_lock() in devvn_refthread() is protecting > > dev = vp->v_rdev > > (the cdev entry 'dev' cannot go away for the reasons stated below). > > > > However looking at places where vp->v_rdev is modified, turns out > > that it only happens when holding VI_LOCK(vp) -- the places are > > devfs_allocv() and devfs_reclaim(). > > There is one place in zfs where the vnode is created and v_rdev > > is set without holding a lock, so nobody can dereference it. > > > > As a consequence i believe that if in devfs_fp_check() we replace > > dev_lock() / dev_unlock() with VI_LOCK(vp) / VI_UNLOCK(vp), > > we make sure that we can safely dereference vp->v_rdev, and the > > cdev cannot go away because the vnode has a reference to it. > > The counter uses an atomic instruction (so the release is lockless) > Vnode reference, as well as cdev reference, which is obtained by > dev_ref(), do not provide any protection there. v_rdev is only > coincidentally protected by the vnode interlock. > > If you look at larger part of the code, you would note that dev mutex > is held even after v_rdev is dereferenced. The real protection it > provides is against race with destroy_dev(l)(), which could invalidate > dev->so_devsw at any moment when either device thread reference is > not held, or dev mutex is not held. So your patch breaks the > device destruction. I see. Thanks for the clarification. Would it help to rewrite the part of devvn_refthread as follows: devvn_refthread() { // protect vp->v_rdev dereference and dev disappearing VI_LOCK(vp); dev = vi->v_rdev; .. check that dev != NULL // protect the race on dev->si_devsw atomic_add_long(&dev->si_threadcount, 1); VI_UNLOCK(vp); ... appropriate memory barrier csw = dev->si_devsw; if (csw != NULL) { // we won the race, even if destroy_devl() runs // it will keep the cdevsw alive until si_threadcount goes to 0 *devp = dev; *ref = 1; } else { // we lost *ref = 0; } return csw; } i.e. tentatively increment si_threadcount before dereferencing si_devsw and then restoring it if we lose the race ? It might be necessary to add a barrier in destroy_devl() between clearing si_devsw and reading si_threadcount. > > > > This should be enough to remove the contention. > If you never calls destroy_dev() for the devfs node, you could use > MAKEDEV_ETERNAL flag for make_dev_credf(), which indicates that there > is no risk of race with destroy_dev() for the created node. > > Usually, code which could be compiled in kernel statically but also > loaded as module, use MAKEDEV_ETERNAL_KLD flag, which takes care of > module needed to call destroy_dev(). that I suppose would mean that the module cannot be safely unloaded ? cheers luigi