From owner-freebsd-fs@freebsd.org  Tue Jul  7 12:38:19 2015
Return-Path: <owner-freebsd-fs@freebsd.org>
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id F361A9954A8
 for <freebsd-fs@mailman.ysv.freebsd.org>; Tue,  7 Jul 2015 12:38:19 +0000 (UTC)
 (envelope-from kostikbel@gmail.com)
Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1])
 (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 999B21EFA
 for <freebsd-fs@freebsd.org>; Tue,  7 Jul 2015 12:38:19 +0000 (UTC)
 (envelope-from kostikbel@gmail.com)
Received: from tom.home (kostik@localhost [127.0.0.1])
 by kib.kiev.ua (8.14.9/8.14.9) with ESMTP id t67Cc90P008779
 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO);
 Tue, 7 Jul 2015 15:38:09 +0300 (EEST)
 (envelope-from kostikbel@gmail.com)
DKIM-Filter: OpenDKIM Filter v2.9.2 kib.kiev.ua t67Cc90P008779
Received: (from kostik@localhost)
 by tom.home (8.14.9/8.14.9/Submit) id t67Cc90i008778;
 Tue, 7 Jul 2015 15:38:09 +0300 (EEST)
 (envelope-from kostikbel@gmail.com)
X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com
 using -f
Date: Tue, 7 Jul 2015 15:38:09 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: freebsd-fs@freebsd.org
Subject: Re: atomic v_usecount and v_holdcnt
Message-ID: <20150707123809.GH2080@kib.kiev.ua>
References: <20141122002812.GA32289@dft-labs.eu>
 <20141122092527.GT17068@kib.kiev.ua>
 <20141122211147.GA23623@dft-labs.eu>
 <20141124095251.GH17068@kib.kiev.ua>
 <20150314225226.GA15302@dft-labs.eu>
 <20150316094643.GZ2379@kib.kiev.ua>
 <20150317014412.GA10819@dft-labs.eu>
 <20150318104442.GS2379@kib.kiev.ua>
 <20150625123156.GA29667@dft-labs.eu>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20150625123156.GA29667@dft-labs.eu>
User-Agent: Mutt/1.5.23 (2014-03-12)
X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00,
 DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no
 autolearn_force=no version=3.4.1
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs/>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 07 Jul 2015 12:38:20 -0000

On Thu, Jun 25, 2015 at 02:31:57PM +0200, Mateusz Guzik wrote:
> On Wed, Mar 18, 2015 at 12:44:42PM +0200, Konstantin Belousov wrote:
> > On Tue, Mar 17, 2015 at 02:44:12AM +0100, Mateusz Guzik wrote:
> > > On Mon, Mar 16, 2015 at 11:46:43AM +0200, Konstantin Belousov wrote:
> > > > On Sat, Mar 14, 2015 at 11:52:26PM +0100, Mateusz Guzik wrote:
> > > > > On Mon, Nov 24, 2014 at 11:52:52AM +0200, Konstantin Belousov wrote:
> > > > > > On Sat, Nov 22, 2014 at 10:11:47PM +0100, Mateusz Guzik wrote:
> > > > > > > On Sat, Nov 22, 2014 at 11:25:27AM +0200, Konstantin Belousov wrote:
> > > > > > > > On Sat, Nov 22, 2014 at 01:28:12AM +0100, Mateusz Guzik wrote:
> > > > > > > > > The idea is that we don't need an interlock as long as we don't
> > > > > > > > > transition either counter 1->0 or 0->1.
> > > > > > > > I already said that something along the lines of the patch should work.
> > > > > > > > In fact, you need vnode lock when hold count changes between 0 and 1,
> > > > > > > > and probably the same for use count.
> > > > > > > > 
> > > > > > > 
> > > > > > > I don't see why this would be required (not that I'm an VFS expert).
> > > > > > > vnode recycling seems to be protected with the interlock.
> > > > > > > 
> > > > > > > In fact I would argue that if this is really needed, current code is
> > > > > > > buggy.
> > > > > > Yes, it is already (somewhat) buggy.
> > > > > > 
> > > > > > Most need of the lock is for the case of counts coming from 1 to 0.
> > > > > > The reason is the handling of the active vnode list, which is used
> > > > > > for limiting the amount of vnode list walking in syncer.  When hold
> > > > > > count is decremented to 0, vnode is removed from the active list.
> > > > > > When use count is decremented to 0, vnode is supposedly inactivated,
> > > > > > and vinactive() cleans the cached pages belonging to vnode.  In other
> > > > > > words, VI_OWEINACT for dirty vnode is sort of bug.
> > > > > > 
> > > > > 
> > > > > Modified the patch to no longer have the usecount + interlock dropped +
> > > > > VI_OWEINACT set window.
> > > > > 
> > > > > Extended 0->1 hold count + vnode not locked window remains. I can fix
> > > > > that if it is really necessary by having _vhold return with interlock
> > > > > held if it did such transition.
> > > > 
> > > > In v_upgrade_usecount(), you call v_incr_devcount() without without interlock
> > > > held.  What prevents the devfs vnode from being recycled, in particular,
> > > > from invalidation of v_rdev pointer ?
> > > > 
> > > 
> > > Right, that was buggy. Fixed in the patch below.
> > Why non-atomicity of updates to several counters is safe ?  This at least
> > requires an explanation in the comment, I mean holdcnt/usecnt pair.
> > 
> 
> The patch below was tested with make -j 40 buildworld in a loop for 7 hours
> and it survived.
> 
> I started a comment above vget, unfinished yet.
Indeed, the following:
 * The hold count prevents the vnode from being freed, while the
 * use count prevents it from being recycled.
is not true.  usecount > 0 does not prevent recycling, it only gives a hint
to VFS to try to avoid recycling, but could be ignored, e.g. for forced
unmount.

> 
> Further playing around revealed that zfs will vref a vnode with no
> usecount (zfs_lookup -> zfs_dirlook -> zfs_dirent_lock -> zfs_zget ->
> VN_HOLD) and it is possible that it will have VI_OWEINACT set (tested on
> a kernel without my patch). VN_HOLD is defined as vref(). The code can
> sleep, so some shuffling around can be done to call vinactive() if it
> happens to be exclusively locked (but most of the time it is locked
> shared).
> 
> However, it seems that vputx deals with such consumers:
> if (vp->v_usecount > 0)
> 	vp->v_iflag &= ~VI_OWEINACT;
> 
> Given that there are possibly more consumers like zfs how about:
> In vputx assert that the flag is unset if the usecount went to > 0. Clear
> the flag in vref and vget if transitioning 0->1 and assert it is unset
> otherwise.
This should be fine, I would only be concerned with VI_OWEINACT not cleared
for unreferenced vnodes, see r284495.

> 
> The way I read it is that in the stock kernel with properly timed vref
> the flag would be cleared anyway, with vinactive() only called if it was
> done by vget and only with the vnode exclusively locked.
> 
> With a aforementioned change likelyhood of vinactive() remains the same,
> but now the flag state can be asserted.
> 
> > Assume the thread increased the v_usecount, but did not managed to
> > acquire dev_mtx. Another thread performs vrele() and progressed to
> > v_decr_devcount(). It decreases the si_usecount, which might allow yet
> > another thread to see the si_usecount as too low and start unwanted
> > action. I think that the tests for VCHR must be done at the very
> > start of the functions, and devfs vnodes must hold vnode interlock
> > unconditionally.
> > 
> 
> Inserted v_type != VCHR checks in relevant places, vi_usecount
> manipulation functions now assert that the interlock is held.
Ok.

> 
> > > 
> > > > I think that refcount_acquire_if_greater() KPI is excessive.  You always
> > > > calls acquire with val == 0, and release with val == 1.
> > > > 
> > > 
> > > Yea i noted in my prevoius e-mail it should be changed (see below).
> > > 
> > > I replaced them with refcount_acquire_if_not_zero and
> > > refcount_release_if_not_last.
> > I dislike the length of the names.  Can you propose something shorter ?
> > 
> 
> Unfortunately the original API is alreday quite verbose and I don't have
> anything readable which would retain "refcount_acquire" (instead of a
> "ref_get" or "ref_acq"). Adding "_nz" as a suffix does not look good
> ("refcount_acquire_if_nz").
There were some proposals given ?

BTW, refcount_acquire() has an aquire semantic.  It is debatable whether
the acq is needed there, or not.  But your if_not functions are not acq.

> 
> > The type for the local variable old in both functions should be u_int.
> > 
> 
> Done.
> 
> > > 
> > > > WRT to _refcount_release_lock, why is lock_object->lc_lock/lc_unlock KPI
> > > > cannot be used ? This allows to make refcount_release_lock() a function
> > > > instead of gcc extension macros.  Not to mention that the macro is unused.
> > > 
> > > These were supposed to be used by other code, forgot to remove it from
> > > the patch I sent here.
> > > 
> > > We can discuss this in another thread.
> > > 
> > > Striclty speaking we could use it here for vnode interlock, but I did
> > > not want to get around VI_LOCK macro (which right now is just a
> > > mtx_lock, but this may change).
> > > 
> > > Updated patch is below:
> > Do not introduce ASSERT_VI_LOCK, the name difference between
> > ASSERT_VI_LOCKED and ASSERT_VI_LOCK is only in the broken grammar.
> > I do not see anything wrong with explicit if() statements where needed,
> > in all four places.
> 
> Done.
> 
> > 
> > In vputx(), wrap the long line (if (refcount_release() || VI_DOINGINACT)).
> 
> Done.

Overall, the patch looks fine.

I still have to convince myself each time I read the following changes 
-		VI_LOCK(ddvp);
+		vhold(ddvp);
 		CACHE_RUNLOCK();
-		if (vget(ddvp, LK_INTERLOCK | LK_SHARED | LK_NOWAIT, curthread))
+		if (vget(ddvp, LK_SHARED | LK_NOWAIT | LK_VNHELD, curthread))
but I indeed cannot find anything wrong.  I.e. before it could fail if
the vnode is locked, now it can fail if the vnode is locked or reclaimed
meantime, which is surprisingly not harmful.