From: Konstantin Belousov <kib@FreeBSD.org>
Date: Sat, 3 Jun 2017 16:18:51 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: svn commit: r319539 - head/sys/ufs/ffs
Message-Id: <201706031618.v53GIpLC005431@repo.freebsd.org>
List-Id: SVN commit messages for the src tree for head/-current

Author: kib
Date: Sat Jun  3 16:18:50 2017
New Revision: 319539
URL: https://svnweb.freebsd.org/changeset/base/319539

Log:
  Mitigate several problems with softdep_request_cleanup() on a busy host.

  Problems start appearing when several threads all perform operations on
  a UFS volume and the SU workqueue needs a cleanup.  It is possible that
  each thread calling softdep_request_cleanup() owns the lock for some
  dirty vnode (e.g. all of them are executing mkdir(2), mknod(2),
  creat(2), etc.) and every vnode which must be flushed is locked by one
  of these threads.  Then all the threads enter softdep_request_cleanup()
  simultaneously.  There are two problems:

  - Several threads execute MNT_VNODE_FOREACH_ALL() loops in parallel.
    Due to the locking, they quickly start executing 'in phase' at the
    speed of the slowest thread.
  - Since each thread already owns the lock for a dirty vnode, the other
    threads' non-blocking attempts to lock that vnode fail, and the loops
    execute without making progress.  The retry logic does not allow the
    situation to recover.  The result is a livelock.

  Fix these problems by making the following changes:

  - Allow only one thread per mount point to enter the
    MNT_VNODE_FOREACH_ALL() loop.  A new flag FLUSH_RC_ACTIVE guards the
    loop.
  - If there were failed locking attempts during the loop, abort the
    retry even if there are still work items on the mount point's work
    list.  The assumption is that the items will be cleaned up when
    another thread either fsyncs its vnode, or unlocks it and allows yet
    another thread to make progress.

  It is now possible that some calls get an undeserved ENOSPC from
  ffs_alloc(), because the cleanup is not aggressive enough.  But I do
  not see how we can reliably clean up work items when
  softdep_request_cleanup() is called while still owning the vnode lock.
  I thought about a scheme where ffs_alloc() returns ERESTART and saves
  the retry counter somewhere in struct thread, so that the caller
  returns to the top level, unlocks the vnode and retries.  But IMO the
  very rare (and unproven) spurious ENOSPC is not worth the
  complications.

  Reported and tested by:	pho
  Style and comments by:	mckusick
  Reviewed by:	mckusick
  Sponsored by:	The FreeBSD Foundation
  MFC after:	2 weeks

Modified:
  head/sys/ufs/ffs/ffs_softdep.c
  head/sys/ufs/ffs/softdep.h

Modified: head/sys/ufs/ffs/ffs_softdep.c
==============================================================================
--- head/sys/ufs/ffs/ffs_softdep.c	Sat Jun  3 15:56:54 2017	(r319538)
+++ head/sys/ufs/ffs/ffs_softdep.c	Sat Jun  3 16:18:50 2017	(r319539)
@@ -901,6 +901,7 @@ static int pagedep_find(struct pagedep_hashhead *, ino
 	    struct pagedep **);
 static	void pause_timer(void *);
 static	int request_cleanup(struct mount *, int);
+static	int softdep_request_cleanup_flush(struct mount *, struct ufsmount *);
 static	void schedule_cleanup(struct mount *);
 static	void softdep_ast_cleanup_proc(struct thread *);
 static	int process_worklist_item(struct mount *, int, int);
@@ -13274,10 +13275,9 @@ softdep_request_cleanup(fs, vp, cred, resource)
 {
 	struct ufsmount *ump;
 	struct mount *mp;
-	struct vnode *lvp, *mvp;
 	long starttime;
 	ufs2_daddr_t needed;
-	int error;
+	int error, failed_vnode;
 
 	/*
 	 * If we are being called because of a process doing a
@@ -13368,41 +13368,88 @@ retry:
 	 * to the worklist that we can then process to reap addition
 	 * resources.  We walk the vnodes associated with the mount point
 	 * until we get the needed worklist requests that we can reap.
+	 *
+	 * If there are several threads all needing to clean the same
+	 * mount point, only one is allowed to walk the mount list.
+	 * When several threads all try to walk the same mount list,
+	 * they end up competing with each other and often end up in
+	 * livelock. This approach ensures that forward progress is
+	 * made at the cost of occational ENOSPC errors being returned
+	 * that might otherwise have been avoided.
 	 */
+	error = 1;
 	if ((resource == FLUSH_BLOCKS_WAIT &&
 	     fs->fs_cstotal.cs_nbfree <= needed) ||
 	    (resource == FLUSH_INODES_WAIT && fs->fs_pendinginodes > 0 &&
 	     fs->fs_cstotal.cs_nifree <= needed)) {
-		MNT_VNODE_FOREACH_ALL(lvp, mp, mvp) {
-			if (TAILQ_FIRST(&lvp->v_bufobj.bo_dirty.bv_hd) == 0) {
-				VI_UNLOCK(lvp);
-				continue;
+		ACQUIRE_LOCK(ump);
+		if ((ump->um_softdep->sd_flags & FLUSH_RC_ACTIVE) == 0) {
+			ump->um_softdep->sd_flags |= FLUSH_RC_ACTIVE;
+			FREE_LOCK(ump);
+			failed_vnode = softdep_request_cleanup_flush(mp, ump);
+			ACQUIRE_LOCK(ump);
+			ump->um_softdep->sd_flags &= ~FLUSH_RC_ACTIVE;
+			FREE_LOCK(ump);
+			if (ump->softdep_on_worklist > 0) {
+				stat_cleanup_retries += 1;
+				if (!failed_vnode)
+					goto retry;
 			}
-			if (vget(lvp, LK_EXCLUSIVE | LK_INTERLOCK | LK_NOWAIT,
-			    curthread))
-				continue;
-			if (lvp->v_vflag & VV_NOSYNC) {	/* unlinked */
-				vput(lvp);
-				continue;
-			}
-			(void) ffs_syncvnode(lvp, MNT_NOWAIT, 0);
-			vput(lvp);
+		} else {
+			FREE_LOCK(ump);
+			error = 0;
 		}
-		lvp = ump->um_devvp;
-		if (vn_lock(lvp, LK_EXCLUSIVE | LK_NOWAIT) == 0) {
-			VOP_FSYNC(lvp, MNT_NOWAIT, curthread);
-			VOP_UNLOCK(lvp, 0);
-		}
-		if (ump->softdep_on_worklist > 0) {
-			stat_cleanup_retries += 1;
-			goto retry;
-		}
 		stat_cleanup_failures += 1;
 	}
 	if (time_second - starttime > stat_cleanup_high_delay)
 		stat_cleanup_high_delay = time_second - starttime;
 	UFS_LOCK(ump);
-	return (1);
+	return (error);
+}
+
+/*
+ * Scan the vnodes for the specified mount point flushing out any
+ * vnodes that can be locked without waiting. Finally, try to flush
+ * the device associated with the mount point if it can be locked
+ * without waiting.
+ *
+ * We return 0 if we were able to lock every vnode in our scan.
+ * If we had to skip one or more vnodes, we return 1.
+ */
+static int
+softdep_request_cleanup_flush(mp, ump)
+	struct mount *mp;
+	struct ufsmount *ump;
+{
+	struct thread *td;
+	struct vnode *lvp, *mvp;
+	int failed_vnode;
+
+	failed_vnode = 0;
+	td = curthread;
+	MNT_VNODE_FOREACH_ALL(lvp, mp, mvp) {
+		if (TAILQ_FIRST(&lvp->v_bufobj.bo_dirty.bv_hd) == 0) {
+			VI_UNLOCK(lvp);
+			continue;
+		}
+		if (vget(lvp, LK_EXCLUSIVE | LK_INTERLOCK | LK_NOWAIT,
+		    td) != 0) {
+			failed_vnode = 1;
+			continue;
+		}
+		if (lvp->v_vflag & VV_NOSYNC) {	/* unlinked */
+			vput(lvp);
+			continue;
+		}
+		(void) ffs_syncvnode(lvp, MNT_NOWAIT, 0);
+		vput(lvp);
+	}
+	lvp = ump->um_devvp;
+	if (vn_lock(lvp, LK_EXCLUSIVE | LK_NOWAIT) == 0) {
+		VOP_FSYNC(lvp, MNT_NOWAIT, td);
+		VOP_UNLOCK(lvp, 0);
+	}
+	return (failed_vnode);
+}
 
 static bool

Modified: head/sys/ufs/ffs/softdep.h
==============================================================================
--- head/sys/ufs/ffs/softdep.h	Sat Jun  3 15:56:54 2017	(r319538)
+++ head/sys/ufs/ffs/softdep.h	Sat Jun  3 16:18:50 2017	(r319539)
@@ -1065,6 +1065,7 @@ struct mount_softdeps {
 #define	FLUSH_EXIT	0x0001	/* time to exit */
 #define	FLUSH_CLEANUP	0x0002	/* need to clear out softdep structures */
 #define	FLUSH_STARTING	0x0004	/* flush thread not yet started */
+#define	FLUSH_RC_ACTIVE	0x0008	/* a thread is flushing the mount point */
 
 /*
  * Keep the old names from when these were in the ufsmount structure.
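
For illustration of the locking pattern used above: the FLUSH_RC_ACTIVE change amounts to a single-flusher gate. The first thread to request a cleanup sets a flag under the softdep lock, walks the vnode list with the lock dropped, then clears the flag; concurrent requesters see the flag already set and skip the walk instead of competing for the same vnodes. The following sketch models that pattern in a standalone userspace C program with pthreads. It is only an analogue under stated assumptions, not the kernel code from the diff, and all names in it (cleanup_active, request_cleanup, worker, worklist) are invented for the example.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t sd_lock = PTHREAD_MUTEX_INITIALIZER;
static bool cleanup_active = false;	/* userspace stand-in for FLUSH_RC_ACTIVE */
static int worklist = 100;		/* pretend pending work items */

/*
 * Returns 1 if this thread performed the flush, 0 if another thread was
 * already flushing and this one skipped the walk.
 */
static int
request_cleanup(long tid)
{
	pthread_mutex_lock(&sd_lock);
	if (cleanup_active) {
		/* Another thread is already walking the list; do not compete. */
		pthread_mutex_unlock(&sd_lock);
		printf("thread %ld skipped the walk\n", tid);
		return (0);
	}
	cleanup_active = true;
	pthread_mutex_unlock(&sd_lock);

	/* "Walk the vnode list" with the lock dropped. */
	usleep(1000);

	pthread_mutex_lock(&sd_lock);
	worklist = (worklist >= 10) ? worklist - 10 : 0;
	cleanup_active = false;
	printf("thread %ld flushed, %d items left\n", tid, worklist);
	pthread_mutex_unlock(&sd_lock);
	return (1);
}

static void *
worker(void *arg)
{
	request_cleanup((long)arg);
	return (NULL);
}

int
main(void)
{
	pthread_t tids[4];
	long i;

	for (i = 0; i < 4; i++)
		pthread_create(&tids[i], NULL, worker, (void *)i);
	for (i = 0; i < 4; i++)
		pthread_join(tids[i], NULL);
	return (0);
}

Built with "cc -o gate gate.c -lpthread", the output shows some threads reporting a flush and the rest reporting a skip, depending on scheduling; the point is that the walks never run concurrently, which is the property the commit relies on to avoid the livelock, at the cost of the skipping threads possibly seeing a spurious out-of-space condition.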