Date:      Fri, 5 Jul 2013 04:02:15 +0000 (UTC)
From:      Xin LI <delphij@FreeBSD.org>
To:        src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-stable@freebsd.org, svn-src-stable-8@freebsd.org
Subject:   svn commit: r252763 - in stable/8/sys/cddl/contrib/opensolaris/uts/common: fs/zfs fs/zfs/sys sys/fm/fs
Message-ID:  <201307050402.r6542FrE073379@svn.freebsd.org>

Author: delphij
Date: Fri Jul  5 04:02:15 2013
New Revision: 252763
URL: http://svnweb.freebsd.org/changeset/base/252763

Log:
  MFC r251636: illumos #3749 zfs event processing should work on R/O root
  filesystems
  
  This log is a modified version of the original one written by gibbs@,
  to account for changes made during the illumos RTI process.
  
  Allow ZFS asynchronous event handling to proceed even if the root file
  system is mounted read-only.  This restriction appears to have been put
  in place to avoid errors with updating the configuration cache file.
  However:
  
   o The majority of asynchronous event handling does not involve
     configuration cache file updates.
   o The configuration cache file need not be on the root file system,
     so the check was not complete.
   o Other classes of errors (e.g. file system full) can also prevent
     a successful update yet do not prevent asynchronous event processing.
   o Configurations such as NanoBSD never have a read-write root,
     so ZFS event processing is permanently disabled in these systems.
   o Failure to handle asynchronous events promptly can extend the
     window of time that a pool is in a critical state.
  
  At worst, a missed configuration cache update will force the operator to
  perform a manual "zpool import" (note -f is not required) to inform the
  system about a newly created pool.  To minimize the likelihood of this
  rare occurrence, configuration cache write failures now emit FMA events
  (via devctl) so the operator can take corrective action, and the write
  is retried every 5 minutes.  The retry interval, in seconds, is tunable
  via the sysctl "vfs.zfs.ccw_retry_interval".
  
  As a side effect of reporting configuration cache events, other sysevents,
  such as resilver start/stop, are now also reported via devctl.
  
  sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:
  	o As is done in zfs_fm.c, provide a manual declaration for
  	  devctl_notify().  Both declarations could be combined
  	  into spa_impl.h, but the declaration is fault management
  	  related, not spa specific.  sys/fm/fs/zfs.h would be ideal
  	  if it weren't so public and reserved for FMA string
  	  definitions.  I'm open to suggestions on how to improve
  	  this nit while minimizing our divergence from Solaris.
  	o Use devctl_notify() to implement sysevent support in
  	  spa_event_notify().  The subsystem is EC_ZFS so that
  	  these events can never collide with those emitted in
  	  zfs_fm.c.
  	o Add the sysctl "vfs.zfs.ccw_retry_interval".  The value
  	  defaults to 5 minutes and is used to rate limit, on a
  	  per-pool basis, configuration cache file write attempts.
  	o Modify spa_async_dispatch to honor configuration cache
  	  write limiting.  If other events are pending, a configuration
  	  cache write will be attempted at the same time, so the
  	  rate limiting only applies when the asynchronous dispatch
  	  system is otherwise idle.  Async events should be rare
  	  (e.g. device arrival/departure) and configuration cache
  	  writes rarer, so a more complicated system to strictly
  	  honor the retry limit seems unwarranted.
  	o Remove check in spa_async_dispatch() for the root file
  	  system being read-write.
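
  The dispatch rule described above can be modelled outside the kernel.
  What follows is a minimal user-space sketch, not the kernel code: the
  struct, the flag values and the function name are hypothetical
  stand-ins, and the current time is passed in as a parameter instead of
  being read with gethrtime().  It only illustrates that a pending
  configuration cache write is held back while the retry interval has
  not elapsed, and that any other pending task makes dispatch proceed
  regardless.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NANOSEC             1000000000LL /* nanoseconds per second */
#define ASYNC_CONFIG_UPDATE 0x01u        /* hypothetical: config cache write */
#define ASYNC_OTHER         0x02u        /* hypothetical: any other async task */

/* Hypothetical stand-in for the few spa fields the check needs. */
struct pool_state {
	unsigned tasks;          /* bitmask of pending async tasks */
	int64_t  ccw_fail_time;  /* ns timestamp of last failed write, 0 = none */
};

/*
 * Return true when the async thread should be started: either a
 * non-config task is pending, or a config cache write is pending and
 * is not currently held back by the retry interval.
 */
static bool
tasks_pending(const struct pool_state *ps, int64_t now_ns, int retry_interval_s)
{
	unsigned non_config = ps->tasks & ~ASYNC_CONFIG_UPDATE;
	unsigned config = ps->tasks & ASYNC_CONFIG_UPDATE;
	bool suspended = false;

	assert(retry_interval_s >= 0);

	/* Rate limiting applies only after a recorded write failure. */
	if (ps->ccw_fail_time != 0) {
		suspended = (now_ns - ps->ccw_fail_time) <
		    (int64_t)retry_interval_s * NANOSEC;
	}

	return (non_config != 0 || (config != 0 && !suspended));
}
```

  With a 300 second interval and a failure recorded at t = 1s, a lone
  config write is not dispatched at t = 100s but is at t = 301s.  Setting
  the other-task flag makes the call return true at any time.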
  
  sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_config.c:
  	Instead of silently ignoring configuration cache write
  	failures, report them via a new FMA event as well as
  	to the console.  The current zfs_ereport_post() doesn't
  	allow arbitrary name=value pairs to be appended to the
  	report, so the configuration cache file name is only
  	available on the console output.  This limitation should
  	be addressed in a future update.
  
  	Note: This error report is only posted once per incident,
  	to avoid spamming.
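
  The once-per-incident behaviour can be sketched as a small state
  machine.  This is an illustrative user-space model under stated
  assumptions, not the kernel implementation: the struct, the counter
  standing in for an ereport being posted, and the function name are all
  hypothetical.  A report is counted only on the transition from "no
  outstanding failure" to "failed", the failure timestamp is refreshed
  on every failed pass, and a successful pass clears it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-pool bookkeeping for config cache write failures. */
struct ccw_state {
	int64_t  fail_time; /* ns timestamp of last failure, 0 = none */
	unsigned reports;   /* stand-in for the number of reports posted */
};

/*
 * Record the outcome of one config cache sync pass.
 * Returns true when a retry should be requested.
 */
static bool
ccw_record_result(struct ccw_state *st, bool write_failed, int64_t now_ns)
{
	/* A zero timestamp is reserved to mean "no failure". */
	assert(now_ns > 0);

	if (!write_failed) {
		/* Success ends the incident and lifts rate limiting. */
		st->fail_time = 0;
		return (false);
	}

	/* Report only the first failure of an incident. */
	if (st->fail_time == 0)
		st->reports++;

	/* Every failure restarts the retry interval. */
	st->fail_time = now_ns;
	return (true);
}
```

  Two consecutive failures therefore count one report, and a failure
  that follows a successful pass starts a new incident with its own
  report.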
  
  sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa_impl.h:
  	Add a hrtime_t to the spa data structure to track the
  	time (via gethrtime()) of the last configuration cache file
  	write failure.  This is referenced in spa_async_dispatch()
  	to effect the rate limiting.
  
  sys/cddl/contrib/opensolaris/uts/common/sys/fm/fs/zfs.h:
  	  Add FM_EREPORT_ZFS_CONFIG_CACHE_WRITE as an ereport class.
  
  Submitted by:	gibbs
  Reviewed by:	Matthew Ahrens <mahrens@delphix.com>,
  		Eric Schrock <eric.schrock@delphix.com>,
  		Christopher Siden <christopher.siden@delphix.com>
  Sponsored by:	Spectra Logic

Modified:
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_config.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa_impl.h
  stable/8/sys/cddl/contrib/opensolaris/uts/common/sys/fm/fs/zfs.h
Directory Properties:
  stable/8/sys/   (props changed)
  stable/8/sys/cddl/   (props changed)
  stable/8/sys/cddl/contrib/opensolaris/   (props changed)

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c	Fri Jul  5 04:01:25 2013	(r252762)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c	Fri Jul  5 04:02:15 2013	(r252763)
@@ -88,6 +88,12 @@ TUNABLE_INT("vfs.zfs.check_hostid", &che
 SYSCTL_INT(_vfs_zfs, OID_AUTO, check_hostid, CTLFLAG_RW, &check_hostid, 0,
     "Check hostid on import?");
 
+/*
+ * The interval, in seconds, at which failed configuration cache file writes
+ * should be retried.
+ */
+static int zfs_ccw_retry_interval = 300;
+
 typedef enum zti_modes {
 	zti_mode_fixed,			/* value is # of threads (min 1) */
 	zti_mode_online_percent,	/* value is % of online CPUs */
@@ -5852,13 +5858,34 @@ spa_async_resume(spa_t *spa)
 	mutex_exit(&spa->spa_async_lock);
 }
 
+static boolean_t
+spa_async_tasks_pending(spa_t *spa)
+{
+	uint_t non_config_tasks;
+	uint_t config_task;
+	boolean_t config_task_suspended;
+
+	non_config_tasks = spa->spa_async_tasks & ~SPA_ASYNC_CONFIG_UPDATE;
+	config_task = spa->spa_async_tasks & SPA_ASYNC_CONFIG_UPDATE;
+	if (spa->spa_ccw_fail_time == 0) {
+		config_task_suspended = B_FALSE;
+	} else {
+		config_task_suspended =
+		    (gethrtime() - spa->spa_ccw_fail_time) <
+		    (zfs_ccw_retry_interval * NANOSEC);
+	}
+
+	return (non_config_tasks || (config_task && !config_task_suspended));
+}
+
 static void
 spa_async_dispatch(spa_t *spa)
 {
 	mutex_enter(&spa->spa_async_lock);
-	if (spa->spa_async_tasks && !spa->spa_async_suspended &&
+	if (spa_async_tasks_pending(spa) &&
+	    !spa->spa_async_suspended &&
 	    spa->spa_async_thread == NULL &&
-	    rootdir != NULL && !vn_is_readonly(rootdir))
+	    rootdir != NULL)
 		spa->spa_async_thread = thread_create(NULL, 0,
 		    spa_async_thread, spa, 0, &p0, TS_RUN, maxclsyspri);
 	mutex_exit(&spa->spa_async_lock);

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_config.c
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_config.c	Fri Jul  5 04:01:25 2013	(r252762)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_config.c	Fri Jul  5 04:02:15 2013	(r252763)
@@ -27,6 +27,7 @@
 
 #include <sys/zfs_context.h>
 #include <sys/spa.h>
+#include <sys/fm/fs/zfs.h>
 #include <sys/spa_impl.h>
 #include <sys/nvpair.h>
 #include <sys/uio.h>
@@ -139,7 +140,7 @@ out:
 	kobj_close_file(file);
 }
 
-static void
+static int
 spa_config_write(spa_config_dirent_t *dp, nvlist_t *nvl)
 {
 	size_t buflen;
@@ -147,13 +148,14 @@ spa_config_write(spa_config_dirent_t *dp
 	vnode_t *vp;
 	int oflags = FWRITE | FTRUNC | FCREAT | FOFFMAX;
 	char *temp;
+	int err;
 
 	/*
 	 * If the nvlist is empty (NULL), then remove the old cachefile.
 	 */
 	if (nvl == NULL) {
-		(void) vn_remove(dp->scd_path, UIO_SYSSPACE, RMFILE);
-		return;
+		err = vn_remove(dp->scd_path, UIO_SYSSPACE, RMFILE);
+		return (err);
 	}
 
 	/*
@@ -174,12 +176,14 @@ spa_config_write(spa_config_dirent_t *dp
 	 */
 	(void) snprintf(temp, MAXPATHLEN, "%s.tmp", dp->scd_path);
 
-	if (vn_open(temp, UIO_SYSSPACE, oflags, 0644, &vp, CRCREAT, 0) == 0) {
-		if (vn_rdwr(UIO_WRITE, vp, buf, buflen, 0, UIO_SYSSPACE,
-		    0, RLIM64_INFINITY, kcred, NULL) == 0 &&
-		    VOP_FSYNC(vp, FSYNC, kcred, NULL) == 0) {
-			(void) vn_rename(temp, dp->scd_path, UIO_SYSSPACE);
-		}
+	err = vn_open(temp, UIO_SYSSPACE, oflags, 0644, &vp, CRCREAT, 0);
+	if (err == 0) {
+		err = vn_rdwr(UIO_WRITE, vp, buf, buflen, 0, UIO_SYSSPACE,
+		    0, RLIM64_INFINITY, kcred, NULL);
+		if (err == 0)
+			err = VOP_FSYNC(vp, FSYNC, kcred, NULL);
+		if (err == 0)
+			err = vn_rename(temp, dp->scd_path, UIO_SYSSPACE);
 		(void) VOP_CLOSE(vp, oflags, 1, 0, kcred, NULL);
 	}
 
@@ -187,6 +191,7 @@ spa_config_write(spa_config_dirent_t *dp
 
 	kmem_free(buf, buflen);
 	kmem_free(temp, MAXPATHLEN);
+	return (err);
 }
 
 /*
@@ -198,6 +203,8 @@ spa_config_sync(spa_t *target, boolean_t
 {
 	spa_config_dirent_t *dp, *tdp;
 	nvlist_t *nvl;
+	boolean_t ccw_failure;
+	int error;
 
 	ASSERT(MUTEX_HELD(&spa_namespace_lock));
 
@@ -209,6 +216,7 @@ spa_config_sync(spa_t *target, boolean_t
 	 * cachefile is changed, the new one is pushed onto this list, allowing
 	 * us to update previous cachefiles that no longer contain this pool.
 	 */
+	ccw_failure = B_FALSE;
 	for (dp = list_head(&target->spa_config_list); dp != NULL;
 	    dp = list_next(&target->spa_config_list, dp)) {
 		spa_t *spa = NULL;
@@ -241,10 +249,32 @@ spa_config_sync(spa_t *target, boolean_t
 			mutex_exit(&spa->spa_props_lock);
 		}
 
-		spa_config_write(dp, nvl);
+		error = spa_config_write(dp, nvl);
+		if (error != 0)
+			ccw_failure = B_TRUE;
 		nvlist_free(nvl);
 	}
 
+	if (ccw_failure) {
+		/*
+		 * Keep trying so that configuration data is
+		 * written if/when any temporary filesystem
+		 * resource issues are resolved.
+		 */
+		if (target->spa_ccw_fail_time == 0) {
+			zfs_ereport_post(FM_EREPORT_ZFS_CONFIG_CACHE_WRITE,
+			    target, NULL, NULL, 0, 0);
+		}
+		target->spa_ccw_fail_time = gethrtime();
+		spa_async_request(target, SPA_ASYNC_CONFIG_UPDATE);
+	} else {
+		/*
+		 * Do not rate limit future attempts to update
+		 * the config cache.
+		 */
+		target->spa_ccw_fail_time = 0;
+	}
+
 	/*
 	 * Remove any config entries older than the current one.
 	 */

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa_impl.h
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa_impl.h	Fri Jul  5 04:01:25 2013	(r252762)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa_impl.h	Fri Jul  5 04:02:15 2013	(r252763)
@@ -241,6 +241,7 @@ struct spa {
 	uint64_t	spa_deadman_calls;	/* number of deadman calls */
 	uint64_t	spa_sync_starttime;	/* starting time fo spa_sync */
 	uint64_t	spa_deadman_synctime;	/* deadman expiration timer */
+	hrtime_t	spa_ccw_fail_time;	/* Conf cache write fail time */
 	/*
 	 * spa_refcount & spa_config_lock must be the last elements
 	 * because refcount_t changes size based on compilation options.

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/sys/fm/fs/zfs.h
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/sys/fm/fs/zfs.h	Fri Jul  5 04:01:25 2013	(r252762)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/sys/fm/fs/zfs.h	Fri Jul  5 04:02:15 2013	(r252763)
@@ -46,6 +46,7 @@ extern "C" {
 #define	FM_EREPORT_ZFS_IO_FAILURE		"io_failure"
 #define	FM_EREPORT_ZFS_PROBE_FAILURE		"probe_failure"
 #define	FM_EREPORT_ZFS_LOG_REPLAY		"log_replay"
+#define	FM_EREPORT_ZFS_CONFIG_CACHE_WRITE	"config_cache_write"
 
 #define	FM_EREPORT_PAYLOAD_ZFS_POOL		"pool"
 #define	FM_EREPORT_PAYLOAD_ZFS_POOL_FAILMODE	"pool_failmode"


