Date: Tue, 31 Jul 2018 21:06:05 +0000 (UTC) From: Alexander Motin <mav@FreeBSD.org> To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: svn commit: r337007 - in head: cddl/contrib/opensolaris/cmd/zpool cddl/contrib/opensolaris/cmd/ztest cddl/contrib/opensolaris/lib/libzfs/common cddl/contrib/opensolaris/lib/libzfs_core/common sys/c... Message-ID: <201807312106.w6VL65J6038222@repo.freebsd.org>
next in thread | raw e-mail | index | archive | help
Author: mav Date: Tue Jul 31 21:06:04 2018 New Revision: 337007 URL: https://svnweb.freebsd.org/changeset/base/337007 Log: MFV r336991, r337001: 9102 zfs should be able to initialize storage devices The first access to a disk block can incur a performance penalty on some platforms (e.g. AWS's EBS, VMware VMDKs). Therefore it is recommended that volumes be "thick provisioned", where supported by the platform (VMware). Thick provisioning is time consuming and often is ignored. If the thick provision step is omitted, customers will see suboptimal performance until we have written to all parts of the LUN. ZFS should be able to initialize any unused storage to remove any first-write penalty that exists. illumos/illumos-gate@094e47e980b0796b94b1b8f51f462a64d246e516 Reviewed by: John Wren Kennedy <john.kennedy@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: George Wilson <george.wilson@delphix.com> Added: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_initialize.h - copied unchanged from r337001, vendor-sys/illumos/dist/uts/common/fs/zfs/sys/vdev_initialize.h head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_initialize.c - copied, changed from r337001, vendor-sys/illumos/dist/uts/common/fs/zfs/vdev_initialize.c Modified: head/cddl/contrib/opensolaris/cmd/zpool/zpool.8 head/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c head/cddl/contrib/opensolaris/cmd/ztest/ztest.c head/cddl/contrib/opensolaris/lib/libzfs/common/libzfs.h head/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c head/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_util.c head/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.c head/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.h head/sys/cddl/contrib/opensolaris/uts/common/Makefile.files head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_priority.h head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_disk.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_indirect.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_removal.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c head/sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h head/sys/conf/files Directory Properties: head/cddl/contrib/opensolaris/ (props changed) head/cddl/contrib/opensolaris/lib/libzfs/ (props changed) head/sys/cddl/contrib/opensolaris/ (props changed) Modified: head/cddl/contrib/opensolaris/cmd/zpool/zpool.8 ============================================================================== --- head/cddl/contrib/opensolaris/cmd/zpool/zpool.8 Tue Jul 31 21:02:45 2018 (r337006) +++ head/cddl/contrib/opensolaris/cmd/zpool/zpool.8 Tue Jul 31 21:06:04 2018 (r337007) @@ -121,6 +121,11 @@ .Ar pool | id .Op Ar newpool .Nm +.Cm initialize +.Op Fl cs +.Ar pool +.Op Ar device Ns ... +.Nm .Cm iostat .Op Fl T Cm d Ns | Ns Cm u .Op Fl v @@ -1434,6 +1439,32 @@ mounting option is enabled. In this case, the checkpointed state of the pool is opened and an administrator can see how the pool would look like if they were to fully rewind. +.El +.It Xo +.Nm +.Cm initialize +.Op Fl cs +.Ar pool +.Op Ar device Ns ... +.Xc +Begins initializing by writing to all unallocated regions on the specified +devices, or all eligible devices in the pool if no individual devices are +specified. +Only leaf data or log devices may be initialized. +.Bl -tag -width Ds +.It Fl c, -cancel +Cancel initializing on the specified devices, or all eligible devices if none +are specified. +If one or more target devices are invalid or are not currently being +initialized, the command will fail and no cancellation will occur on any device. +.It Fl s -suspend +Suspend initializing on the specified devices, or all eligible devices if none +are specified. +If one or more target devices are invalid or are not currently being +initialized, the command will fail and no suspension will occur on any device. +Initializing can then be resumed by running +.Nm zpool Cm initialize +with no flags on the relevant target devices. .El .It Xo .Nm Modified: head/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c ============================================================================== --- head/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c Tue Jul 31 21:02:45 2018 (r337006) +++ head/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c Tue Jul 31 21:06:04 2018 (r337007) @@ -87,6 +87,7 @@ static int zpool_do_detach(int, char **); static int zpool_do_replace(int, char **); static int zpool_do_split(int, char **); +static int zpool_do_initialize(int, char **); static int zpool_do_scrub(int, char **); static int zpool_do_import(int, char **); @@ -136,6 +137,7 @@ typedef enum { HELP_ONLINE, HELP_REPLACE, HELP_REMOVE, + HELP_INITIALIZE, HELP_SCRUB, HELP_STATUS, HELP_UPGRADE, @@ -187,6 +189,7 @@ static zpool_command_t command_table[] = { { "replace", zpool_do_replace, HELP_REPLACE }, { "split", zpool_do_split, HELP_SPLIT }, { NULL }, + { "initialize", zpool_do_initialize, HELP_INITIALIZE }, { "scrub", zpool_do_scrub, HELP_SCRUB }, { NULL }, { "import", zpool_do_import, HELP_IMPORT }, @@ -261,6 +264,8 @@ get_usage(zpool_help_t idx) return (gettext("\tremove [-nps] <pool> <device> ...\n")); case HELP_REOPEN: return (gettext("\treopen <pool>\n")); + case HELP_INITIALIZE: + return (gettext("\tinitialize [-cs] <pool> [<device> ...]\n")); case HELP_SCRUB: return (gettext("\tscrub [-s | -p] <pool> ...\n")); case HELP_STATUS: @@ -1650,6 +1655,43 @@ print_status_config(zpool_handle_t *zhp, const char *n "resilvering" : "repairing"); } + if ((vs->vs_initialize_state == VDEV_INITIALIZE_ACTIVE || + vs->vs_initialize_state == VDEV_INITIALIZE_SUSPENDED || + vs->vs_initialize_state == VDEV_INITIALIZE_COMPLETE) && + !vs->vs_scan_removing) { + char zbuf[1024]; + char tbuf[256]; + struct tm zaction_ts; + + time_t t = vs->vs_initialize_action_time; + int initialize_pct = 100; + if (vs->vs_initialize_state != VDEV_INITIALIZE_COMPLETE) { + initialize_pct = (vs->vs_initialize_bytes_done * 100 / + (vs->vs_initialize_bytes_est + 1)); + } + + (void) localtime_r(&t, &zaction_ts); + (void) strftime(tbuf, sizeof (tbuf), "%c", &zaction_ts); + + switch (vs->vs_initialize_state) { + case VDEV_INITIALIZE_SUSPENDED: + (void) snprintf(zbuf, sizeof (zbuf), + ", suspended, started at %s", tbuf); + break; + case VDEV_INITIALIZE_ACTIVE: + (void) snprintf(zbuf, sizeof (zbuf), + ", started at %s", tbuf); + break; + case VDEV_INITIALIZE_COMPLETE: + (void) snprintf(zbuf, sizeof (zbuf), + ", completed at %s", tbuf); + break; + } + + (void) printf(gettext(" (%d%% initialized%s)"), + initialize_pct, zbuf); + } + (void) printf("\n"); for (c = 0; c < children; c++) { @@ -4236,6 +4278,119 @@ zpool_do_scrub(int argc, char **argv) } return (for_each_pool(argc, argv, B_TRUE, NULL, scrub_callback, &cb)); +} + +static void +zpool_collect_leaves(zpool_handle_t *zhp, nvlist_t *nvroot, nvlist_t *res) +{ + uint_t children = 0; + nvlist_t **child; + uint_t i; + + (void) nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_CHILDREN, + &child, &children); + + if (children == 0) { + char *path = zpool_vdev_name(g_zfs, zhp, nvroot, B_FALSE); + fnvlist_add_boolean(res, path); + free(path); + return; + } + + for (i = 0; i < children; i++) { + zpool_collect_leaves(zhp, child[i], res); + } +} + +/* + * zpool initialize [-cs] <pool> [<vdev> ...] + * Initialize all unused blocks in the specified vdevs, or all vdevs in the pool + * if none specified. + * + * -c Cancel. Ends active initializing. + * -s Suspend. Initializing can then be restarted with no flags. + */ +int +zpool_do_initialize(int argc, char **argv) +{ + int c; + char *poolname; + zpool_handle_t *zhp; + nvlist_t *vdevs; + int err = 0; + + struct option long_options[] = { + {"cancel", no_argument, NULL, 'c'}, + {"suspend", no_argument, NULL, 's'}, + {0, 0, 0, 0} + }; + + pool_initialize_func_t cmd_type = POOL_INITIALIZE_DO; + while ((c = getopt_long(argc, argv, "cs", long_options, NULL)) != -1) { + switch (c) { + case 'c': + if (cmd_type != POOL_INITIALIZE_DO) { + (void) fprintf(stderr, gettext("-c cannot be " + "combined with other options\n")); + usage(B_FALSE); + } + cmd_type = POOL_INITIALIZE_CANCEL; + break; + case 's': + if (cmd_type != POOL_INITIALIZE_DO) { + (void) fprintf(stderr, gettext("-s cannot be " + "combined with other options\n")); + usage(B_FALSE); + } + cmd_type = POOL_INITIALIZE_SUSPEND; + break; + case '?': + if (optopt != 0) { + (void) fprintf(stderr, + gettext("invalid option '%c'\n"), optopt); + } else { + (void) fprintf(stderr, + gettext("invalid option '%s'\n"), + argv[optind - 1]); + } + usage(B_FALSE); + } + } + + argc -= optind; + argv += optind; + + if (argc < 1) { + (void) fprintf(stderr, gettext("missing pool name argument\n")); + usage(B_FALSE); + return (-1); + } + + poolname = argv[0]; + zhp = zpool_open(g_zfs, poolname); + if (zhp == NULL) + return (-1); + + vdevs = fnvlist_alloc(); + if (argc == 1) { + /* no individual leaf vdevs specified, so add them all */ + nvlist_t *config = zpool_get_config(zhp, NULL); + nvlist_t *nvroot = fnvlist_lookup_nvlist(config, + ZPOOL_CONFIG_VDEV_TREE); + zpool_collect_leaves(zhp, nvroot, vdevs); + } else { + int i; + for (i = 1; i < argc; i++) { + fnvlist_add_boolean(vdevs, argv[i]); + } + } + + err = zpool_initialize(zhp, cmd_type, vdevs); + + fnvlist_free(vdevs); + zpool_close(zhp); + + return (err); } typedef struct status_cbdata { Modified: head/cddl/contrib/opensolaris/cmd/ztest/ztest.c ============================================================================== --- head/cddl/contrib/opensolaris/cmd/ztest/ztest.c Tue Jul 31 21:02:45 2018 (r337006) +++ head/cddl/contrib/opensolaris/cmd/ztest/ztest.c Tue Jul 31 21:06:04 2018 (r337007) @@ -104,6 +104,7 @@ #include <sys/zil_impl.h> #include <sys/vdev_impl.h> #include <sys/vdev_file.h> +#include <sys/vdev_initialize.h> #include <sys/spa_impl.h> #include <sys/metaslab_impl.h> #include <sys/dsl_prop.h> @@ -348,6 +349,7 @@ ztest_func_t ztest_spa_upgrade; ztest_func_t ztest_device_removal; ztest_func_t ztest_remap_blocks; ztest_func_t ztest_spa_checkpoint_create_discard; +ztest_func_t ztest_initialize; uint64_t zopt_always = 0ULL * NANOSEC; /* all the time */ uint64_t zopt_incessant = 1ULL * NANOSEC / 10; /* every 1/10 second */ @@ -391,7 +393,8 @@ ztest_info_t ztest_info[] = { &ztest_opts.zo_vdevtime }, { ztest_device_removal, 1, &zopt_sometimes }, { ztest_remap_blocks, 1, &zopt_sometimes }, - { ztest_spa_checkpoint_create_discard, 1, &zopt_rarely } + { ztest_spa_checkpoint_create_discard, 1, &zopt_rarely }, + { ztest_initialize, 1, &zopt_sometimes } }; #define ZTEST_FUNCS (sizeof (ztest_info) / sizeof (ztest_info_t)) @@ -5469,6 +5472,97 @@ ztest_spa_rename(ztest_ds_t *zd, uint64_t id) umem_free(newname, strlen(newname) + 1); rw_exit(&ztest_name_lock); +} + +static vdev_t * +ztest_random_concrete_vdev_leaf(vdev_t *vd) +{ + if (vd == NULL) + return (NULL); + + if (vd->vdev_children == 0) + return (vd); + + vdev_t *eligible[vd->vdev_children]; + int eligible_idx = 0, i; + for (i = 0; i < vd->vdev_children; i++) { + vdev_t *cvd = vd->vdev_child[i]; + if (cvd->vdev_top->vdev_removing) + continue; + if (cvd->vdev_children > 0 || + (vdev_is_concrete(cvd) && !cvd->vdev_detached)) { + eligible[eligible_idx++] = cvd; + } + } + VERIFY(eligible_idx > 0); + + uint64_t child_no = ztest_random(eligible_idx); + return (ztest_random_concrete_vdev_leaf(eligible[child_no])); +} + +/* ARGSUSED */ +void +ztest_initialize(ztest_ds_t *zd, uint64_t id) +{ + spa_t *spa = ztest_spa; + int error = 0; + + mutex_enter(&ztest_vdev_lock); + + spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER); + + /* Random leaf vdev */ + vdev_t *rand_vd = ztest_random_concrete_vdev_leaf(spa->spa_root_vdev); + if (rand_vd == NULL) { + spa_config_exit(spa, SCL_VDEV, FTAG); + mutex_exit(&ztest_vdev_lock); + return; + } + + /* + * The random vdev we've selected may change as soon as we + * drop the spa_config_lock. We create local copies of things + * we're interested in. + */ + uint64_t guid = rand_vd->vdev_guid; + char *path = strdup(rand_vd->vdev_path); + boolean_t active = rand_vd->vdev_initialize_thread != NULL; + + zfs_dbgmsg("vd %p, guid %llu", rand_vd, guid); + spa_config_exit(spa, SCL_VDEV, FTAG); + + uint64_t cmd = ztest_random(POOL_INITIALIZE_FUNCS); + error = spa_vdev_initialize(spa, guid, cmd); + switch (cmd) { + case POOL_INITIALIZE_CANCEL: + if (ztest_opts.zo_verbose >= 4) { + (void) printf("Cancel initialize %s", path); + if (!active) + (void) printf(" failed (no initialize active)"); + (void) printf("\n"); + } + break; + case POOL_INITIALIZE_DO: + if (ztest_opts.zo_verbose >= 4) { + (void) printf("Start initialize %s", path); + if (active && error == 0) + (void) printf(" failed (already active)"); + else if (error != 0) + (void) printf(" failed (error %d)", error); + (void) printf("\n"); + } + break; + case POOL_INITIALIZE_SUSPEND: + if (ztest_opts.zo_verbose >= 4) { + (void) printf("Suspend initialize %s", path); + if (!active) + (void) printf(" failed (no initialize active)"); + (void) printf("\n"); + } + break; + } + free(path); + mutex_exit(&ztest_vdev_lock); } /* Modified: head/cddl/contrib/opensolaris/lib/libzfs/common/libzfs.h ============================================================================== --- head/cddl/contrib/opensolaris/lib/libzfs/common/libzfs.h Tue Jul 31 21:02:45 2018 (r337006) +++ head/cddl/contrib/opensolaris/lib/libzfs/common/libzfs.h Tue Jul 31 21:06:04 2018 (r337007) @@ -137,6 +137,9 @@ typedef enum zfs_error { EZFS_NO_CHECKPOINT, /* pool has no checkpoint */ EZFS_DEVRM_IN_PROGRESS, /* a device is currently being removed */ EZFS_VDEV_TOO_BIG, /* a device is too big to be used */ + EZFS_TOOMANY, /* argument list too long */ + EZFS_INITIALIZING, /* currently initializing */ + EZFS_NO_INITIALIZE, /* no active initialize */ EZFS_UNKNOWN } zfs_error_t; @@ -262,6 +265,8 @@ typedef struct splitflags { * Functions to manipulate pool and vdev state */ extern int zpool_scan(zpool_handle_t *, pool_scan_func_t, pool_scrub_cmd_t); +extern int zpool_initialize(zpool_handle_t *, pool_initialize_func_t, + nvlist_t *); extern int zpool_clear(zpool_handle_t *, const char *, nvlist_t *); extern int zpool_reguid(zpool_handle_t *); extern int zpool_reopen(zpool_handle_t *); Modified: head/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c ============================================================================== --- head/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c Tue Jul 31 21:02:45 2018 (r337006) +++ head/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c Tue Jul 31 21:06:04 2018 (r337007) @@ -1981,6 +1981,100 @@ zpool_scan(zpool_handle_t *zhp, pool_scan_func_t func, } } +static int +xlate_init_err(int err) +{ + switch (err) { + case ENODEV: + return (EZFS_NODEVICE); + case EINVAL: + case EROFS: + return (EZFS_BADDEV); + case EBUSY: + return (EZFS_INITIALIZING); + case ESRCH: + return (EZFS_NO_INITIALIZE); + } + return (err); +} + +/* + * Begin, suspend, or cancel the initialization (initializing of all free + * blocks) for the given vdevs in the given pool. + */ +int +zpool_initialize(zpool_handle_t *zhp, pool_initialize_func_t cmd_type, + nvlist_t *vds) +{ + char msg[1024]; + libzfs_handle_t *hdl = zhp->zpool_hdl; + + nvlist_t *errlist; + + /* translate vdev names to guids */ + nvlist_t *vdev_guids = fnvlist_alloc(); + nvlist_t *guids_to_paths = fnvlist_alloc(); + boolean_t spare, cache; + nvlist_t *tgt; + nvpair_t *elem; + + for (elem = nvlist_next_nvpair(vds, NULL); elem != NULL; + elem = nvlist_next_nvpair(vds, elem)) { + char *vd_path = nvpair_name(elem); + tgt = zpool_find_vdev(zhp, vd_path, &spare, &cache, NULL); + + if ((tgt == NULL) || cache || spare) { + (void) snprintf(msg, sizeof (msg), + dgettext(TEXT_DOMAIN, "cannot initialize '%s'"), + vd_path); + int err = (tgt == NULL) ? EZFS_NODEVICE : + (spare ? EZFS_ISSPARE : EZFS_ISL2CACHE); + fnvlist_free(vdev_guids); + fnvlist_free(guids_to_paths); + return (zfs_error(hdl, err, msg)); + } + + uint64_t guid = fnvlist_lookup_uint64(tgt, ZPOOL_CONFIG_GUID); + fnvlist_add_uint64(vdev_guids, vd_path, guid); + + (void) snprintf(msg, sizeof (msg), "%llu", guid); + fnvlist_add_string(guids_to_paths, msg, vd_path); + } + + int err = lzc_initialize(zhp->zpool_name, cmd_type, vdev_guids, + &errlist); + fnvlist_free(vdev_guids); + + if (err == 0) { + fnvlist_free(guids_to_paths); + return (0); + } + + nvlist_t *vd_errlist = NULL; + if (errlist != NULL) { + vd_errlist = fnvlist_lookup_nvlist(errlist, + ZPOOL_INITIALIZE_VDEVS); + } + + (void) snprintf(msg, sizeof (msg), + dgettext(TEXT_DOMAIN, "operation failed")); + + for (elem = nvlist_next_nvpair(vd_errlist, NULL); elem != NULL; + elem = nvlist_next_nvpair(vd_errlist, elem)) { + int64_t vd_error = xlate_init_err(fnvpair_value_int64(elem)); + char *path = fnvlist_lookup_string(guids_to_paths, + nvpair_name(elem)); + (void) zfs_error_fmt(hdl, vd_error, "cannot initialize '%s'", + path); + } + + fnvlist_free(guids_to_paths); + if (vd_errlist != NULL) + return (-1); + + return (zpool_standard_error(hdl, err, msg)); +} + #ifdef illumos /* * This provides a very minimal check whether a given string is likely a Modified: head/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_util.c ============================================================================== --- head/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_util.c Tue Jul 31 21:02:45 2018 (r337006) +++ head/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_util.c Tue Jul 31 21:06:04 2018 (r337007) @@ -254,6 +254,13 @@ libzfs_error_description(libzfs_handle_t *hdl) return (dgettext(TEXT_DOMAIN, "device removal in progress")); case EZFS_VDEV_TOO_BIG: return (dgettext(TEXT_DOMAIN, "device exceeds supported size")); + case EZFS_TOOMANY: + return (dgettext(TEXT_DOMAIN, "argument list too long")); + case EZFS_INITIALIZING: + return (dgettext(TEXT_DOMAIN, "currently initializing")); + case EZFS_NO_INITIALIZE: + return (dgettext(TEXT_DOMAIN, "there is no active " + "initialization")); case EZFS_UNKNOWN: return (dgettext(TEXT_DOMAIN, "unknown error")); default: Modified: head/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.c ============================================================================== --- head/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.c Tue Jul 31 21:02:45 2018 (r337006) +++ head/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.c Tue Jul 31 21:06:04 2018 (r337007) @@ -1085,3 +1085,40 @@ lzc_channel_program_nosync(const char *pool, const cha return (lzc_channel_program_impl(pool, program, B_FALSE, timeout, memlimit, argnvl, outnvl)); } + +/* + * Changes initializing state. + * + * vdevs should be a list of (<key>, guid) where guid is a uint64 vdev GUID. + * The key is ignored. + * + * If there are errors related to vdev arguments, per-vdev errors are returned + * in an nvlist with the key "vdevs". Each error is a (guid, errno) pair where + * guid is stringified with PRIu64, and errno is one of the following as + * an int64_t: + * - ENODEV if the device was not found + * - EINVAL if the devices is not a leaf or is not concrete (e.g. missing) + * - EROFS if the device is not writeable + * - EBUSY start requested but the device is already being initialized + * - ESRCH cancel/suspend requested but device is not being initialized + * + * If the errlist is empty, then return value will be: + * - EINVAL if one or more arguments was invalid + * - Other spa_open failures + * - 0 if the operation succeeded + */ +int +lzc_initialize(const char *poolname, pool_initialize_func_t cmd_type, + nvlist_t *vdevs, nvlist_t **errlist) +{ + int error; + nvlist_t *args = fnvlist_alloc(); + fnvlist_add_uint64(args, ZPOOL_INITIALIZE_COMMAND, (uint64_t)cmd_type); + fnvlist_add_nvlist(args, ZPOOL_INITIALIZE_VDEVS, vdevs); + + error = lzc_ioctl(ZFS_IOC_POOL_INITIALIZE, poolname, args, errlist); + + fnvlist_free(args); + + return (error); +} Modified: head/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.h ============================================================================== --- head/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.h Tue Jul 31 21:02:45 2018 (r337006) +++ head/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.h Tue Jul 31 21:06:04 2018 (r337007) @@ -31,7 +31,9 @@ #include <libnvpair.h> #include <sys/param.h> #include <sys/types.h> +#include <sys/fs/zfs.h> + #ifdef __cplusplus extern "C" { #endif @@ -56,6 +58,8 @@ int lzc_destroy_snaps(nvlist_t *, boolean_t, nvlist_t int lzc_bookmark(nvlist_t *, nvlist_t **); int lzc_get_bookmarks(const char *, nvlist_t *, nvlist_t **); int lzc_destroy_bookmarks(nvlist_t *, nvlist_t **); +int lzc_initialize(const char *, pool_initialize_func_t, nvlist_t *, + nvlist_t **); int lzc_snaprange_space(const char *, const char *, uint64_t *); Modified: head/sys/cddl/contrib/opensolaris/uts/common/Makefile.files ============================================================================== --- head/sys/cddl/contrib/opensolaris/uts/common/Makefile.files Tue Jul 31 21:02:45 2018 (r337006) +++ head/sys/cddl/contrib/opensolaris/uts/common/Makefile.files Tue Jul 31 21:06:04 2018 (r337007) @@ -124,6 +124,7 @@ ZFS_COMMON_OBJS += \ vdev_indirect.o \ vdev_indirect_births.o \ vdev_indirect_mapping.o \ + vdev_initialize.o \ vdev_label.o \ vdev_mirror.o \ vdev_missing.o \ Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c ============================================================================== --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c Tue Jul 31 21:02:45 2018 (r337006) +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c Tue Jul 31 21:06:04 2018 (r337007) @@ -725,6 +725,8 @@ metaslab_group_create(metaslab_class_t *mc, vdev_t *vd mg = kmem_zalloc(sizeof (metaslab_group_t), KM_SLEEP); mutex_init(&mg->mg_lock, NULL, MUTEX_DEFAULT, NULL); + mutex_init(&mg->mg_ms_initialize_lock, NULL, MUTEX_DEFAULT, NULL); + cv_init(&mg->mg_ms_initialize_cv, NULL, CV_DEFAULT, NULL); mg->mg_primaries = kmem_zalloc(allocators * sizeof (metaslab_t *), KM_SLEEP); mg->mg_secondaries = kmem_zalloc(allocators * sizeof (metaslab_t *), @@ -771,6 +773,8 @@ metaslab_group_destroy(metaslab_group_t *mg) kmem_free(mg->mg_secondaries, mg->mg_allocators * sizeof (metaslab_t *)); mutex_destroy(&mg->mg_lock); + mutex_destroy(&mg->mg_ms_initialize_lock); + cv_destroy(&mg->mg_ms_initialize_cv); for (int i = 0; i < mg->mg_allocators; i++) { refcount_destroy(&mg->mg_alloc_queue_depth[i]); @@ -1554,6 +1558,7 @@ metaslab_init(metaslab_group_t *mg, uint64_t id, uint6 mutex_init(&ms->ms_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&ms->ms_sync_lock, NULL, MUTEX_DEFAULT, NULL); cv_init(&ms->ms_load_cv, NULL, CV_DEFAULT, NULL); + ms->ms_id = id; ms->ms_start = id << vd->vdev_ms_shift; ms->ms_size = 1ULL << vd->vdev_ms_shift; @@ -2731,6 +2736,7 @@ metaslab_sync_done(metaslab_t *msp, uint64_t txg) * from it in 'metaslab_unload_delay' txgs, then unload it. */ if (msp->ms_loaded && + msp->ms_initializing == 0 && msp->ms_selected_txg + metaslab_unload_delay < txg) { for (int t = 1; t < TXG_CONCURRENT_STATES; t++) { VERIFY0(range_tree_space( @@ -2980,6 +2986,7 @@ metaslab_block_alloc(metaslab_t *msp, uint64_t size, u metaslab_class_t *mc = msp->ms_group->mg_class; VERIFY(!msp->ms_condensing); + VERIFY0(msp->ms_initializing); start = mc->mc_ops->msop_alloc(msp, size); if (start != -1ULL) { @@ -3040,9 +3047,10 @@ find_valid_metaslab(metaslab_group_t *mg, uint64_t act } /* - * If the selected metaslab is condensing, skip it. + * If the selected metaslab is condensing or being + * initialized, skip it. */ - if (msp->ms_condensing) + if (msp->ms_condensing || msp->ms_initializing > 0) continue; *was_active = msp->ms_allocator != -1; @@ -3207,11 +3215,20 @@ metaslab_group_alloc_normal(metaslab_group_t *mg, zio_ /* * If this metaslab is currently condensing then pick again as * we can't manipulate this metaslab until it's committed - * to disk. + * to disk. If this metaslab is being initialized, we shouldn't + * allocate from it since the allocated region might be + * overwritten after allocation. */ if (msp->ms_condensing) { metaslab_trace_add(zal, mg, msp, asize, d, TRACE_CONDENSING, allocator); + metaslab_passivate(msp, msp->ms_weight & + ~METASLAB_ACTIVE_MASK); + mutex_exit(&msp->ms_lock); + continue; + } else if (msp->ms_initializing > 0) { + metaslab_trace_add(zal, mg, msp, asize, d, + TRACE_INITIALIZING, allocator); metaslab_passivate(msp, msp->ms_weight & ~METASLAB_ACTIVE_MASK); mutex_exit(&msp->ms_lock); Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c ============================================================================== --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c Tue Jul 31 21:02:45 2018 (r337006) +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c Tue Jul 31 21:06:04 2018 (r337007) @@ -55,6 +55,7 @@ #include <sys/vdev_removal.h> #include <sys/vdev_indirect_mapping.h> #include <sys/vdev_indirect_births.h> +#include <sys/vdev_initialize.h> #include <sys/metaslab.h> #include <sys/metaslab_impl.h> #include <sys/uberblock_impl.h> @@ -443,8 +444,9 @@ spa_prop_get(spa_t *spa, nvlist_t **nvp) dp = spa_get_dsl(spa); dsl_pool_config_enter(dp, FTAG); - if (err = dsl_dataset_hold_obj(dp, - za.za_first_integer, FTAG, &ds)) { + err = dsl_dataset_hold_obj(dp, + za.za_first_integer, FTAG, &ds); + if (err != 0) { dsl_pool_config_exit(dp, FTAG); break; } @@ -599,7 +601,8 @@ spa_prop_validate(spa_t *spa, nvlist_t *props) break; } - if (error = dmu_objset_hold(strval, FTAG, &os)) + error = dmu_objset_hold(strval, FTAG, &os); + if (error != 0) break; /* @@ -1215,8 +1218,10 @@ spa_activate(spa_t *spa, int mode) */ trim_thread_create(spa); - for (size_t i = 0; i < TXG_SIZE; i++) - spa->spa_txg_zio[i] = zio_root(spa, NULL, NULL, 0); + for (size_t i = 0; i < TXG_SIZE; i++) { + spa->spa_txg_zio[i] = zio_root(spa, NULL, NULL, + ZIO_FLAG_CANFAIL); + } list_create(&spa->spa_config_dirty_list, sizeof (vdev_t), offsetof(vdev_t, vdev_config_dirty_node)); @@ -1388,6 +1393,11 @@ spa_unload(spa_t *spa) */ spa_async_suspend(spa); + if (spa->spa_root_vdev) { + vdev_initialize_stop_all(spa->spa_root_vdev, + VDEV_INITIALIZE_ACTIVE); + } + /* * Stop syncing. */ @@ -1403,10 +1413,10 @@ spa_unload(spa_t *spa) * calling taskq_wait(mg_taskq). */ if (spa->spa_root_vdev != NULL) { - spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER); + spa_config_enter(spa, SCL_ALL, spa, RW_WRITER); for (int c = 0; c < spa->spa_root_vdev->vdev_children; c++) vdev_metaslab_fini(spa->spa_root_vdev->vdev_child[c]); - spa_config_exit(spa, SCL_ALL, FTAG); + spa_config_exit(spa, SCL_ALL, spa); } /* @@ -1440,7 +1450,7 @@ spa_unload(spa_t *spa) bpobj_close(&spa->spa_deferred_bpobj); - spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER); + spa_config_enter(spa, SCL_ALL, spa, RW_WRITER); /* * Close all vdevs. @@ -1502,7 +1512,7 @@ spa_unload(spa_t *spa) spa->spa_comment = NULL; } - spa_config_exit(spa, SCL_ALL, FTAG); + spa_config_exit(spa, SCL_ALL, spa); } /* @@ -3954,6 +3964,10 @@ spa_load_impl(spa_t *spa, spa_import_type_t type, char spa_restart_removal(spa); spa_spawn_aux_threads(spa); + + spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER); + vdev_initialize_restart(spa->spa_root_vdev); + spa_config_exit(spa, SCL_CONFIG, FTAG); } spa_load_note(spa, "LOADED"); @@ -5675,6 +5689,7 @@ spa_export_common(char *pool, int new_state, nvlist_t * in which case we can modify its state. */ if (spa->spa_state != POOL_STATE_UNINITIALIZED && spa->spa_sync_on) { + /* * Objsets may be open only because they're dirty, so we * have to force it to sync before checking spa_refcnt. @@ -5709,6 +5724,18 @@ spa_export_common(char *pool, int new_state, nvlist_t } /* + * We're about to export or destroy this pool. Make sure + * we stop all initializtion activity here before we + * set the spa_final_txg. This will ensure that all + * dirty data resulting from the initialization is + * committed to disk before we unload the pool. + */ + if (spa->spa_root_vdev != NULL) { + vdev_initialize_stop_all(spa->spa_root_vdev, + VDEV_INITIALIZE_ACTIVE); + } + + /* * We want this to be reflected on every label, * so mark them all dirty. spa_unload() will do the * final sync that pushes these changes out. @@ -6398,6 +6425,86 @@ spa_vdev_detach(spa_t *spa, uint64_t guid, uint64_t pg return (error); } +int +spa_vdev_initialize(spa_t *spa, uint64_t guid, uint64_t cmd_type) +{ + /* + * We hold the namespace lock through the whole function + * to prevent any changes to the pool while we're starting or + * stopping initialization. The config and state locks are held so that + * we can properly assess the vdev state before we commit to + * the initializing operation. + */ + mutex_enter(&spa_namespace_lock); + spa_config_enter(spa, SCL_CONFIG | SCL_STATE, FTAG, RW_READER); + + /* Look up vdev and ensure it's a leaf. */ + vdev_t *vd = spa_lookup_by_guid(spa, guid, B_FALSE); + if (vd == NULL || vd->vdev_detached) { + spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG); + mutex_exit(&spa_namespace_lock); + return (SET_ERROR(ENODEV)); + } else if (!vd->vdev_ops->vdev_op_leaf || !vdev_is_concrete(vd)) { + spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG); + mutex_exit(&spa_namespace_lock); + return (SET_ERROR(EINVAL)); + } else if (!vdev_writeable(vd)) { + spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG); + mutex_exit(&spa_namespace_lock); + return (SET_ERROR(EROFS)); + } + mutex_enter(&vd->vdev_initialize_lock); + spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG); + + /* + * When we activate an initialize action we check to see + * if the vdev_initialize_thread is NULL. We do this instead + * of using the vdev_initialize_state since there might be + * a previous initialization process which has completed but + * the thread is not exited. + */ + if (cmd_type == POOL_INITIALIZE_DO && + (vd->vdev_initialize_thread != NULL || + vd->vdev_top->vdev_removing)) { + mutex_exit(&vd->vdev_initialize_lock); + mutex_exit(&spa_namespace_lock); + return (SET_ERROR(EBUSY)); + } else if (cmd_type == POOL_INITIALIZE_CANCEL && + (vd->vdev_initialize_state != VDEV_INITIALIZE_ACTIVE && + vd->vdev_initialize_state != VDEV_INITIALIZE_SUSPENDED)) { + mutex_exit(&vd->vdev_initialize_lock); + mutex_exit(&spa_namespace_lock); + return (SET_ERROR(ESRCH)); + } else if (cmd_type == POOL_INITIALIZE_SUSPEND && + vd->vdev_initialize_state != VDEV_INITIALIZE_ACTIVE) { + mutex_exit(&vd->vdev_initialize_lock); + mutex_exit(&spa_namespace_lock); + return (SET_ERROR(ESRCH)); + } + + switch (cmd_type) { + case POOL_INITIALIZE_DO: + vdev_initialize(vd); + break; + case POOL_INITIALIZE_CANCEL: + vdev_initialize_stop(vd, VDEV_INITIALIZE_CANCELED); + break; + case POOL_INITIALIZE_SUSPEND: + vdev_initialize_stop(vd, VDEV_INITIALIZE_SUSPENDED); + break; + default: + panic("invalid cmd_type %llu", (unsigned long long)cmd_type); + } + mutex_exit(&vd->vdev_initialize_lock); + + /* Sync out the initializing state */ + txg_wait_synced(spa->spa_dsl_pool, 0); + mutex_exit(&spa_namespace_lock); + + return (0); +} + + /* * Split a set of devices from their mirrors, and create a new pool from them. */ @@ -6605,6 +6712,19 @@ spa_vdev_split_mirror(spa_t *spa, char *newname, nvlis spa_activate(newspa, spa_mode_global); spa_async_suspend(newspa); + for (c = 0; c < children; c++) { + if (vml[c] != NULL) { + /* + * Temporarily stop the initializing activity. We set + * the state to ACTIVE so that we know to resume + * the initializing once the split has completed. + */ + mutex_enter(&vml[c]->vdev_initialize_lock); + vdev_initialize_stop(vml[c], VDEV_INITIALIZE_ACTIVE); + mutex_exit(&vml[c]->vdev_initialize_lock); + } + } + #ifndef illumos /* mark that we are creating new spa by splitting */ newspa->spa_splitting_newspa = B_TRUE; @@ -6699,6 +6819,10 @@ out: if (vml[c] != NULL) vml[c]->vdev_offline = B_FALSE; } + + /* restart initializing disks as necessary */ + spa_async_request(spa, SPA_ASYNC_INITIALIZE_RESTART); + vdev_reopen(spa->spa_root_vdev); nvlist_free(spa->spa_config_splitting); @@ -7063,6 +7187,14 @@ spa_async_thread(void *arg) if (tasks & SPA_ASYNC_RESILVER) dsl_resilver_restart(spa->spa_dsl_pool, 0); + if (tasks & SPA_ASYNC_INITIALIZE_RESTART) { + mutex_enter(&spa_namespace_lock); + spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER); + vdev_initialize_restart(spa->spa_root_vdev); + spa_config_exit(spa, SCL_CONFIG, FTAG); + mutex_exit(&spa_namespace_lock); + } + /* * Let the world know that we're done. */ @@ -7762,8 +7894,9 @@ spa_sync(spa_t *spa, uint64_t txg) * Wait for i/os issued in open context that need to complete * before this txg syncs. */ - VERIFY0(zio_wait(spa->spa_txg_zio[txg & TXG_MASK])); - spa->spa_txg_zio[txg & TXG_MASK] = zio_root(spa, NULL, NULL, 0); + (void) zio_wait(spa->spa_txg_zio[txg & TXG_MASK]); + spa->spa_txg_zio[txg & TXG_MASK] = zio_root(spa, NULL, NULL, + ZIO_FLAG_CANFAIL); /* * Lock out configuration changes. @@ -8065,7 +8198,8 @@ spa_sync(spa_t *spa, uint64_t txg) /* * Update usable space statistics. */ - while (vd = txg_list_remove(&spa->spa_vdev_txg_list, TXG_CLEAN(txg))) + while ((vd = txg_list_remove(&spa->spa_vdev_txg_list, TXG_CLEAN(txg))) + != NULL) vdev_sync_done(vd, txg); spa_update_dspace(spa); Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c ============================================================================== --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c Tue Jul 31 21:02:45 2018 (r337006) +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c Tue Jul 31 21:06:04 2018 (r337007) @@ -41,6 +41,7 @@ #include <sys/zil.h> #include <sys/vdev_impl.h> #include <sys/vdev_file.h> +#include <sys/vdev_initialize.h> #include <sys/metaslab.h> #include <sys/uberblock_impl.h> #include <sys/txg.h> @@ -1313,6 +1314,12 @@ spa_vdev_config_exit(spa_t *spa, vdev_t *vd, uint64_t if (vd != NULL) { ASSERT(!vd->vdev_detached || vd->vdev_dtl_sm == NULL); + if (vd->vdev_ops->vdev_op_leaf) { + mutex_enter(&vd->vdev_initialize_lock); + vdev_initialize_stop(vd, VDEV_INITIALIZE_CANCELED); + mutex_exit(&vd->vdev_initialize_lock); + } + spa_config_enter(spa, SCL_ALL, spa, RW_WRITER); vdev_free(vd); spa_config_exit(spa, SCL_ALL, spa); Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h ============================================================================== --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h Tue Jul 31 21:02:45 2018 (r337006) +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h Tue Jul 31 21:06:04 2018 (r337007) @@ -68,7 +68,8 @@ typedef enum trace_alloc_type { TRACE_GROUP_FAILURE = -5ULL, TRACE_ENOSPC = -6ULL, TRACE_CONDENSING = -7ULL, - TRACE_VDEV_ERROR = -8ULL + TRACE_VDEV_ERROR = -8ULL, + TRACE_INITIALIZING = -9ULL } trace_alloc_type_t; #define METASLAB_WEIGHT_PRIMARY (1ULL << 63) @@ -271,6 +272,11 @@ struct metaslab_group { uint64_t mg_failed_allocations; uint64_t mg_fragmentation; uint64_t mg_histogram[RANGE_TREE_HISTOGRAM_SIZE]; + + int mg_ms_initializing; + boolean_t mg_initialize_updating; + kmutex_t mg_ms_initialize_lock; + kcondvar_t mg_ms_initialize_cv; }; /* @@ -360,6 +366,8 @@ struct metaslab { boolean_t ms_condensing; /* condensing? */ boolean_t ms_condense_wanted; uint64_t ms_condense_checked_txg; + + uint64_t ms_initializing; /* leaves initializing this ms */ *** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?201807312106.w6VL65J6038222>