From: Andriy Gapon
Date: Wed, 6 Nov 2019 09:00:06 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-vendor@freebsd.org
Subject: svn commit: r354383 - vendor-sys/illumos/dist/uts/common/fs/zfs vendor-sys/illumos/dist/uts/common/fs/zfs/sys vendor/illumos/dist/cmd/zdb vendor/illumos/dist/man/man1m

Author: avg
Date: Wed Nov 6 09:00:06 2019
New Revision: 354383
URL: https://svnweb.freebsd.org/changeset/base/354383

Log:
  10592 misc. metaslab and vdev related ZoL bug fixes

  illumos/illumos-gate@555d674d5d4b8191dc83723188349d28278b2431
  https://github.com/illumos/illumos-gate/commit/555d674d5d4b8191dc83723188349d28278b2431

  https://www.illumos.org/issues/10592
    This is a collection of recent fixes from ZoL:
    8eef997679b Error path in metaslab_load_impl() forgets to drop ms_sync_lock
    928e8ad47d3 Introduce auxiliary metaslab histograms
    425d3237ee8 Get rid of space_map_update() for ms_synced_length
    6c926f426a2 Simplify log vdev removal code
    21e7cf5da89 zdb -L should skip leak detection altogether
    df72b8bebe0 Rename range_tree_verify to range_tree_verify_not_present
    75058f33034 Remove unused vdev_t fields

  Portions contributed by: Jerry Jelinek
  Author: Serapheim Dimitropoulos

Modified:
  vendor-sys/illumos/dist/uts/common/fs/zfs/metaslab.c
  vendor-sys/illumos/dist/uts/common/fs/zfs/range_tree.c
  vendor-sys/illumos/dist/uts/common/fs/zfs/spa_checkpoint.c
  vendor-sys/illumos/dist/uts/common/fs/zfs/space_map.c
  vendor-sys/illumos/dist/uts/common/fs/zfs/sys/metaslab.h
  vendor-sys/illumos/dist/uts/common/fs/zfs/sys/metaslab_impl.h
  vendor-sys/illumos/dist/uts/common/fs/zfs/sys/range_tree.h
  vendor-sys/illumos/dist/uts/common/fs/zfs/sys/space_map.h
  vendor-sys/illumos/dist/uts/common/fs/zfs/sys/vdev_impl.h
  vendor-sys/illumos/dist/uts/common/fs/zfs/vdev.c
  vendor-sys/illumos/dist/uts/common/fs/zfs/vdev_indirect.c
  vendor-sys/illumos/dist/uts/common/fs/zfs/vdev_indirect_mapping.c
  vendor-sys/illumos/dist/uts/common/fs/zfs/vdev_initialize.c
  vendor-sys/illumos/dist/uts/common/fs/zfs/vdev_removal.c

Changes in other areas also in this revision:
Modified:
  vendor/illumos/dist/cmd/zdb/zdb.c
  vendor/illumos/dist/man/man1m/zdb.1m

Modified: vendor-sys/illumos/dist/uts/common/fs/zfs/metaslab.c
==============================================================================
--- vendor-sys/illumos/dist/uts/common/fs/zfs/metaslab.c	Wed Nov 6 08:58:03 2019	(r354382)
+++ vendor-sys/illumos/dist/uts/common/fs/zfs/metaslab.c	Wed Nov 6 09:00:06 2019	(r354383)
@@ -489,45 +489,62 @@ metaslab_compare(const void *x1, const void *x2)
 	return (AVL_CMP(m1->ms_start, m2->ms_start));
 }
 
+uint64_t
+metaslab_allocated_space(metaslab_t *msp)
+{
+	return (msp->ms_allocated_space);
+}
+
 /*
  * Verify that the space accounting on disk matches the in-core range_trees.
  */
-void
+static void
 metaslab_verify_space(metaslab_t *msp, uint64_t txg)
 {
 	spa_t *spa = msp->ms_group->mg_vd->vdev_spa;
-	uint64_t allocated = 0;
+	uint64_t allocating = 0;
 	uint64_t sm_free_space, msp_free_space;
 
 	ASSERT(MUTEX_HELD(&msp->ms_lock));
+	ASSERT(!msp->ms_condensing);
 
 	if ((zfs_flags & ZFS_DEBUG_METASLAB_VERIFY) == 0)
 		return;
 
 	/*
 	 * We can only verify the metaslab space when we're called
-	 * from syncing context with a loaded metaslab that has an allocated
-	 * space map. Calling this in non-syncing context does not
-	 * provide a consistent view of the metaslab since we're performing
-	 * allocations in the future.
+	 * from syncing context with a loaded metaslab that has an
+	 * allocated space map. Calling this in non-syncing context
+	 * does not provide a consistent view of the metaslab since
+	 * we're performing allocations in the future.
 	 */
 	if (txg != spa_syncing_txg(spa) || msp->ms_sm == NULL ||
 	    !msp->ms_loaded)
 		return;
 
-	sm_free_space = msp->ms_size - space_map_allocated(msp->ms_sm) -
-	    space_map_alloc_delta(msp->ms_sm);
+	/*
+	 * Even though the smp_alloc field can get negative (e.g.
+	 * see vdev_checkpoint_sm), that should never be the case
+	 * when it come's to a metaslab's space map.
+ */ + ASSERT3S(space_map_allocated(msp->ms_sm), >=, 0); + sm_free_space = msp->ms_size - metaslab_allocated_space(msp); + /* - * Account for future allocations since we would have already - * deducted that space from the ms_freetree. + * Account for future allocations since we would have + * already deducted that space from the ms_allocatable. */ for (int t = 0; t < TXG_CONCURRENT_STATES; t++) { - allocated += + allocating += range_tree_space(msp->ms_allocating[(txg + t) & TXG_MASK]); } - msp_free_space = range_tree_space(msp->ms_allocatable) + allocated + + ASSERT3U(msp->ms_deferspace, ==, + range_tree_space(msp->ms_defer[0]) + + range_tree_space(msp->ms_defer[1])); + + msp_free_space = range_tree_space(msp->ms_allocatable) + allocating + msp->ms_deferspace + range_tree_space(msp->ms_freed); VERIFY3U(sm_free_space, ==, msp_free_space); @@ -832,6 +849,7 @@ metaslab_group_histogram_verify(metaslab_group_t *mg) for (int m = 0; m < vd->vdev_ms_count; m++) { metaslab_t *msp = vd->vdev_ms[m]; + ASSERT(msp != NULL); /* skip if not active or not a member */ if (msp->ms_sm == NULL || msp->ms_group != mg) @@ -1445,7 +1463,204 @@ metaslab_ops_t *zfs_metaslab_ops = &metaslab_df_ops; * ========================================================================== */ +static void +metaslab_aux_histograms_clear(metaslab_t *msp) +{ + /* + * Auxiliary histograms are only cleared when resetting them, + * which can only happen while the metaslab is loaded. + */ + ASSERT(msp->ms_loaded); + + bzero(msp->ms_synchist, sizeof (msp->ms_synchist)); + for (int t = 0; t < TXG_DEFER_SIZE; t++) + bzero(msp->ms_deferhist[t], sizeof (msp->ms_deferhist[t])); +} + +static void +metaslab_aux_histogram_add(uint64_t *histogram, uint64_t shift, + range_tree_t *rt) +{ + /* + * This is modeled after space_map_histogram_add(), so refer to that + * function for implementation details. We want this to work like + * the space map histogram, and not the range tree histogram, as we + * are essentially constructing a delta that will be later subtracted + * from the space map histogram. + */ + int idx = 0; + for (int i = shift; i < RANGE_TREE_HISTOGRAM_SIZE; i++) { + ASSERT3U(i, >=, idx + shift); + histogram[idx] += rt->rt_histogram[i] << (i - idx - shift); + + if (idx < SPACE_MAP_HISTOGRAM_SIZE - 1) { + ASSERT3U(idx + shift, ==, i); + idx++; + ASSERT3U(idx, <, SPACE_MAP_HISTOGRAM_SIZE); + } + } +} + /* + * Called at every sync pass that the metaslab gets synced. + * + * The reason is that we want our auxiliary histograms to be updated + * wherever the metaslab's space map histogram is updated. This way + * we stay consistent on which parts of the metaslab space map's + * histogram are currently not available for allocations (e.g because + * they are in the defer, freed, and freeing trees). + */ +static void +metaslab_aux_histograms_update(metaslab_t *msp) +{ + space_map_t *sm = msp->ms_sm; + ASSERT(sm != NULL); + + /* + * This is similar to the metaslab's space map histogram updates + * that take place in metaslab_sync(). The only difference is that + * we only care about segments that haven't made it into the + * ms_allocatable tree yet. 
+ */ + if (msp->ms_loaded) { + metaslab_aux_histograms_clear(msp); + + metaslab_aux_histogram_add(msp->ms_synchist, + sm->sm_shift, msp->ms_freed); + + for (int t = 0; t < TXG_DEFER_SIZE; t++) { + metaslab_aux_histogram_add(msp->ms_deferhist[t], + sm->sm_shift, msp->ms_defer[t]); + } + } + + metaslab_aux_histogram_add(msp->ms_synchist, + sm->sm_shift, msp->ms_freeing); +} + +/* + * Called every time we are done syncing (writing to) the metaslab, + * i.e. at the end of each sync pass. + * [see the comment in metaslab_impl.h for ms_synchist, ms_deferhist] + */ +static void +metaslab_aux_histograms_update_done(metaslab_t *msp, boolean_t defer_allowed) +{ + spa_t *spa = msp->ms_group->mg_vd->vdev_spa; + space_map_t *sm = msp->ms_sm; + + if (sm == NULL) { + /* + * We came here from metaslab_init() when creating/opening a + * pool, looking at a metaslab that hasn't had any allocations + * yet. + */ + return; + } + + /* + * This is similar to the actions that we take for the ms_freed + * and ms_defer trees in metaslab_sync_done(). + */ + uint64_t hist_index = spa_syncing_txg(spa) % TXG_DEFER_SIZE; + if (defer_allowed) { + bcopy(msp->ms_synchist, msp->ms_deferhist[hist_index], + sizeof (msp->ms_synchist)); + } else { + bzero(msp->ms_deferhist[hist_index], + sizeof (msp->ms_deferhist[hist_index])); + } + bzero(msp->ms_synchist, sizeof (msp->ms_synchist)); +} + +/* + * Ensure that the metaslab's weight and fragmentation are consistent + * with the contents of the histogram (either the range tree's histogram + * or the space map's depending whether the metaslab is loaded). + */ +static void +metaslab_verify_weight_and_frag(metaslab_t *msp) +{ + ASSERT(MUTEX_HELD(&msp->ms_lock)); + + if ((zfs_flags & ZFS_DEBUG_METASLAB_VERIFY) == 0) + return; + + /* see comment in metaslab_verify_unflushed_changes() */ + if (msp->ms_group == NULL) + return; + + /* + * Devices being removed always return a weight of 0 and leave + * fragmentation and ms_max_size as is - there is nothing for + * us to verify here. + */ + vdev_t *vd = msp->ms_group->mg_vd; + if (vd->vdev_removing) + return; + + /* + * If the metaslab is dirty it probably means that we've done + * some allocations or frees that have changed our histograms + * and thus the weight. + */ + for (int t = 0; t < TXG_SIZE; t++) { + if (txg_list_member(&vd->vdev_ms_list, msp, t)) + return; + } + + /* + * This verification checks that our in-memory state is consistent + * with what's on disk. If the pool is read-only then there aren't + * any changes and we just have the initially-loaded state. + */ + if (!spa_writeable(msp->ms_group->mg_vd->vdev_spa)) + return; + + /* some extra verification for in-core tree if you can */ + if (msp->ms_loaded) { + range_tree_stat_verify(msp->ms_allocatable); + VERIFY(space_map_histogram_verify(msp->ms_sm, + msp->ms_allocatable)); + } + + uint64_t weight = msp->ms_weight; + uint64_t was_active = msp->ms_weight & METASLAB_ACTIVE_MASK; + boolean_t space_based = WEIGHT_IS_SPACEBASED(msp->ms_weight); + uint64_t frag = msp->ms_fragmentation; + uint64_t max_segsize = msp->ms_max_size; + + msp->ms_weight = 0; + msp->ms_fragmentation = 0; + msp->ms_max_size = 0; + + /* + * This function is used for verification purposes. Regardless of + * whether metaslab_weight() thinks this metaslab should be active or + * not, we want to ensure that the actual weight (and therefore the + * value of ms_weight) would be the same if it was to be recalculated + * at this point. 
+ */ + msp->ms_weight = metaslab_weight(msp) | was_active; + + VERIFY3U(max_segsize, ==, msp->ms_max_size); + + /* + * If the weight type changed then there is no point in doing + * verification. Revert fields to their original values. + */ + if ((space_based && !WEIGHT_IS_SPACEBASED(msp->ms_weight)) || + (!space_based && WEIGHT_IS_SPACEBASED(msp->ms_weight))) { + msp->ms_fragmentation = frag; + msp->ms_weight = weight; + return; + } + + VERIFY3U(msp->ms_fragmentation, ==, frag); + VERIFY3U(msp->ms_weight, ==, weight); +} + +/* * Wait for any in-progress metaslab loads to complete. */ static void @@ -1466,47 +1681,94 @@ metaslab_load_impl(metaslab_t *msp) ASSERT(MUTEX_HELD(&msp->ms_lock)); ASSERT(msp->ms_loading); + ASSERT(!msp->ms_condensing); /* - * Nobody else can manipulate a loading metaslab, so it's now safe - * to drop the lock. This way we don't have to hold the lock while - * reading the spacemap from disk. + * We temporarily drop the lock to unblock other operations while we + * are reading the space map. Therefore, metaslab_sync() and + * metaslab_sync_done() can run at the same time as we do. + * + * metaslab_sync() can append to the space map while we are loading. + * Therefore we load only entries that existed when we started the + * load. Additionally, metaslab_sync_done() has to wait for the load + * to complete because there are potential races like metaslab_load() + * loading parts of the space map that are currently being appended + * by metaslab_sync(). If we didn't, the ms_allocatable would have + * entries that metaslab_sync_done() would try to re-add later. + * + * That's why before dropping the lock we remember the synced length + * of the metaslab and read up to that point of the space map, + * ignoring entries appended by metaslab_sync() that happen after we + * drop the lock. */ + uint64_t length = msp->ms_synced_length; mutex_exit(&msp->ms_lock); - /* - * If the space map has not been allocated yet, then treat - * all the space in the metaslab as free and add it to ms_allocatable. - */ if (msp->ms_sm != NULL) { - error = space_map_load(msp->ms_sm, msp->ms_allocatable, - SM_FREE); + error = space_map_load_length(msp->ms_sm, msp->ms_allocatable, + SM_FREE, length); } else { + /* + * The space map has not been allocated yet, so treat + * all the space in the metaslab as free and add it to the + * ms_allocatable tree. + */ range_tree_add(msp->ms_allocatable, msp->ms_start, msp->ms_size); } + /* + * We need to grab the ms_sync_lock to prevent metaslab_sync() from + * changing the ms_sm and the metaslab's range trees while we are + * about to use them and populate the ms_allocatable. The ms_lock + * is insufficient for this because metaslab_sync() doesn't hold + * the ms_lock while writing the ms_checkpointing tree to disk. + */ + mutex_enter(&msp->ms_sync_lock); mutex_enter(&msp->ms_lock); + ASSERT(!msp->ms_condensing); - if (error != 0) + if (error != 0) { + mutex_exit(&msp->ms_sync_lock); return (error); + } ASSERT3P(msp->ms_group, !=, NULL); msp->ms_loaded = B_TRUE; /* - * If the metaslab already has a spacemap, then we need to - * remove all segments from the defer tree; otherwise, the - * metaslab is completely empty and we can skip this. + * The ms_allocatable contains the segments that exist in the + * ms_defer trees [see ms_synced_length]. Thus we need to remove + * them from ms_allocatable as they will be added again in + * metaslab_sync_done(). 
*/ - if (msp->ms_sm != NULL) { - for (int t = 0; t < TXG_DEFER_SIZE; t++) { - range_tree_walk(msp->ms_defer[t], - range_tree_remove, msp->ms_allocatable); - } + for (int t = 0; t < TXG_DEFER_SIZE; t++) { + range_tree_walk(msp->ms_defer[t], + range_tree_remove, msp->ms_allocatable); } + + /* + * Call metaslab_recalculate_weight_and_sort() now that the + * metaslab is loaded so we get the metaslab's real weight. + * + * Unless this metaslab was created with older software and + * has not yet been converted to use segment-based weight, we + * expect the new weight to be better or equal to the weight + * that the metaslab had while it was not loaded. This is + * because the old weight does not take into account the + * consolidation of adjacent segments between TXGs. [see + * comment for ms_synchist and ms_deferhist[] for more info] + */ + uint64_t weight = msp->ms_weight; + metaslab_recalculate_weight_and_sort(msp); + if (!WEIGHT_IS_SPACEBASED(weight)) + ASSERT3U(weight, <=, msp->ms_weight); msp->ms_max_size = metaslab_block_maxsize(msp); + spa_t *spa = msp->ms_group->mg_vd->vdev_spa; + metaslab_verify_space(msp, spa_syncing_txg(spa)); + mutex_exit(&msp->ms_sync_lock); + return (0); } @@ -1523,6 +1785,7 @@ metaslab_load(metaslab_t *msp) if (msp->ms_loaded) return (0); VERIFY(!msp->ms_loading); + ASSERT(!msp->ms_condensing); msp->ms_loading = B_TRUE; int error = metaslab_load_impl(msp); @@ -1536,10 +1799,29 @@ void metaslab_unload(metaslab_t *msp) { ASSERT(MUTEX_HELD(&msp->ms_lock)); + + metaslab_verify_weight_and_frag(msp); + range_tree_vacate(msp->ms_allocatable, NULL, NULL); msp->ms_loaded = B_FALSE; + msp->ms_weight &= ~METASLAB_ACTIVE_MASK; msp->ms_max_size = 0; + + /* + * We explicitly recalculate the metaslab's weight based on its space + * map (as it is now not loaded). We want unload metaslabs to always + * have their weights calculated from the space map histograms, while + * loaded ones have it calculated from their in-core range tree + * [see metaslab_load()]. This way, the weight reflects the information + * available in-core, whether it is loaded or not + * + * If ms_group == NULL means that we came here from metaslab_fini(), + * at which point it doesn't make sense for us to do the recalculation + * and the sorting. + */ + if (msp->ms_group != NULL) + metaslab_recalculate_weight_and_sort(msp); } static void @@ -1579,6 +1861,13 @@ metaslab_init(metaslab_group_t *mg, uint64_t id, uint6 /* * We only open space map objects that already exist. All others * will be opened when we finally allocate an object for it. + * + * Note: + * When called from vdev_expand(), we can't call into the DMU as + * we are holding the spa_config_lock as a writer and we would + * deadlock [see relevant comment in vdev_metaslab_init()]. in + * that case, the object parameter is zero though, so we won't + * call into the DMU. */ if (object != 0) { error = space_map_open(&ms->ms_sm, mos, object, ms->ms_start, @@ -1590,14 +1879,17 @@ metaslab_init(metaslab_group_t *mg, uint64_t id, uint6 } ASSERT(ms->ms_sm != NULL); + ASSERT3S(space_map_allocated(ms->ms_sm), >=, 0); + ms->ms_allocated_space = space_map_allocated(ms->ms_sm); } /* - * We create the main range tree here, but we don't create the + * We create the ms_allocatable here, but we don't create the * other range trees until metaslab_sync_done(). This serves * two purposes: it allows metaslab_sync_done() to detect the - * addition of new space; and for debugging, it ensures that we'd - * data fault on any attempt to use this metaslab before it's ready. 
+ * addition of new space; and for debugging, it ensures that + * we'd data fault on any attempt to use this metaslab before + * it's ready. */ ms->ms_allocatable = range_tree_create(&metaslab_rt_ops, ms); metaslab_group_add(mg, ms); @@ -1613,8 +1905,11 @@ metaslab_init(metaslab_group_t *mg, uint64_t id, uint6 * out this txg. This ensures that we don't attempt to allocate * from it before we have initialized it completely. */ - if (txg <= TXG_INITIAL) + if (txg <= TXG_INITIAL) { metaslab_sync_done(ms, 0); + metaslab_space_update(vd, mg->mg_class, + metaslab_allocated_space(ms), 0, 0); + } /* * If metaslab_debug_load is set and we're initializing a metaslab @@ -1648,7 +1943,7 @@ metaslab_fini(metaslab_t *msp) mutex_enter(&msp->ms_lock); VERIFY(msp->ms_group == NULL); metaslab_space_update(vd, mg->mg_class, - -space_map_allocated(msp->ms_sm), 0, -msp->ms_size); + -metaslab_allocated_space(msp), 0, -msp->ms_size); space_map_close(msp->ms_sm); @@ -1669,6 +1964,9 @@ metaslab_fini(metaslab_t *msp) range_tree_destroy(msp->ms_checkpointing); + for (int t = 0; t < TXG_SIZE; t++) + ASSERT(!txg_list_member(&vd->vdev_ms_list, msp, t)); + mutex_exit(&msp->ms_lock); cv_destroy(&msp->ms_load_cv); mutex_destroy(&msp->ms_lock); @@ -1684,7 +1982,7 @@ metaslab_fini(metaslab_t *msp) * This table defines a segment size based fragmentation metric that will * allow each metaslab to derive its own fragmentation value. This is done * by calculating the space in each bucket of the spacemap histogram and - * multiplying that by the fragmetation metric in this table. Doing + * multiplying that by the fragmentation metric in this table. Doing * this for all buckets and dividing it by the total amount of free * space in this metaslab (i.e. the total free space in all buckets) gives * us the fragmentation metric. This means that a high fragmentation metric @@ -1719,10 +2017,10 @@ int zfs_frag_table[FRAGMENTATION_TABLE_SIZE] = { }; /* - * Calclate the metaslab's fragmentation metric. A return value - * of ZFS_FRAG_INVALID means that the metaslab has not been upgraded and does - * not support this metric. Otherwise, the return value should be in the - * range [0, 100]. + * Calculate the metaslab's fragmentation metric and set ms_fragmentation. + * Setting this value to ZFS_FRAG_INVALID means that the metaslab has not + * been upgraded and does not support this metric. Otherwise, the return + * value should be in the range [0, 100]. */ static void metaslab_set_fragmentation(metaslab_t *msp) @@ -1815,7 +2113,7 @@ metaslab_space_weight(metaslab_t *msp) /* * The baseline weight is the metaslab's free space. */ - space = msp->ms_size - space_map_allocated(msp->ms_sm); + space = msp->ms_size - metaslab_allocated_space(msp); if (metaslab_fragmentation_factor_enabled && msp->ms_fragmentation != ZFS_FRAG_INVALID) { @@ -1919,14 +2217,38 @@ metaslab_weight_from_range_tree(metaslab_t *msp) static uint64_t metaslab_weight_from_spacemap(metaslab_t *msp) { - uint64_t weight = 0; + space_map_t *sm = msp->ms_sm; + ASSERT(!msp->ms_loaded); + ASSERT(sm != NULL); + ASSERT3U(space_map_object(sm), !=, 0); + ASSERT3U(sm->sm_dbuf->db_size, ==, sizeof (space_map_phys_t)); + /* + * Create a joint histogram from all the segments that have made + * it to the metaslab's space map histogram, that are not yet + * available for allocation because they are still in the freeing + * pipeline (e.g. freeing, freed, and defer trees). Then subtract + * these segments from the space map's histogram to get a more + * accurate weight. 
+ */ + uint64_t deferspace_histogram[SPACE_MAP_HISTOGRAM_SIZE] = {0}; + for (int i = 0; i < SPACE_MAP_HISTOGRAM_SIZE; i++) + deferspace_histogram[i] += msp->ms_synchist[i]; + for (int t = 0; t < TXG_DEFER_SIZE; t++) { + for (int i = 0; i < SPACE_MAP_HISTOGRAM_SIZE; i++) { + deferspace_histogram[i] += msp->ms_deferhist[t][i]; + } + } + + uint64_t weight = 0; for (int i = SPACE_MAP_HISTOGRAM_SIZE - 1; i >= 0; i--) { - if (msp->ms_sm->sm_phys->smp_histogram[i] != 0) { - WEIGHT_SET_COUNT(weight, - msp->ms_sm->sm_phys->smp_histogram[i]); - WEIGHT_SET_INDEX(weight, i + - msp->ms_sm->sm_shift); + ASSERT3U(sm->sm_phys->smp_histogram[i], >=, + deferspace_histogram[i]); + uint64_t count = + sm->sm_phys->smp_histogram[i] - deferspace_histogram[i]; + if (count != 0) { + WEIGHT_SET_COUNT(weight, count); + WEIGHT_SET_INDEX(weight, i + sm->sm_shift); WEIGHT_SET_ACTIVE(weight, 0); break; } @@ -1951,7 +2273,7 @@ metaslab_segment_weight(metaslab_t *msp) /* * The metaslab is completely free. */ - if (space_map_allocated(msp->ms_sm) == 0) { + if (metaslab_allocated_space(msp) == 0) { int idx = highbit64(msp->ms_size) - 1; int max_idx = SPACE_MAP_HISTOGRAM_SIZE + shift - 1; @@ -1973,7 +2295,7 @@ metaslab_segment_weight(metaslab_t *msp) /* * If the metaslab is fully allocated then just make the weight 0. */ - if (space_map_allocated(msp->ms_sm) == msp->ms_size) + if (metaslab_allocated_space(msp) == msp->ms_size) return (0); /* * If the metaslab is already loaded, then use the range tree to @@ -2054,6 +2376,8 @@ metaslab_weight(metaslab_t *msp) */ if (msp->ms_loaded) msp->ms_max_size = metaslab_block_maxsize(msp); + else + ASSERT0(msp->ms_max_size); /* * Segment-based weighting requires space map histogram support. @@ -2069,6 +2393,15 @@ metaslab_weight(metaslab_t *msp) return (weight); } +void +metaslab_recalculate_weight_and_sort(metaslab_t *msp) +{ + /* note: we preserve the mask (e.g. indication of primary, etc..) */ + uint64_t was_active = msp->ms_weight & METASLAB_ACTIVE_MASK; + metaslab_group_sort(msp->ms_group, msp, + metaslab_weight(msp) | was_active); +} + static int metaslab_activate_allocator(metaslab_group_t *mg, metaslab_t *msp, int allocator, uint64_t activation_weight) @@ -2453,17 +2786,17 @@ metaslab_sync(metaslab_t *msp, uint64_t txg) VERIFY(txg <= spa_final_dirty_txg(spa)); /* - * The only state that can actually be changing concurrently with - * metaslab_sync() is the metaslab's ms_allocatable. No other - * thread can be modifying this txg's alloc, freeing, + * The only state that can actually be changing concurrently + * with metaslab_sync() is the metaslab's ms_allocatable. No + * other thread can be modifying this txg's alloc, freeing, * freed, or space_map_phys_t. We drop ms_lock whenever we - * could call into the DMU, because the DMU can call down to us - * (e.g. via zio_free()) at any time. + * could call into the DMU, because the DMU can call down to + * us (e.g. via zio_free()) at any time. * * The spa_vdev_remove_thread() can be reading metaslab state - * concurrently, and it is locked out by the ms_sync_lock. Note - * that the ms_lock is insufficient for this, because it is dropped - * by space_map_write(). + * concurrently, and it is locked out by the ms_sync_lock. + * Note that the ms_lock is insufficient for this, because it + * is dropped by space_map_write(). 
*/ tx = dmu_tx_create_assigned(spa_get_dsl(spa), txg); @@ -2475,7 +2808,9 @@ metaslab_sync(metaslab_t *msp, uint64_t txg) VERIFY0(space_map_open(&msp->ms_sm, mos, new_object, msp->ms_start, msp->ms_size, vd->vdev_ashift)); + ASSERT(msp->ms_sm != NULL); + ASSERT0(metaslab_allocated_space(msp)); } if (!range_tree_is_empty(msp->ms_checkpointing) && @@ -2523,6 +2858,11 @@ metaslab_sync(metaslab_t *msp, uint64_t txg) mutex_enter(&msp->ms_lock); } + msp->ms_allocated_space += range_tree_space(alloctree); + ASSERT3U(msp->ms_allocated_space, >=, + range_tree_space(msp->ms_freeing)); + msp->ms_allocated_space -= range_tree_space(msp->ms_freeing); + if (!range_tree_is_empty(msp->ms_checkpointing)) { ASSERT(spa_has_checkpoint(spa)); ASSERT3P(vd->vdev_checkpoint_sm, !=, NULL); @@ -2536,14 +2876,13 @@ metaslab_sync(metaslab_t *msp, uint64_t txg) space_map_write(vd->vdev_checkpoint_sm, msp->ms_checkpointing, SM_FREE, SM_NO_VDEVID, tx); mutex_enter(&msp->ms_lock); - space_map_update(vd->vdev_checkpoint_sm); spa->spa_checkpoint_info.sci_dspace += range_tree_space(msp->ms_checkpointing); vd->vdev_stat.vs_checkpoint_space += range_tree_space(msp->ms_checkpointing); ASSERT3U(vd->vdev_stat.vs_checkpoint_space, ==, - -vd->vdev_checkpoint_sm->sm_alloc); + -space_map_allocated(vd->vdev_checkpoint_sm)); range_tree_vacate(msp->ms_checkpointing, NULL, NULL); } @@ -2588,6 +2927,7 @@ metaslab_sync(metaslab_t *msp, uint64_t txg) * time we load the space map. */ space_map_histogram_add(msp->ms_sm, msp->ms_freeing, tx); + metaslab_aux_histograms_update(msp); metaslab_group_histogram_add(mg, msp); metaslab_group_histogram_verify(mg); @@ -2595,16 +2935,18 @@ metaslab_sync(metaslab_t *msp, uint64_t txg) /* * For sync pass 1, we avoid traversing this txg's free range tree - * and instead will just swap the pointers for freeing and - * freed. We can safely do this since the freed_tree is - * guaranteed to be empty on the initial pass. + * and instead will just swap the pointers for freeing and freed. + * We can safely do this since the freed_tree is guaranteed to be + * empty on the initial pass. */ if (spa_sync_pass(spa) == 1) { range_tree_swap(&msp->ms_freeing, &msp->ms_freed); + ASSERT0(msp->ms_allocated_this_txg); } else { range_tree_vacate(msp->ms_freeing, range_tree_add, msp->ms_freed); } + msp->ms_allocated_this_txg += range_tree_space(alloctree); range_tree_vacate(alloctree, NULL, NULL); ASSERT0(range_tree_space(msp->ms_allocating[txg & TXG_MASK])); @@ -2682,7 +3024,8 @@ metaslab_sync_done(metaslab_t *msp, uint64_t txg) } defer_delta = 0; - alloc_delta = space_map_alloc_delta(msp->ms_sm); + alloc_delta = msp->ms_allocated_this_txg - + range_tree_space(msp->ms_freed); if (defer_allowed) { defer_delta = range_tree_space(msp->ms_freed) - range_tree_space(*defer_tree); @@ -2714,8 +3057,9 @@ metaslab_sync_done(metaslab_t *msp, uint64_t txg) msp->ms_loaded ? range_tree_add : NULL, msp->ms_allocatable); } - space_map_update(msp->ms_sm); + msp->ms_synced_length = space_map_length(msp->ms_sm); + msp->ms_deferspace += defer_delta; ASSERT3S(msp->ms_deferspace, >=, 0); ASSERT3S(msp->ms_deferspace, <=, msp->ms_size); @@ -2726,6 +3070,7 @@ metaslab_sync_done(metaslab_t *msp, uint64_t txg) */ vdev_dirty(vd, VDD_METASLAB, msp, txg + 1); } + metaslab_aux_histograms_update_done(msp, defer_allowed); if (msp->ms_new) { msp->ms_new = B_FALSE; @@ -2733,12 +3078,12 @@ metaslab_sync_done(metaslab_t *msp, uint64_t txg) mg->mg_ms_ready++; mutex_exit(&mg->mg_lock); } + /* - * Calculate the new weights before unloading any metaslabs. 
- * This will give us the most accurate weighting. + * Re-sort metaslab within its group now that we've adjusted + * its allocatable space. */ - metaslab_group_sort(mg, msp, metaslab_weight(msp) | - (msp->ms_weight & METASLAB_ACTIVE_MASK)); + metaslab_recalculate_weight_and_sort(msp); /* * If the metaslab is loaded and we've not tried to load or allocate @@ -2765,6 +3110,7 @@ metaslab_sync_done(metaslab_t *msp, uint64_t txg) ASSERT0(range_tree_space(msp->ms_freed)); ASSERT0(range_tree_space(msp->ms_checkpointing)); + msp->ms_allocated_this_txg = 0; mutex_exit(&msp->ms_lock); } @@ -4020,7 +4366,7 @@ metaslab_alloc(spa_t *spa, metaslab_class_t *mc, uint6 zio_alloc_list_t *zal, zio_t *zio, int allocator) { dva_t *dva = bp->blk_dva; - dva_t *hintdva = hintbp->blk_dva; + dva_t *hintdva = (hintbp != NULL) ? hintbp->blk_dva : NULL; int error = 0; ASSERT(bp->blk_birth == 0); @@ -4187,14 +4533,16 @@ metaslab_check_free_impl(vdev_t *vd, uint64_t offset, msp = vd->vdev_ms[offset >> vd->vdev_ms_shift]; mutex_enter(&msp->ms_lock); - if (msp->ms_loaded) - range_tree_verify(msp->ms_allocatable, offset, size); + if (msp->ms_loaded) { + range_tree_verify_not_present(msp->ms_allocatable, + offset, size); + } - range_tree_verify(msp->ms_freeing, offset, size); - range_tree_verify(msp->ms_checkpointing, offset, size); - range_tree_verify(msp->ms_freed, offset, size); + range_tree_verify_not_present(msp->ms_freeing, offset, size); + range_tree_verify_not_present(msp->ms_checkpointing, offset, size); + range_tree_verify_not_present(msp->ms_freed, offset, size); for (int j = 0; j < TXG_DEFER_SIZE; j++) - range_tree_verify(msp->ms_defer[j], offset, size); + range_tree_verify_not_present(msp->ms_defer[j], offset, size); mutex_exit(&msp->ms_lock); } Modified: vendor-sys/illumos/dist/uts/common/fs/zfs/range_tree.c ============================================================================== --- vendor-sys/illumos/dist/uts/common/fs/zfs/range_tree.c Wed Nov 6 08:58:03 2019 (r354382) +++ vendor-sys/illumos/dist/uts/common/fs/zfs/range_tree.c Wed Nov 6 09:00:06 2019 (r354383) @@ -311,13 +311,11 @@ range_tree_find(range_tree_t *rt, uint64_t start, uint } void -range_tree_verify(range_tree_t *rt, uint64_t off, uint64_t size) +range_tree_verify_not_present(range_tree_t *rt, uint64_t off, uint64_t size) { - range_seg_t *rs; - - rs = range_tree_find(rt, off, size); + range_seg_t *rs = range_tree_find(rt, off, size); if (rs != NULL) - panic("freeing free block; rs=%p", (void *)rs); + panic("segment already in tree; rs=%p", (void *)rs); } boolean_t Modified: vendor-sys/illumos/dist/uts/common/fs/zfs/spa_checkpoint.c ============================================================================== --- vendor-sys/illumos/dist/uts/common/fs/zfs/spa_checkpoint.c Wed Nov 6 08:58:03 2019 (r354382) +++ vendor-sys/illumos/dist/uts/common/fs/zfs/spa_checkpoint.c Wed Nov 6 09:00:06 2019 (r354383) @@ -129,7 +129,7 @@ * uberblock would reference data in the removed device. For this reason * and others of similar nature, we disallow the following operations that * can change the config: - * vdev removal and attach/detach, mirror splitting, and pool reguid. + * vdev removal and attach/detach, mirror splitting, and pool reguid. 
* * - As most of the checkpoint logic is implemented in the SPA and doesn't * distinguish datasets when it comes to space accounting, having a @@ -262,7 +262,7 @@ spa_checkpoint_accounting_verify(spa_t *spa) if (vd->vdev_checkpoint_sm != NULL) { ckpoint_sm_space_sum += - -vd->vdev_checkpoint_sm->sm_alloc; + -space_map_allocated(vd->vdev_checkpoint_sm); vs_ckpoint_space_sum += vd->vdev_stat.vs_checkpoint_space; ASSERT3U(ckpoint_sm_space_sum, ==, @@ -347,7 +347,7 @@ spa_checkpoint_discard_thread_sync(void *arg, dmu_tx_t error, vd->vdev_id); } ASSERT0(words_after); - ASSERT0(vd->vdev_checkpoint_sm->sm_alloc); + ASSERT0(space_map_allocated(vd->vdev_checkpoint_sm)); ASSERT0(space_map_length(vd->vdev_checkpoint_sm)); space_map_free(vd->vdev_checkpoint_sm, tx); Modified: vendor-sys/illumos/dist/uts/common/fs/zfs/space_map.c ============================================================================== --- vendor-sys/illumos/dist/uts/common/fs/zfs/space_map.c Wed Nov 6 08:58:03 2019 (r354382) +++ vendor-sys/illumos/dist/uts/common/fs/zfs/space_map.c Wed Nov 6 09:00:06 2019 (r354383) @@ -23,7 +23,7 @@ * Use is subject to license terms. */ /* - * Copyright (c) 2012, 2017 by Delphix. All rights reserved. + * Copyright (c) 2012, 2018 by Delphix. All rights reserved. */ #include @@ -81,20 +81,22 @@ sm_entry_is_double_word(uint64_t e) /* * Iterate through the space map, invoking the callback on each (non-debug) - * space map entry. + * space map entry. Stop after reading 'end' bytes of the space map. */ int -space_map_iterate(space_map_t *sm, sm_cb_t callback, void *arg) +space_map_iterate(space_map_t *sm, uint64_t end, sm_cb_t callback, void *arg) { - uint64_t sm_len = space_map_length(sm); - ASSERT3U(sm->sm_blksz, !=, 0); + uint64_t blksz = sm->sm_blksz; - dmu_prefetch(sm->sm_os, space_map_object(sm), 0, 0, sm_len, + ASSERT3U(blksz, !=, 0); + ASSERT3U(end, <=, space_map_length(sm)); + ASSERT0(P2PHASE(end, sizeof (uint64_t))); + + dmu_prefetch(sm->sm_os, space_map_object(sm), 0, 0, end, ZIO_PRIORITY_SYNC_READ); - uint64_t blksz = sm->sm_blksz; int error = 0; - for (uint64_t block_base = 0; block_base < sm_len && error == 0; + for (uint64_t block_base = 0; block_base < end && error == 0; block_base += blksz) { dmu_buf_t *db; error = dmu_buf_hold(sm->sm_os, space_map_object(sm), @@ -103,7 +105,7 @@ space_map_iterate(space_map_t *sm, sm_cb_t callback, v return (error); uint64_t *block_start = db->db_data; - uint64_t block_length = MIN(sm_len - block_base, blksz); + uint64_t block_length = MIN(end - block_base, blksz); uint64_t *block_end = block_start + (block_length / sizeof (uint64_t)); @@ -186,7 +188,7 @@ space_map_reversed_last_block_entries(space_map_t *sm, * dmu_buf_hold(). 
*/ uint64_t last_word_offset = - sm->sm_phys->smp_objsize - sizeof (uint64_t); + sm->sm_phys->smp_length - sizeof (uint64_t); error = dmu_buf_hold(sm->sm_os, space_map_object(sm), last_word_offset, FTAG, &db, DMU_READ_NO_PREFETCH); if (error != 0) @@ -199,7 +201,7 @@ space_map_reversed_last_block_entries(space_map_t *sm, uint64_t *words = db->db_data; *nwords = - (sm->sm_phys->smp_objsize - db->db_offset) / sizeof (uint64_t); + (sm->sm_phys->smp_length - db->db_offset) / sizeof (uint64_t); ASSERT3U(*nwords, <=, bufsz / sizeof (uint64_t)); @@ -298,8 +300,7 @@ space_map_incremental_destroy(space_map_t *sm, sm_cb_t uint64_t e = buf[i]; if (sm_entry_is_debug(e)) { - sm->sm_phys->smp_objsize -= sizeof (uint64_t); - space_map_update(sm); + sm->sm_phys->smp_length -= sizeof (uint64_t); continue; } @@ -354,15 +355,13 @@ space_map_incremental_destroy(space_map_t *sm, sm_cb_t sm->sm_phys->smp_alloc -= entry_run; else sm->sm_phys->smp_alloc += entry_run; - sm->sm_phys->smp_objsize -= words * sizeof (uint64_t); - space_map_update(sm); + sm->sm_phys->smp_length -= words * sizeof (uint64_t); } } if (space_map_length(sm) == 0) { ASSERT0(error); - ASSERT0(sm->sm_phys->smp_objsize); - ASSERT0(sm->sm_alloc); + ASSERT0(space_map_allocated(sm)); } zio_buf_free(buf, bufsz); @@ -391,38 +390,42 @@ space_map_load_callback(space_map_entry_t *sme, void * } /* - * Load the space map disk into the specified range tree. Segments of maptype - * are added to the range tree, other segment types are removed. + * Load the spacemap into the rangetree, like space_map_load. But only + * read the first 'length' bytes of the spacemap. */ int -space_map_load(space_map_t *sm, range_tree_t *rt, maptype_t maptype) +space_map_load_length(space_map_t *sm, range_tree_t *rt, maptype_t maptype, + uint64_t length) { - uint64_t space; - int err; space_map_load_arg_t smla; VERIFY0(range_tree_space(rt)); - space = space_map_allocated(sm); - if (maptype == SM_FREE) { + if (maptype == SM_FREE) range_tree_add(rt, sm->sm_start, sm->sm_size); - space = sm->sm_size - space; - } smla.smla_rt = rt; smla.smla_sm = sm; smla.smla_type = maptype; - err = space_map_iterate(sm, space_map_load_callback, &smla); + int err = space_map_iterate(sm, length, + space_map_load_callback, &smla); - if (err == 0) { - VERIFY3U(range_tree_space(rt), ==, space); - } else { + if (err != 0) range_tree_vacate(rt, NULL, NULL); - } return (err); } +/* + * Load the space map disk into the specified range tree. Segments of maptype + * are added to the range tree, other segment types are removed. *** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
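The central idea running through the diff above is that a metaslab now remembers how much of its space map has been synced (ms_synced_length, recorded in metaslab_sync_done()) and, on load, reads only that prefix via space_map_load_length() after dropping ms_lock, so entries appended concurrently by metaslab_sync() are simply ignored. The stand-alone sketch below is not illumos code -- every toy_* name is invented for illustration -- but it models that snapshot-the-length-then-drop-the-lock pattern with a plain pthread mutex, assuming a space map reduced to a flat array of 64-bit words.

/*
 * Toy model of the ms_synced_length pattern; not illumos code.
 * All toy_* names are invented for this illustration.
 */
#include <inttypes.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define	TOY_SM_CAP	1024

typedef struct toy_space_map {
	pthread_mutex_t	sm_lock;		/* stands in for ms_lock */
	uint64_t	sm_words[TOY_SM_CAP];	/* stands in for on-disk entries */
	uint64_t	sm_length;		/* bytes appended so far */
} toy_space_map_t;

/* Read only the first 'length' bytes, in the spirit of space_map_load_length(). */
static uint64_t
toy_space_map_load_length(const toy_space_map_t *sm, uint64_t length)
{
	uint64_t sum = 0;
	for (uint64_t i = 0; i < length / sizeof (uint64_t); i++)
		sum += sm->sm_words[i];
	return (sum);
}

/* Mimics metaslab_load_impl(): snapshot the synced length, then drop the lock. */
static void
toy_metaslab_load(toy_space_map_t *sm)
{
	pthread_mutex_lock(&sm->sm_lock);
	uint64_t synced_length = sm->sm_length;	/* like ms_synced_length */
	pthread_mutex_unlock(&sm->sm_lock);

	/*
	 * The "slow" read happens without the lock held; anything a
	 * concurrent writer appends past synced_length belongs to a
	 * later sync and is ignored by this load.
	 */
	uint64_t sum = toy_space_map_load_length(sm, synced_length);

	printf("loaded %" PRIu64 " bytes, sum=%" PRIu64 "\n",
	    synced_length, sum);
}

int
main(void)
{
	toy_space_map_t sm = { .sm_lock = PTHREAD_MUTEX_INITIALIZER };

	for (uint64_t i = 0; i < 8; i++) {
		sm.sm_words[i] = i;
		sm.sm_length += sizeof (uint64_t);
	}
	toy_metaslab_load(&sm);
	return (0);
}

Built with something like "cc -pthread toy_sm.c" (a hypothetical file name), the program loads the eight words appended before the snapshot and prints their total, regardless of anything a writer might append afterwards.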