From owner-svn-src-stable@freebsd.org  Wed Jul 26 16:25:48 2017
From: Alexander Motin <mav@FreeBSD.org>
Date: Wed, 26 Jul 2017 16:25:47 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org,
    svn-src-stable@freebsd.org, svn-src-stable-11@freebsd.org
Subject: svn commit: r321539 - in
    stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs: . sys
Message-Id: <201707261625.v6QGPlUB068112@repo.freebsd.org>

Author: mav
Date: Wed Jul 26 16:25:46 2017
New Revision: 321539
URL: https://svnweb.freebsd.org/changeset/base/321539

Log:
  MFC r317527: MFV 316898
  7613 ms_freetree[4] is only used in syncing context

  illumos/illumos-gate@5f145778012b555e084eacc858ead9e1e42bd149
  https://github.com/illumos/illumos-gate/commit/5f145778012b555e084eacc858ead9e1e42bd149

  https://www.illumos.org/issues/7613
    metaslab_t:ms_freetree[TXG_SIZE] is only used in syncing context. We should
    replace it with two trees: the freeing tree (ranges that we are freeing this
    syncing txg) and the freed tree (ranges which have been freed this txg).
  Reviewed by: George Wilson
  Reviewed by: Alex Reece
  Approved by: Dan McDonald
  Author: Matthew Ahrens

Modified:
  stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c
  stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h

Directory Properties:
  stable/11/   (props changed)

Modified: stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c
==============================================================================
--- stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c  Wed Jul 26 16:24:53 2017  (r321538)
+++ stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c  Wed Jul 26 16:25:46 2017  (r321539)
@@ -533,7 +533,6 @@ metaslab_verify_space(metaslab_t *msp, uint64_t txg)
 {
         spa_t *spa = msp->ms_group->mg_vd->vdev_spa;
         uint64_t allocated = 0;
-        uint64_t freed = 0;
         uint64_t sm_free_space, msp_free_space;
 
         ASSERT(MUTEX_HELD(&msp->ms_lock));
@@ -563,10 +562,9 @@ metaslab_verify_space(metaslab_t *msp, uint64_t txg)
                 allocated +=
                     range_tree_space(msp->ms_alloctree[(txg + t) & TXG_MASK]);
         }
 
-        freed = range_tree_space(msp->ms_freetree[TXG_CLEAN(txg) & TXG_MASK]);
         msp_free_space = range_tree_space(msp->ms_tree) + allocated +
-            msp->ms_deferspace + freed;
+            msp->ms_deferspace + range_tree_space(msp->ms_freedtree);
 
         VERIFY3U(sm_free_space, ==, msp_free_space);
 }
@@ -1499,7 +1497,7 @@ metaslab_init(metaslab_group_t *mg, uint64_t id, uint6
 
         /*
          * We create the main range tree here, but we don't create the
-         * alloctree and freetree until metaslab_sync_done(). This serves
+         * other range trees until metaslab_sync_done(). This serves
          * two purposes: it allows metaslab_sync_done() to detect the
          * addition of new space; and for debugging, it ensures that we'd
          * data fault on any attempt to use this metaslab before it's ready.
@@ -1557,10 +1555,11 @@ metaslab_fini(metaslab_t *msp)
 
         metaslab_unload(msp);
         range_tree_destroy(msp->ms_tree);
+        range_tree_destroy(msp->ms_freeingtree);
+        range_tree_destroy(msp->ms_freedtree);
 
         for (int t = 0; t < TXG_SIZE; t++) {
                 range_tree_destroy(msp->ms_alloctree[t]);
-                range_tree_destroy(msp->ms_freetree[t]);
         }
 
         for (int t = 0; t < TXG_DEFER_SIZE; t++) {
@@ -2171,7 +2170,6 @@ static void
 metaslab_condense(metaslab_t *msp, uint64_t txg, dmu_tx_t *tx)
 {
         spa_t *spa = msp->ms_group->mg_vd->vdev_spa;
-        range_tree_t *freetree = msp->ms_freetree[txg & TXG_MASK];
         range_tree_t *condense_tree;
         space_map_t *sm = msp->ms_sm;
 
@@ -2202,9 +2200,9 @@ metaslab_condense(metaslab_t *msp, uint64_t txg, dmu_t
         /*
          * Remove what's been freed in this txg from the condense_tree.
          * Since we're in sync_pass 1, we know that all the frees from
-         * this txg are in the freetree.
+         * this txg are in the freeingtree.
          */
-        range_tree_walk(freetree, range_tree_remove, condense_tree);
+        range_tree_walk(msp->ms_freeingtree, range_tree_remove, condense_tree);
 
         for (int t = 0; t < TXG_DEFER_SIZE; t++) {
                 range_tree_walk(msp->ms_defertree[t],
@@ -2260,9 +2258,6 @@ metaslab_sync(metaslab_t *msp, uint64_t txg)
         spa_t *spa = vd->vdev_spa;
         objset_t *mos = spa_meta_objset(spa);
         range_tree_t *alloctree = msp->ms_alloctree[txg & TXG_MASK];
-        range_tree_t **freetree = &msp->ms_freetree[txg & TXG_MASK];
-        range_tree_t **freed_tree =
-            &msp->ms_freetree[TXG_CLEAN(txg) & TXG_MASK];
         dmu_tx_t *tx;
         uint64_t object = space_map_object(msp->ms_sm);
 
@@ -2271,14 +2266,14 @@ metaslab_sync(metaslab_t *msp, uint64_t txg)
         /*
          * This metaslab has just been added so there's no work to do now.
          */
-        if (*freetree == NULL) {
+        if (msp->ms_freeingtree == NULL) {
                 ASSERT3P(alloctree, ==, NULL);
                 return;
         }
 
         ASSERT3P(alloctree, !=, NULL);
-        ASSERT3P(*freetree, !=, NULL);
-        ASSERT3P(*freed_tree, !=, NULL);
+        ASSERT3P(msp->ms_freeingtree, !=, NULL);
+        ASSERT3P(msp->ms_freedtree, !=, NULL);
 
         /*
          * Normally, we don't want to process a metaslab if there
@@ -2286,14 +2281,14 @@ metaslab_sync(metaslab_t *msp, uint64_t txg)
          * is being forced to condense we need to let it through.
          */
         if (range_tree_space(alloctree) == 0 &&
-            range_tree_space(*freetree) == 0 &&
+            range_tree_space(msp->ms_freeingtree) == 0 &&
             !msp->ms_condense_wanted)
                 return;
 
         /*
          * The only state that can actually be changing concurrently with
          * metaslab_sync() is the metaslab's ms_tree. No other thread can
-         * be modifying this txg's alloctree, freetree, freed_tree, or
+         * be modifying this txg's alloctree, freeingtree, freedtree, or
          * space_map_phys_t. Therefore, we only hold ms_lock to satify
          * space map ASSERTs. We drop it whenever we call into the DMU,
         * because the DMU can call down to us (e.g. via zio_free()) at
@@ -2330,7 +2325,7 @@ metaslab_sync(metaslab_t *msp, uint64_t txg)
                 metaslab_condense(msp, txg, tx);
         } else {
                 space_map_write(msp->ms_sm, alloctree, SM_ALLOC, tx);
-                space_map_write(msp->ms_sm, *freetree, SM_FREE, tx);
+                space_map_write(msp->ms_sm, msp->ms_freeingtree, SM_FREE, tx);
         }
 
         if (msp->ms_loaded) {
@@ -2350,7 +2345,7 @@ metaslab_sync(metaslab_t *msp, uint64_t txg)
                  * to accurately reflect all free space even if some space
                  * is not yet available for allocation (i.e. deferred).
                  */
-                space_map_histogram_add(msp->ms_sm, *freed_tree, tx);
+                space_map_histogram_add(msp->ms_sm, msp->ms_freedtree, tx);
 
                 /*
                  * Add back any deferred free space that has not been
@@ -2372,7 +2367,7 @@ metaslab_sync(metaslab_t *msp, uint64_t txg)
          * then we will lose some accuracy but will correct it the next
          * time we load the space map.
          */
-        space_map_histogram_add(msp->ms_sm, *freetree, tx);
+        space_map_histogram_add(msp->ms_sm, msp->ms_freeingtree, tx);
 
         metaslab_group_histogram_add(mg, msp);
         metaslab_group_histogram_verify(mg);
@@ -2380,20 +2375,21 @@ metaslab_sync(metaslab_t *msp, uint64_t txg)
 
         /*
          * For sync pass 1, we avoid traversing this txg's free range tree
-         * and instead will just swap the pointers for freetree and
-         * freed_tree. We can safely do this since the freed_tree is
+         * and instead will just swap the pointers for freeingtree and
+         * freedtree. We can safely do this since the freed_tree is
          * guaranteed to be empty on the initial pass.
          */
         if (spa_sync_pass(spa) == 1) {
-                range_tree_swap(freetree, freed_tree);
+                range_tree_swap(&msp->ms_freeingtree, &msp->ms_freedtree);
         } else {
-                range_tree_vacate(*freetree, range_tree_add, *freed_tree);
+                range_tree_vacate(msp->ms_freeingtree,
+                    range_tree_add, msp->ms_freedtree);
         }
         range_tree_vacate(alloctree, NULL, NULL);
 
         ASSERT0(range_tree_space(msp->ms_alloctree[txg & TXG_MASK]));
         ASSERT0(range_tree_space(msp->ms_alloctree[TXG_CLEAN(txg) & TXG_MASK]));
-        ASSERT0(range_tree_space(msp->ms_freetree[txg & TXG_MASK]));
+        ASSERT0(range_tree_space(msp->ms_freeingtree));
 
         mutex_exit(&msp->ms_lock);
 
@@ -2415,7 +2411,6 @@ metaslab_sync_done(metaslab_t *msp, uint64_t txg)
         metaslab_group_t *mg = msp->ms_group;
         vdev_t *vd = mg->mg_vd;
         spa_t *spa = vd->vdev_spa;
-        range_tree_t **freed_tree;
         range_tree_t **defer_tree;
         int64_t alloc_delta, defer_delta;
         boolean_t defer_allowed = B_TRUE;
@@ -2426,20 +2421,24 @@ metaslab_sync_done(metaslab_t *msp, uint64_t txg)
 
         /*
          * If this metaslab is just becoming available, initialize its
-         * alloctrees, freetrees, and defertree and add its capacity to
-         * the vdev.
+         * range trees and add its capacity to the vdev.
          */
-        if (msp->ms_freetree[TXG_CLEAN(txg) & TXG_MASK] == NULL) {
+        if (msp->ms_freedtree == NULL) {
                 for (int t = 0; t < TXG_SIZE; t++) {
                         ASSERT(msp->ms_alloctree[t] == NULL);
-                        ASSERT(msp->ms_freetree[t] == NULL);
 
                         msp->ms_alloctree[t] = range_tree_create(NULL, msp,
                             &msp->ms_lock);
-                        msp->ms_freetree[t] = range_tree_create(NULL, msp,
-                            &msp->ms_lock);
                 }
 
+                ASSERT3P(msp->ms_freeingtree, ==, NULL);
+                msp->ms_freeingtree = range_tree_create(NULL, msp,
+                    &msp->ms_lock);
+
+                ASSERT3P(msp->ms_freedtree, ==, NULL);
+                msp->ms_freedtree = range_tree_create(NULL, msp,
+                    &msp->ms_lock);
+
                 for (int t = 0; t < TXG_DEFER_SIZE; t++) {
                         ASSERT(msp->ms_defertree[t] == NULL);
@@ -2450,7 +2449,6 @@ metaslab_sync_done(metaslab_t *msp, uint64_t txg)
                 vdev_space_update(vd, 0, 0, msp->ms_size);
         }
 
-        freed_tree = &msp->ms_freetree[TXG_CLEAN(txg) & TXG_MASK];
         defer_tree = &msp->ms_defertree[txg % TXG_DEFER_SIZE];
 
         uint64_t free_space = metaslab_class_get_space(spa_normal_class(spa)) -
@@ -2462,7 +2460,7 @@ metaslab_sync_done(metaslab_t *msp, uint64_t txg)
         defer_delta = 0;
         alloc_delta = space_map_alloc_delta(msp->ms_sm);
         if (defer_allowed) {
-                defer_delta = range_tree_space(*freed_tree) -
+                defer_delta = range_tree_space(msp->ms_freedtree) -
                     range_tree_space(*defer_tree);
         } else {
                 defer_delta -= range_tree_space(*defer_tree);
@@ -2470,9 +2468,6 @@ metaslab_sync_done(metaslab_t *msp, uint64_t txg)
 
         vdev_space_update(vd, alloc_delta + defer_delta, defer_delta, 0);
 
-        ASSERT0(range_tree_space(msp->ms_alloctree[txg & TXG_MASK]));
-        ASSERT0(range_tree_space(msp->ms_freetree[txg & TXG_MASK]));
-
         /*
          * If there's a metaslab_load() in progress, wait for it to complete
          * so that we have a consistent view of the in-core space map.
@@ -2488,9 +2483,9 @@ metaslab_sync_done(metaslab_t *msp, uint64_t txg)
         range_tree_vacate(*defer_tree,
             msp->ms_loaded ? range_tree_add : NULL, msp->ms_tree);
         if (defer_allowed) {
-                range_tree_swap(freed_tree, defer_tree);
+                range_tree_swap(&msp->ms_freedtree, defer_tree);
         } else {
-                range_tree_vacate(*freed_tree,
+                range_tree_vacate(msp->ms_freedtree,
                     msp->ms_loaded ? range_tree_add : NULL, msp->ms_tree);
         }
 
@@ -3250,10 +3245,10 @@ metaslab_free_dva(spa_t *spa, const dva_t *dva, uint64
                 range_tree_add(msp->ms_tree, offset, size);
                 msp->ms_max_size = metaslab_block_maxsize(msp);
         } else {
-                if (range_tree_space(msp->ms_freetree[txg & TXG_MASK]) == 0)
+                VERIFY3U(txg, ==, spa->spa_syncing_txg);
+                if (range_tree_space(msp->ms_freeingtree) == 0)
                         vdev_dirty(vd, VDD_METASLAB, msp, txg);
-                range_tree_add(msp->ms_freetree[txg & TXG_MASK],
-                    offset, size);
+                range_tree_add(msp->ms_freeingtree, offset, size);
         }
 
         mutex_exit(&msp->ms_lock);
@@ -3485,8 +3480,8 @@ metaslab_check_free(spa_t *spa, const blkptr_t *bp)
                 if (msp->ms_loaded)
                         range_tree_verify(msp->ms_tree, offset, size);
 
-                for (int j = 0; j < TXG_SIZE; j++)
-                        range_tree_verify(msp->ms_freetree[j], offset, size);
+                range_tree_verify(msp->ms_freeingtree, offset, size);
+                range_tree_verify(msp->ms_freedtree, offset, size);
                 for (int j = 0; j < TXG_DEFER_SIZE; j++)
                         range_tree_verify(msp->ms_defertree[j], offset, size);
         }

Modified: stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h
==============================================================================
--- stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h  Wed Jul 26 16:24:53 2017  (r321538)
+++ stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h  Wed Jul 26 16:25:46 2017  (r321539)
@@ -255,21 +255,24 @@ struct metaslab_group {
 #define MAX_LBAS        64
 
 /*
- * Each metaslab maintains a set of in-core trees to track metaslab operations.
- * The in-core free tree (ms_tree) contains the current list of free segments.
- * As blocks are allocated, the allocated segment are removed from the ms_tree
- * and added to a per txg allocation tree (ms_alloctree). As blocks are freed,
- * they are added to the per txg free tree (ms_freetree). These per txg
- * trees allow us to process all allocations and frees in syncing context
- * where it is safe to update the on-disk space maps. One additional in-core
- * tree is maintained to track deferred frees (ms_defertree). Once a block
- * is freed it will move from the ms_freetree to the ms_defertree. A deferred
- * free means that a block has been freed but cannot be used by the pool
- * until TXG_DEFER_SIZE transactions groups later. For example, a block
- * that is freed in txg 50 will not be available for reallocation until
- * txg 52 (50 + TXG_DEFER_SIZE). This provides a safety net for uberblock
- * rollback. A pool could be safely rolled back TXG_DEFERS_SIZE
- * transactions groups and ensure that no block has been reallocated.
+ * Each metaslab maintains a set of in-core trees to track metaslab
+ * operations.  The in-core free tree (ms_tree) contains the list of
+ * free segments which are eligible for allocation.  As blocks are
+ * allocated, the allocated segments are removed from the ms_tree and
+ * added to a per txg allocation tree (ms_alloctree).  This allows us to
+ * process all allocations in syncing context where it is safe to update
+ * the on-disk space maps.  Frees are also processed in syncing context.
+ * Most frees are generated from syncing context, and those that are not
+ * are held in the spa_free_bplist for processing in syncing context.
+ * An additional set of in-core trees is maintained to track deferred
+ * frees (ms_defertree).  Once a block is freed it will move from the
+ * ms_freedtree to the ms_defertree.  A deferred free means that a block
+ * has been freed but cannot be used by the pool until TXG_DEFER_SIZE
+ * transactions groups later.  For example, a block that is freed in txg
+ * 50 will not be available for reallocation until txg 52 (50 +
+ * TXG_DEFER_SIZE).  This provides a safety net for uberblock rollback.
+ * A pool could be safely rolled back TXG_DEFERS_SIZE transactions
+ * groups and ensure that no block has been reallocated.
 *
 * The simplified transition diagram looks like this:
 *
@@ -277,33 +280,34 @@ struct metaslab_group {
 *                      ALLOCATE
 *                         |
 *                         V
- *    free segment (ms_tree) --------> ms_alloctree ----> (write to space map)
+ *    free segment (ms_tree) -----> ms_alloctree[4] ----> (write to space map)
 *                         ^
- *                         |
- *                         |  ms_freetree <--- FREE
+ *                         |  ms_freeingtree <--- FREE
 *                         |         |
+ *                         |         v
+ *                         |   ms_freedtree
 *                         |         |
- *                         |         |
- *                         +----------- ms_defertree <-------+---------> (write to space map)
+ *                         +-------- ms_defertree[2] <-------+---------> (write to space map)
 *
 *
 * Each metaslab's space is tracked in a single space map in the MOS,
- * which is only updated in syncing context. Each time we sync a txg,
- * we append the allocs and frees from that txg to the space map.
- * The pool space is only updated once all metaslabs have finished syncing.
+ * which is only updated in syncing context.  Each time we sync a txg,
+ * we append the allocs and frees from that txg to the space map.  The
+ * pool space is only updated once all metaslabs have finished syncing.
 *
- * To load the in-core free tree we read the space map from disk.
- * This object contains a series of alloc and free records that are
- * combined to make up the list of all free segments in this metaslab. These
+ * To load the in-core free tree we read the space map from disk.  This
+ * object contains a series of alloc and free records that are combined
+ * to make up the list of all free segments in this metaslab.  These
 * segments are represented in-core by the ms_tree and are stored in an
 * AVL tree.
 *
 * As the space map grows (as a result of the appends) it will
- * eventually become space-inefficient. When the metaslab's in-core free tree
- * is zfs_condense_pct/100 times the size of the minimal on-disk
- * representation, we rewrite it in its minimized form. If a metaslab
- * needs to condense then we must set the ms_condensing flag to ensure
- * that allocations are not performed on the metaslab that is being written.
+ * eventually become space-inefficient.  When the metaslab's in-core
+ * free tree is zfs_condense_pct/100 times the size of the minimal
+ * on-disk representation, we rewrite it in its minimized form.  If a
+ * metaslab needs to condense then we must set the ms_condensing flag to
+ * ensure that allocations are not performed on the metaslab that is
+ * being written.
 */
 struct metaslab {
         kmutex_t        ms_lock;
@@ -315,9 +319,16 @@ struct metaslab {
         uint64_t        ms_fragmentation;
 
         range_tree_t    *ms_alloctree[TXG_SIZE];
-        range_tree_t    *ms_freetree[TXG_SIZE];
-        range_tree_t    *ms_defertree[TXG_DEFER_SIZE];
         range_tree_t    *ms_tree;
+
+        /*
+         * The following range trees are accessed only from syncing context.
+         * ms_free*tree only have entries while syncing, and are empty
+         * between syncs.
+         */
+        range_tree_t    *ms_freeingtree; /* to free this syncing txg */
+        range_tree_t    *ms_freedtree; /* already freed this syncing txg */
+        range_tree_t    *ms_defertree[TXG_DEFER_SIZE];
 
         boolean_t       ms_condensing;  /* condensing? */
         boolean_t       ms_condense_wanted;
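
The handoff that this patch sets up can be summarized apart from the diff: frees issued
while a txg is syncing accumulate in ms_freeingtree; in sync pass 1 the code simply swaps
the ms_freeingtree and ms_freedtree pointers (the freed tree is guaranteed empty at that
point); later passes vacate the freeing tree into the freed tree; and metaslab_sync_done()
rotates the freed tree into the TXG_DEFER_SIZE-entry defer ring, whose oldest slot becomes
allocatable again.  The stand-alone C sketch below models only that flow; range_set_t,
sync_pass() and sync_done() are invented stand-ins for this illustration, not the ZFS
range_tree API, and the sketch is not part of the commit.

/*
 * Illustrative stand-alone model of the freeing -> freed -> defer flow.
 * range_set_t, sync_pass() and sync_done() are invented for this sketch
 * and are NOT the ZFS range_tree_t API.
 */
#include <assert.h>
#include <stdio.h>

#define TXG_DEFER_SIZE 2

typedef struct { unsigned long space; } range_set_t;   /* bytes tracked */

static range_set_t freeing;               /* frees issued this syncing txg */
static range_set_t freed;                 /* frees already handed to the space map */
static range_set_t defer[TXG_DEFER_SIZE]; /* not reusable for two more txgs */
static range_set_t allocatable;           /* segments eligible for allocation */

/* Models metaslab_sync(): pass 1 may simply exchange the two pointers. */
static void
sync_pass(int pass, range_set_t **freeingp, range_set_t **freedp)
{
        if (pass == 1) {
                /* The freed set is guaranteed empty on the initial pass. */
                assert((*freedp)->space == 0);
                range_set_t *tmp = *freeingp;
                *freeingp = *freedp;
                *freedp = tmp;
        } else {
                /* Later passes drain the freeing set into the freed set. */
                (*freedp)->space += (*freeingp)->space;
                (*freeingp)->space = 0;
        }
}

/*
 * Models metaslab_sync_done(): this txg's freed space starts deferring,
 * and the slot freed TXG_DEFER_SIZE txgs ago becomes allocatable again.
 */
static void
sync_done(unsigned long txg, range_set_t *freedp)
{
        range_set_t *slot = &defer[txg % TXG_DEFER_SIZE];

        allocatable.space += slot->space;       /* old deferred frees return */
        slot->space = freedp->space;            /* new frees begin deferring */
        freedp->space = 0;
}

int
main(void)
{
        range_set_t *freeingp = &freeing, *freedp = &freed;

        freeing.space = 4096;   /* pretend a 4 KB block is freed in txg 50 */
        sync_pass(1, &freeingp, &freedp);
        sync_done(50, freedp);
        sync_done(51, freedp);
        sync_done(52, freedp);  /* txg 50's free is reusable at txg 52 */
        printf("allocatable after txg 52: %lu bytes\n", allocatable.space);
        return (0);
}

Under this model a block freed in txg 50 becomes allocatable again once txg 52 completes,
matching the TXG_DEFER_SIZE example in the metaslab_impl.h comment, and the pass-1 pointer
swap is what lets metaslab_condense() assume that every free from the current txg is still
sitting in ms_freeingtree.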