From owner-svn-src-all@freebsd.org Mon Jul 30 23:53:26 2018
Message-Id: <201807302353.w6UNrPFQ074418@repo.freebsd.org>
From: Alexander Motin
Date: Mon, 30 Jul 2018 23:53:25 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-vendor@freebsd.org
Subject: svn commit: r336948 - in vendor-sys/illumos/dist/uts/common/fs/zfs: . sys
X-SVN-Group: vendor-sys
X-SVN-Commit-Author: mav
X-SVN-Commit-Paths: in vendor-sys/illumos/dist/uts/common/fs/zfs: . sys
X-SVN-Commit-Revision: 336948
X-SVN-Commit-Repository: base

Author: mav
Date: Mon Jul 30 23:53:25 2018
New Revision: 336948
URL: https://svnweb.freebsd.org/changeset/base/336948

Log:
  9112 Improve allocation performance on high-end systems

  On high-end systems running async sequential write workloads,
  especially NUMA systems with flash or NVMe storage, one significant
  performance bottleneck is selecting a metaslab to do allocations from.
  This process can be parallelized, providing significant performance
  increases for these workloads.

  illumos/illumos-gate@f78cdc34af236a6199dd9e21376f4a46348c0d56

  Reviewed by: Matthew Ahrens
  Reviewed by: George Wilson
  Reviewed by: Serapheim Dimitropoulos
  Reviewed by: Alexander Motin
  Approved by: Gordon Ross
  Author: Paul Dagnelie

Modified:
  vendor-sys/illumos/dist/uts/common/fs/zfs/metaslab.c
  vendor-sys/illumos/dist/uts/common/fs/zfs/spa.c
  vendor-sys/illumos/dist/uts/common/fs/zfs/spa_misc.c
  vendor-sys/illumos/dist/uts/common/fs/zfs/sys/metaslab.h
  vendor-sys/illumos/dist/uts/common/fs/zfs/sys/metaslab_impl.h
  vendor-sys/illumos/dist/uts/common/fs/zfs/sys/spa_impl.h
  vendor-sys/illumos/dist/uts/common/fs/zfs/sys/vdev_impl.h
  vendor-sys/illumos/dist/uts/common/fs/zfs/sys/zio.h
  vendor-sys/illumos/dist/uts/common/fs/zfs/vdev.c
  vendor-sys/illumos/dist/uts/common/fs/zfs/vdev_queue.c
  vendor-sys/illumos/dist/uts/common/fs/zfs/vdev_removal.c
  vendor-sys/illumos/dist/uts/common/fs/zfs/zil.c
  vendor-sys/illumos/dist/uts/common/fs/zfs/zio.c

Modified: vendor-sys/illumos/dist/uts/common/fs/zfs/metaslab.c
============================================================================== --- vendor-sys/illumos/dist/uts/common/fs/zfs/metaslab.c Mon Jul 30 23:47:38 2018 (r336947) +++ vendor-sys/illumos/dist/uts/common/fs/zfs/metaslab.c Mon Jul 30 23:53:25 2018 (r336948) @@ -20,7 +20,7 @@ */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. - * Copyright (c) 2011, 2015 by Delphix. All rights reserved. + * Copyright (c) 2011, 2018 by Delphix. All rights reserved. * Copyright (c) 2013 by Saso Kiselkov. All rights reserved. * Copyright (c) 2014 Integros [integros.com] */ @@ -212,6 +212,8 @@ static uint64_t metaslab_weight(metaslab_t *); static void metaslab_set_fragmentation(metaslab_t *); static void metaslab_free_impl(vdev_t *, uint64_t, uint64_t, boolean_t); static void metaslab_check_free_impl(vdev_t *, uint64_t, uint64_t); +static void metaslab_passivate(metaslab_t *msp, uint64_t weight); +static uint64_t metaslab_weight_from_range_tree(metaslab_t *msp); kmem_cache_t *metaslab_alloc_trace_cache; @@ -231,7 +233,12 @@ metaslab_class_create(spa_t *spa, metaslab_ops_t *ops) mc->mc_rotor = NULL; mc->mc_ops = ops; mutex_init(&mc->mc_lock, NULL, MUTEX_DEFAULT, NULL); - refcount_create_tracked(&mc->mc_alloc_slots); + mc->mc_alloc_slots = kmem_zalloc(spa->spa_alloc_count * + sizeof (refcount_t), KM_SLEEP); + mc->mc_alloc_max_slots = kmem_zalloc(spa->spa_alloc_count * + sizeof (uint64_t), KM_SLEEP); + for (int i = 0; i < spa->spa_alloc_count; i++) + refcount_create_tracked(&mc->mc_alloc_slots[i]); return (mc); } @@ -245,7 +252,12 @@ metaslab_class_destroy(metaslab_class_t *mc) ASSERT(mc->mc_space == 0); ASSERT(mc->mc_dspace == 0); - refcount_destroy(&mc->mc_alloc_slots); + for (int i = 0; i < mc->mc_spa->spa_alloc_count; i++) + refcount_destroy(&mc->mc_alloc_slots[i]); + kmem_free(mc->mc_alloc_slots, mc->mc_spa->spa_alloc_count * + sizeof (refcount_t)); + kmem_free(mc->mc_alloc_max_slots, mc->mc_spa->spa_alloc_count * + sizeof (uint64_t)); 
mutex_destroy(&mc->mc_lock); kmem_free(mc, sizeof (metaslab_class_t)); } @@ -442,6 +454,30 @@ metaslab_compare(const void *x1, const void *x2) const metaslab_t *m1 = x1; const metaslab_t *m2 = x2; + int sort1 = 0; + int sort2 = 0; + if (m1->ms_allocator != -1 && m1->ms_primary) + sort1 = 1; + else if (m1->ms_allocator != -1 && !m1->ms_primary) + sort1 = 2; + if (m2->ms_allocator != -1 && m2->ms_primary) + sort2 = 1; + else if (m2->ms_allocator != -1 && !m2->ms_primary) + sort2 = 2; + + /* + * Sort inactive metaslabs first, then primaries, then secondaries. When + * selecting a metaslab to allocate from, an allocator first tries its + * primary, then secondary active metaslab. If it doesn't have active + * metaslabs, or can't allocate from them, it searches for an inactive + * metaslab to activate. If it can't find a suitable one, it will steal + * a primary or secondary metaslab from another allocator. + */ + if (sort1 < sort2) + return (-1); + if (sort1 > sort2) + return (1); + if (m1->ms_weight < m2->ms_weight) return (1); if (m1->ms_weight > m2->ms_weight) @@ -593,12 +629,16 @@ metaslab_group_alloc_update(metaslab_group_t *mg) } metaslab_group_t * -metaslab_group_create(metaslab_class_t *mc, vdev_t *vd) +metaslab_group_create(metaslab_class_t *mc, vdev_t *vd, int allocators) { metaslab_group_t *mg; mg = kmem_zalloc(sizeof (metaslab_group_t), KM_SLEEP); mutex_init(&mg->mg_lock, NULL, MUTEX_DEFAULT, NULL); + mg->mg_primaries = kmem_zalloc(allocators * sizeof (metaslab_t *), + KM_SLEEP); + mg->mg_secondaries = kmem_zalloc(allocators * sizeof (metaslab_t *), + KM_SLEEP); avl_create(&mg->mg_metaslab_tree, metaslab_compare, sizeof (metaslab_t), offsetof(struct metaslab, ms_group_node)); mg->mg_vd = vd; @@ -606,8 +646,17 @@ metaslab_group_create(metaslab_class_t *mc, vdev_t *vd mg->mg_activation_count = 0; mg->mg_initialized = B_FALSE; mg->mg_no_free_space = B_TRUE; - refcount_create_tracked(&mg->mg_alloc_queue_depth); + mg->mg_allocators = allocators; + 
mg->mg_alloc_queue_depth = kmem_zalloc(allocators * sizeof (refcount_t), + KM_SLEEP); + mg->mg_cur_max_alloc_queue_depth = kmem_zalloc(allocators * + sizeof (uint64_t), KM_SLEEP); + for (int i = 0; i < allocators; i++) { + refcount_create_tracked(&mg->mg_alloc_queue_depth[i]); + mg->mg_cur_max_alloc_queue_depth[i] = 0; + } + mg->mg_taskq = taskq_create("metaslab_group_taskq", metaslab_load_pct, minclsyspri, 10, INT_MAX, TASKQ_THREADS_CPU_PCT); @@ -628,8 +677,20 @@ metaslab_group_destroy(metaslab_group_t *mg) taskq_destroy(mg->mg_taskq); avl_destroy(&mg->mg_metaslab_tree); + kmem_free(mg->mg_primaries, mg->mg_allocators * sizeof (metaslab_t *)); + kmem_free(mg->mg_secondaries, mg->mg_allocators * + sizeof (metaslab_t *)); mutex_destroy(&mg->mg_lock); - refcount_destroy(&mg->mg_alloc_queue_depth); + + for (int i = 0; i < mg->mg_allocators; i++) { + refcount_destroy(&mg->mg_alloc_queue_depth[i]); + mg->mg_cur_max_alloc_queue_depth[i] = 0; + } + kmem_free(mg->mg_alloc_queue_depth, mg->mg_allocators * + sizeof (refcount_t)); + kmem_free(mg->mg_cur_max_alloc_queue_depth, mg->mg_allocators * + sizeof (uint64_t)); + kmem_free(mg, sizeof (metaslab_group_t)); } @@ -708,6 +769,22 @@ metaslab_group_passivate(metaslab_group_t *mg) taskq_wait(mg->mg_taskq); spa_config_enter(spa, locks & ~(SCL_ZIO - 1), spa, RW_WRITER); metaslab_group_alloc_update(mg); + for (int i = 0; i < mg->mg_allocators; i++) { + metaslab_t *msp = mg->mg_primaries[i]; + if (msp != NULL) { + mutex_enter(&msp->ms_lock); + metaslab_passivate(msp, + metaslab_weight_from_range_tree(msp)); + mutex_exit(&msp->ms_lock); + } + msp = mg->mg_secondaries[i]; + if (msp != NULL) { + mutex_enter(&msp->ms_lock); + metaslab_passivate(msp, + metaslab_weight_from_range_tree(msp)); + mutex_exit(&msp->ms_lock); + } + } mgprev = mg->mg_prev; mgnext = mg->mg_next; @@ -848,6 +925,17 @@ metaslab_group_remove(metaslab_group_t *mg, metaslab_t } static void +metaslab_group_sort_impl(metaslab_group_t *mg, metaslab_t *msp, uint64_t 
weight) +{ + ASSERT(MUTEX_HELD(&mg->mg_lock)); + ASSERT(msp->ms_group == mg); + avl_remove(&mg->mg_metaslab_tree, msp); + msp->ms_weight = weight; + avl_add(&mg->mg_metaslab_tree, msp); + +} + +static void metaslab_group_sort(metaslab_group_t *mg, metaslab_t *msp, uint64_t weight) { /* @@ -858,10 +946,7 @@ metaslab_group_sort(metaslab_group_t *mg, metaslab_t * ASSERT(MUTEX_HELD(&msp->ms_lock)); mutex_enter(&mg->mg_lock); - ASSERT(msp->ms_group == mg); - avl_remove(&mg->mg_metaslab_tree, msp); - msp->ms_weight = weight; - avl_add(&mg->mg_metaslab_tree, msp); + metaslab_group_sort_impl(mg, msp, weight); mutex_exit(&mg->mg_lock); } @@ -909,7 +994,7 @@ metaslab_group_fragmentation(metaslab_group_t *mg) */ static boolean_t metaslab_group_allocatable(metaslab_group_t *mg, metaslab_group_t *rotor, - uint64_t psize) + uint64_t psize, int allocator) { spa_t *spa = mg->mg_vd->vdev_spa; metaslab_class_t *mc = mg->mg_class; @@ -938,7 +1023,7 @@ metaslab_group_allocatable(metaslab_group_t *mg, metas if (mg->mg_allocatable) { metaslab_group_t *mgp; int64_t qdepth; - uint64_t qmax = mg->mg_max_alloc_queue_depth; + uint64_t qmax = mg->mg_cur_max_alloc_queue_depth[allocator]; if (!mc->mc_alloc_throttle_enabled) return (B_TRUE); @@ -950,7 +1035,7 @@ metaslab_group_allocatable(metaslab_group_t *mg, metas if (mg->mg_no_free_space) return (B_FALSE); - qdepth = refcount_count(&mg->mg_alloc_queue_depth); + qdepth = refcount_count(&mg->mg_alloc_queue_depth[allocator]); /* * If this metaslab group is below its qmax or it's @@ -969,9 +1054,10 @@ metaslab_group_allocatable(metaslab_group_t *mg, metas * groups at the same time when we make this check. 
*/ for (mgp = mg->mg_next; mgp != rotor; mgp = mgp->mg_next) { - qmax = mgp->mg_max_alloc_queue_depth; + qmax = mgp->mg_cur_max_alloc_queue_depth[allocator]; - qdepth = refcount_count(&mgp->mg_alloc_queue_depth); + qdepth = refcount_count( + &mgp->mg_alloc_queue_depth[allocator]); /* * If there is another metaslab group that @@ -1458,6 +1544,8 @@ metaslab_init(metaslab_group_t *mg, uint64_t id, uint6 ms->ms_id = id; ms->ms_start = id << vd->vdev_ms_shift; ms->ms_size = 1ULL << vd->vdev_ms_shift; + ms->ms_allocator = -1; + ms->ms_new = B_TRUE; /* * We only open space map objects that already exist. All others @@ -1553,6 +1641,7 @@ metaslab_fini(metaslab_t *msp) cv_destroy(&msp->ms_load_cv); mutex_destroy(&msp->ms_lock); mutex_destroy(&msp->ms_sync_lock); + ASSERT3U(msp->ms_allocator, ==, -1); kmem_free(msp, sizeof (metaslab_t)); } @@ -1949,19 +2038,59 @@ metaslab_weight(metaslab_t *msp) } static int -metaslab_activate(metaslab_t *msp, uint64_t activation_weight) +metaslab_activate_allocator(metaslab_group_t *mg, metaslab_t *msp, + int allocator, uint64_t activation_weight) { + /* + * If we're activating for the claim code, we don't want to actually + * set the metaslab up for a specific allocator. + */ + if (activation_weight == METASLAB_WEIGHT_CLAIM) + return (0); + metaslab_t **arr = (activation_weight == METASLAB_WEIGHT_PRIMARY ? 
+ mg->mg_primaries : mg->mg_secondaries); + ASSERT(MUTEX_HELD(&msp->ms_lock)); + mutex_enter(&mg->mg_lock); + if (arr[allocator] != NULL) { + mutex_exit(&mg->mg_lock); + return (EEXIST); + } + arr[allocator] = msp; + ASSERT3S(msp->ms_allocator, ==, -1); + msp->ms_allocator = allocator; + msp->ms_primary = (activation_weight == METASLAB_WEIGHT_PRIMARY); + mutex_exit(&mg->mg_lock); + + return (0); +} + +static int +metaslab_activate(metaslab_t *msp, int allocator, uint64_t activation_weight) +{ + ASSERT(MUTEX_HELD(&msp->ms_lock)); + if ((msp->ms_weight & METASLAB_ACTIVE_MASK) == 0) { + int error = 0; metaslab_load_wait(msp); if (!msp->ms_loaded) { - int error = metaslab_load(msp); - if (error) { + if ((error = metaslab_load(msp)) != 0) { metaslab_group_sort(msp->ms_group, msp, 0); return (error); } } + if ((msp->ms_weight & METASLAB_ACTIVE_MASK) != 0) { + /* + * The metaslab was activated for another allocator + * while we were waiting, we should reselect. + */ + return (EBUSY); + } + if ((error = metaslab_activate_allocator(msp->ms_group, msp, + allocator, activation_weight)) != 0) { + return (error); + } msp->ms_activation_weight = msp->ms_weight; metaslab_group_sort(msp->ms_group, msp, @@ -1974,6 +2103,34 @@ metaslab_activate(metaslab_t *msp, uint64_t activation } static void +metaslab_passivate_allocator(metaslab_group_t *mg, metaslab_t *msp, + uint64_t weight) +{ + ASSERT(MUTEX_HELD(&msp->ms_lock)); + if (msp->ms_weight & METASLAB_WEIGHT_CLAIM) { + metaslab_group_sort(mg, msp, weight); + return; + } + + mutex_enter(&mg->mg_lock); + ASSERT3P(msp->ms_group, ==, mg); + if (msp->ms_primary) { + ASSERT3U(0, <=, msp->ms_allocator); + ASSERT3U(msp->ms_allocator, <, mg->mg_allocators); + ASSERT3P(mg->mg_primaries[msp->ms_allocator], ==, msp); + ASSERT(msp->ms_weight & METASLAB_WEIGHT_PRIMARY); + mg->mg_primaries[msp->ms_allocator] = NULL; + } else { + ASSERT(msp->ms_weight & METASLAB_WEIGHT_SECONDARY); + ASSERT3P(mg->mg_secondaries[msp->ms_allocator], ==, msp); + 
mg->mg_secondaries[msp->ms_allocator] = NULL; + } + msp->ms_allocator = -1; + metaslab_group_sort_impl(mg, msp, weight); + mutex_exit(&mg->mg_lock); +} + +static void metaslab_passivate(metaslab_t *msp, uint64_t weight) { uint64_t size = weight & ~METASLAB_WEIGHT_TYPE; @@ -1988,7 +2145,7 @@ metaslab_passivate(metaslab_t *msp, uint64_t weight) ASSERT0(weight & METASLAB_ACTIVE_MASK); msp->ms_activation_weight = 0; - metaslab_group_sort(msp->ms_group, msp, weight); + metaslab_passivate_allocator(msp->ms_group, msp, weight); ASSERT((msp->ms_weight & METASLAB_ACTIVE_MASK) == 0); } @@ -2542,11 +2699,18 @@ metaslab_sync_done(metaslab_t *msp, uint64_t txg) vdev_dirty(vd, VDD_METASLAB, msp, txg + 1); } + if (msp->ms_new) { + msp->ms_new = B_FALSE; + mutex_enter(&mg->mg_lock); + mg->mg_ms_ready++; + mutex_exit(&mg->mg_lock); + } /* * Calculate the new weights before unloading any metaslabs. * This will give us the most accurate weighting. */ - metaslab_group_sort(mg, msp, metaslab_weight(msp)); + metaslab_group_sort(mg, msp, metaslab_weight(msp) | + (msp->ms_weight & METASLAB_ACTIVE_MASK)); /* * If the metaslab is loaded and we've not tried to load or allocate @@ -2558,6 +2722,10 @@ metaslab_sync_done(metaslab_t *msp, uint64_t txg) VERIFY0(range_tree_space( msp->ms_allocating[(txg + t) & TXG_MASK])); } + if (msp->ms_allocator != -1) { + metaslab_passivate(msp, msp->ms_weight & + ~METASLAB_ACTIVE_MASK); + } if (!metaslab_debug_unload) metaslab_unload(msp); @@ -2651,7 +2819,8 @@ metaslab_alloc_trace_fini(void) */ static void metaslab_trace_add(zio_alloc_list_t *zal, metaslab_group_t *mg, - metaslab_t *msp, uint64_t psize, uint32_t dva_id, uint64_t offset) + metaslab_t *msp, uint64_t psize, uint32_t dva_id, uint64_t offset, + int allocator) { if (!metaslab_trace_enabled) return; @@ -2684,6 +2853,7 @@ metaslab_trace_add(zio_alloc_list_t *zal, metaslab_gro mat->mat_dva_id = dva_id; mat->mat_offset = offset; mat->mat_weight = 0; + mat->mat_allocator = allocator; if (msp != NULL) 
mat->mat_weight = msp->ms_weight; @@ -2724,35 +2894,56 @@ metaslab_trace_fini(zio_alloc_list_t *zal) */ static void -metaslab_group_alloc_increment(spa_t *spa, uint64_t vdev, void *tag, int flags) +metaslab_group_alloc_increment(spa_t *spa, uint64_t vdev, void *tag, int flags, + int allocator) { if (!(flags & METASLAB_ASYNC_ALLOC) || - flags & METASLAB_DONT_THROTTLE) + (flags & METASLAB_DONT_THROTTLE)) return; metaslab_group_t *mg = vdev_lookup_top(spa, vdev)->vdev_mg; if (!mg->mg_class->mc_alloc_throttle_enabled) return; - (void) refcount_add(&mg->mg_alloc_queue_depth, tag); + (void) refcount_add(&mg->mg_alloc_queue_depth[allocator], tag); } +static void +metaslab_group_increment_qdepth(metaslab_group_t *mg, int allocator) +{ + uint64_t max = mg->mg_max_alloc_queue_depth; + uint64_t cur = mg->mg_cur_max_alloc_queue_depth[allocator]; + while (cur < max) { + if (atomic_cas_64(&mg->mg_cur_max_alloc_queue_depth[allocator], + cur, cur + 1) == cur) { + atomic_inc_64( + &mg->mg_class->mc_alloc_max_slots[allocator]); + return; + } + cur = mg->mg_cur_max_alloc_queue_depth[allocator]; + } +} + void -metaslab_group_alloc_decrement(spa_t *spa, uint64_t vdev, void *tag, int flags) +metaslab_group_alloc_decrement(spa_t *spa, uint64_t vdev, void *tag, int flags, + int allocator, boolean_t io_complete) { if (!(flags & METASLAB_ASYNC_ALLOC) || - flags & METASLAB_DONT_THROTTLE) + (flags & METASLAB_DONT_THROTTLE)) return; metaslab_group_t *mg = vdev_lookup_top(spa, vdev)->vdev_mg; if (!mg->mg_class->mc_alloc_throttle_enabled) return; - (void) refcount_remove(&mg->mg_alloc_queue_depth, tag); + (void) refcount_remove(&mg->mg_alloc_queue_depth[allocator], tag); + if (io_complete) + metaslab_group_increment_qdepth(mg, allocator); } void -metaslab_group_alloc_verify(spa_t *spa, const blkptr_t *bp, void *tag) +metaslab_group_alloc_verify(spa_t *spa, const blkptr_t *bp, void *tag, + int allocator) { #ifdef ZFS_DEBUG const dva_t *dva = bp->blk_dva; @@ -2761,7 +2952,8 @@ 
metaslab_group_alloc_verify(spa_t *spa, const blkptr_t for (int d = 0; d < ndvas; d++) { uint64_t vdev = DVA_GET_VDEV(&dva[d]); metaslab_group_t *mg = vdev_lookup_top(spa, vdev)->vdev_mg; - VERIFY(refcount_not_held(&mg->mg_alloc_queue_depth, tag)); + VERIFY(refcount_not_held(&mg->mg_alloc_queue_depth[allocator], + tag)); } #endif } @@ -2803,91 +2995,146 @@ metaslab_block_alloc(metaslab_t *msp, uint64_t size, u return (start); } +/* + * Find the metaslab with the highest weight that is less than what we've + * already tried. In the common case, this means that we will examine each + * metaslab at most once. Note that concurrent callers could reorder metaslabs + * by activation/passivation once we have dropped the mg_lock. If a metaslab is + * activated by another thread, and we fail to allocate from the metaslab we + * have selected, we may not try the newly-activated metaslab, and instead + * activate another metaslab. This is not optimal, but generally does not cause + * any problems (a possible exception being if every metaslab is completely full + * except for the the newly-activated metaslab which we fail to examine). + */ +static metaslab_t * +find_valid_metaslab(metaslab_group_t *mg, uint64_t activation_weight, + dva_t *dva, int d, uint64_t min_distance, uint64_t asize, int allocator, + zio_alloc_list_t *zal, metaslab_t *search, boolean_t *was_active) +{ + avl_index_t idx; + avl_tree_t *t = &mg->mg_metaslab_tree; + metaslab_t *msp = avl_find(t, search, &idx); + if (msp == NULL) + msp = avl_nearest(t, idx, AVL_AFTER); + + for (; msp != NULL; msp = AVL_NEXT(t, msp)) { + int i; + if (!metaslab_should_allocate(msp, asize)) { + metaslab_trace_add(zal, mg, msp, asize, d, + TRACE_TOO_SMALL, allocator); + continue; + } + + /* + * If the selected metaslab is condensing, skip it. 
+ */ + if (msp->ms_condensing) + continue; + + *was_active = msp->ms_allocator != -1; + /* + * If we're activating as primary, this is our first allocation + * from this disk, so we don't need to check how close we are. + * If the metaslab under consideration was already active, + * we're getting desperate enough to steal another allocator's + * metaslab, so we still don't care about distances. + */ + if (activation_weight == METASLAB_WEIGHT_PRIMARY || *was_active) + break; + + uint64_t target_distance = min_distance + + (space_map_allocated(msp->ms_sm) != 0 ? 0 : + min_distance >> 1); + + for (i = 0; i < d; i++) { + if (metaslab_distance(msp, &dva[i]) < target_distance) + break; + } + if (i == d) + break; + } + + if (msp != NULL) { + search->ms_weight = msp->ms_weight; + search->ms_start = msp->ms_start + 1; + search->ms_allocator = msp->ms_allocator; + search->ms_primary = msp->ms_primary; + } + return (msp); +} + +/* ARGSUSED */ static uint64_t metaslab_group_alloc_normal(metaslab_group_t *mg, zio_alloc_list_t *zal, - uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d) + uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d, + int allocator) { metaslab_t *msp = NULL; uint64_t offset = -1ULL; uint64_t activation_weight; - uint64_t target_distance; - int i; + boolean_t tertiary = B_FALSE; activation_weight = METASLAB_WEIGHT_PRIMARY; - for (i = 0; i < d; i++) { - if (DVA_GET_VDEV(&dva[i]) == mg->mg_vd->vdev_id) { + for (int i = 0; i < d; i++) { + if (activation_weight == METASLAB_WEIGHT_PRIMARY && + DVA_GET_VDEV(&dva[i]) == mg->mg_vd->vdev_id) { activation_weight = METASLAB_WEIGHT_SECONDARY; + } else if (activation_weight == METASLAB_WEIGHT_SECONDARY && + DVA_GET_VDEV(&dva[i]) == mg->mg_vd->vdev_id) { + tertiary = B_TRUE; break; } } + /* + * If we don't have enough metaslabs active to fill the entire array, we + * just use the 0th slot. 
+ */ + if (mg->mg_ms_ready < mg->mg_allocators * 2) { + tertiary = B_FALSE; + allocator = 0; + } + + ASSERT3U(mg->mg_vd->vdev_ms_count, >=, 2); + metaslab_t *search = kmem_alloc(sizeof (*search), KM_SLEEP); search->ms_weight = UINT64_MAX; search->ms_start = 0; + /* + * At the end of the metaslab tree are the already-active metaslabs, + * first the primaries, then the secondaries. When we resume searching + * through the tree, we need to consider ms_allocator and ms_primary so + * we start in the location right after where we left off, and don't + * accidentally loop forever considering the same metaslabs. + */ + search->ms_allocator = -1; + search->ms_primary = B_TRUE; for (;;) { - boolean_t was_active; - avl_tree_t *t = &mg->mg_metaslab_tree; - avl_index_t idx; + boolean_t was_active = B_FALSE; mutex_enter(&mg->mg_lock); - /* - * Find the metaslab with the highest weight that is less - * than what we've already tried. In the common case, this - * means that we will examine each metaslab at most once. - * Note that concurrent callers could reorder metaslabs - * by activation/passivation once we have dropped the mg_lock. - * If a metaslab is activated by another thread, and we fail - * to allocate from the metaslab we have selected, we may - * not try the newly-activated metaslab, and instead activate - * another metaslab. This is not optimal, but generally - * does not cause any problems (a possible exception being - * if every metaslab is completely full except for the - * the newly-activated metaslab which we fail to examine). - */ - msp = avl_find(t, search, &idx); - if (msp == NULL) - msp = avl_nearest(t, idx, AVL_AFTER); - for (; msp != NULL; msp = AVL_NEXT(t, msp)) { - - if (!metaslab_should_allocate(msp, asize)) { - metaslab_trace_add(zal, mg, msp, asize, d, - TRACE_TOO_SMALL); - continue; - } - - /* - * If the selected metaslab is condensing, skip it. 
- */ - if (msp->ms_condensing) - continue; - - was_active = msp->ms_weight & METASLAB_ACTIVE_MASK; - if (activation_weight == METASLAB_WEIGHT_PRIMARY) - break; - - target_distance = min_distance + - (space_map_allocated(msp->ms_sm) != 0 ? 0 : - min_distance >> 1); - - for (i = 0; i < d; i++) { - if (metaslab_distance(msp, &dva[i]) < - target_distance) - break; - } - if (i == d) - break; + if (activation_weight == METASLAB_WEIGHT_PRIMARY && + mg->mg_primaries[allocator] != NULL) { + msp = mg->mg_primaries[allocator]; + was_active = B_TRUE; + } else if (activation_weight == METASLAB_WEIGHT_SECONDARY && + mg->mg_secondaries[allocator] != NULL && !tertiary) { + msp = mg->mg_secondaries[allocator]; + was_active = B_TRUE; + } else { + msp = find_valid_metaslab(mg, activation_weight, dva, d, + min_distance, asize, allocator, zal, search, + &was_active); } + mutex_exit(&mg->mg_lock); if (msp == NULL) { kmem_free(search, sizeof (*search)); return (-1ULL); } - search->ms_weight = msp->ms_weight; - search->ms_start = msp->ms_start + 1; mutex_enter(&msp->ms_lock); - /* * Ensure that the metaslab we have selected is still * capable of handling our request. It's possible that @@ -2901,18 +3148,32 @@ metaslab_group_alloc_normal(metaslab_group_t *mg, zio_ continue; } - if ((msp->ms_weight & METASLAB_WEIGHT_SECONDARY) && - activation_weight == METASLAB_WEIGHT_PRIMARY) { - metaslab_passivate(msp, - msp->ms_weight & ~METASLAB_ACTIVE_MASK); + /* + * If the metaslab is freshly activated for an allocator that + * isn't the one we're allocating from, or if it's a primary and + * we're seeking a secondary (or vice versa), we go back and + * select a new metaslab. 
+ */ + if (!was_active && (msp->ms_weight & METASLAB_ACTIVE_MASK) && + (msp->ms_allocator != -1) && + (msp->ms_allocator != allocator || ((activation_weight == + METASLAB_WEIGHT_PRIMARY) != msp->ms_primary))) { mutex_exit(&msp->ms_lock); continue; } - if (metaslab_activate(msp, activation_weight) != 0) { + if (msp->ms_weight & METASLAB_WEIGHT_CLAIM) { + metaslab_passivate(msp, msp->ms_weight & + ~METASLAB_WEIGHT_CLAIM); mutex_exit(&msp->ms_lock); continue; } + + if (metaslab_activate(msp, allocator, activation_weight) != 0) { + mutex_exit(&msp->ms_lock); + continue; + } + msp->ms_selected_txg = txg; /* @@ -2925,7 +3186,7 @@ metaslab_group_alloc_normal(metaslab_group_t *mg, zio_ if (!metaslab_should_allocate(msp, asize)) { /* Passivate this metaslab and select a new one. */ metaslab_trace_add(zal, mg, msp, asize, d, - TRACE_TOO_SMALL); + TRACE_TOO_SMALL, allocator); goto next; } @@ -2936,13 +3197,15 @@ metaslab_group_alloc_normal(metaslab_group_t *mg, zio_ */ if (msp->ms_condensing) { metaslab_trace_add(zal, mg, msp, asize, d, - TRACE_CONDENSING); + TRACE_CONDENSING, allocator); + metaslab_passivate(msp, msp->ms_weight & + ~METASLAB_ACTIVE_MASK); mutex_exit(&msp->ms_lock); continue; } offset = metaslab_block_alloc(msp, asize, txg); - metaslab_trace_add(zal, mg, msp, asize, d, offset); + metaslab_trace_add(zal, mg, msp, asize, d, offset, allocator); if (offset != -1ULL) { /* Proactively passivate the metaslab, if needed */ @@ -2998,19 +3261,20 @@ next: static uint64_t metaslab_group_alloc(metaslab_group_t *mg, zio_alloc_list_t *zal, - uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d) + uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d, + int allocator) { uint64_t offset; ASSERT(mg->mg_initialized); offset = metaslab_group_alloc_normal(mg, zal, asize, txg, - min_distance, dva, d); + min_distance, dva, d, allocator); mutex_enter(&mg->mg_lock); if (offset == -1ULL) { mg->mg_failed_allocations++; metaslab_trace_add(zal, mg, 
NULL, asize, d, - TRACE_GROUP_FAILURE); + TRACE_GROUP_FAILURE, allocator); if (asize == SPA_GANGBLOCKSIZE) { /* * This metaslab group was unable to allocate @@ -3045,7 +3309,7 @@ int ditto_same_vdev_distance_shift = 3; int metaslab_alloc_dva(spa_t *spa, metaslab_class_t *mc, uint64_t psize, dva_t *dva, int d, dva_t *hintdva, uint64_t txg, int flags, - zio_alloc_list_t *zal) + zio_alloc_list_t *zal, int allocator) { metaslab_group_t *mg, *rotor; vdev_t *vd; @@ -3057,7 +3321,8 @@ metaslab_alloc_dva(spa_t *spa, metaslab_class_t *mc, u * For testing, make some blocks above a certain size be gang blocks. */ if (psize >= metaslab_force_ganging && (ddi_get_lbolt() & 3) == 0) { - metaslab_trace_add(zal, NULL, NULL, psize, d, TRACE_FORCE_GANG); + metaslab_trace_add(zal, NULL, NULL, psize, d, TRACE_FORCE_GANG, + allocator); return (SET_ERROR(ENOSPC)); } @@ -3143,12 +3408,12 @@ top: */ if (allocatable && !GANG_ALLOCATION(flags) && !try_hard) { allocatable = metaslab_group_allocatable(mg, rotor, - psize); + psize, allocator); } if (!allocatable) { metaslab_trace_add(zal, mg, NULL, psize, d, - TRACE_NOT_ALLOCATABLE); + TRACE_NOT_ALLOCATABLE, allocator); goto next; } @@ -3163,7 +3428,7 @@ top: vd->vdev_state < VDEV_STATE_HEALTHY) && d == 0 && !try_hard && vd->vdev_children == 0) { metaslab_trace_add(zal, mg, NULL, psize, d, - TRACE_VDEV_ERROR); + TRACE_VDEV_ERROR, allocator); goto next; } @@ -3187,7 +3452,7 @@ top: ASSERT(P2PHASE(asize, 1ULL << vd->vdev_ashift) == 0); uint64_t offset = metaslab_group_alloc(mg, zal, asize, txg, - distance, dva, d); + distance, dva, d, allocator); if (offset != -1ULL) { /* @@ -3250,7 +3515,7 @@ next: bzero(&dva[d], sizeof (dva_t)); - metaslab_trace_add(zal, rotor, NULL, psize, d, TRACE_ENOSPC); + metaslab_trace_add(zal, rotor, NULL, psize, d, TRACE_ENOSPC, allocator); return (SET_ERROR(ENOSPC)); } @@ -3551,18 +3816,20 @@ metaslab_free_dva(spa_t *spa, const dva_t *dva, boolea * the reservation. 
*/ boolean_t -metaslab_class_throttle_reserve(metaslab_class_t *mc, int slots, zio_t *zio, - int flags) +metaslab_class_throttle_reserve(metaslab_class_t *mc, int slots, int allocator, + zio_t *zio, int flags) { uint64_t available_slots = 0; boolean_t slot_reserved = B_FALSE; + uint64_t max = mc->mc_alloc_max_slots[allocator]; ASSERT(mc->mc_alloc_throttle_enabled); mutex_enter(&mc->mc_lock); - uint64_t reserved_slots = refcount_count(&mc->mc_alloc_slots); - if (reserved_slots < mc->mc_alloc_max_slots) - available_slots = mc->mc_alloc_max_slots - reserved_slots; + uint64_t reserved_slots = + refcount_count(&mc->mc_alloc_slots[allocator]); + if (reserved_slots < max) + available_slots = max - reserved_slots; if (slots <= available_slots || GANG_ALLOCATION(flags)) { /* @@ -3570,7 +3837,9 @@ metaslab_class_throttle_reserve(metaslab_class_t *mc, * them individually when an I/O completes. */ for (int d = 0; d < slots; d++) { - reserved_slots = refcount_add(&mc->mc_alloc_slots, zio); + reserved_slots = + refcount_add(&mc->mc_alloc_slots[allocator], + zio); } zio->io_flags |= ZIO_FLAG_IO_ALLOCATING; slot_reserved = B_TRUE; @@ -3581,12 +3850,14 @@ metaslab_class_throttle_reserve(metaslab_class_t *mc, } void -metaslab_class_throttle_unreserve(metaslab_class_t *mc, int slots, zio_t *zio) +metaslab_class_throttle_unreserve(metaslab_class_t *mc, int slots, + int allocator, zio_t *zio) { ASSERT(mc->mc_alloc_throttle_enabled); mutex_enter(&mc->mc_lock); for (int d = 0; d < slots; d++) { - (void) refcount_remove(&mc->mc_alloc_slots, zio); + (void) refcount_remove(&mc->mc_alloc_slots[allocator], + zio); } mutex_exit(&mc->mc_lock); } @@ -3608,7 +3879,13 @@ metaslab_claim_concrete(vdev_t *vd, uint64_t offset, u mutex_enter(&msp->ms_lock); if ((txg != 0 && spa_writeable(spa)) || !msp->ms_loaded) - error = metaslab_activate(msp, METASLAB_WEIGHT_SECONDARY); + error = metaslab_activate(msp, 0, METASLAB_WEIGHT_CLAIM); + /* + * No need to fail in that case; someone else has activated the + 
* metaslab, but that doesn't preclude us from using it. + */ + if (error == EBUSY) + error = 0; if (error == 0 && !range_tree_contains(msp->ms_allocatable, offset, size)) @@ -3713,7 +3990,7 @@ metaslab_claim_dva(spa_t *spa, const dva_t *dva, uint6 int metaslab_alloc(spa_t *spa, metaslab_class_t *mc, uint64_t psize, blkptr_t *bp, int ndvas, uint64_t txg, blkptr_t *hintbp, int flags, - zio_alloc_list_t *zal, zio_t *zio) + zio_alloc_list_t *zal, zio_t *zio, int allocator) { dva_t *dva = bp->blk_dva; dva_t *hintdva = hintbp->blk_dva; @@ -3736,12 +4013,13 @@ metaslab_alloc(spa_t *spa, metaslab_class_t *mc, uint6 for (int d = 0; d < ndvas; d++) { error = metaslab_alloc_dva(spa, mc, psize, dva, d, hintdva, - txg, flags, zal); + txg, flags, zal, allocator); if (error != 0) { for (d--; d >= 0; d--) { metaslab_unalloc_dva(spa, &dva[d], txg); metaslab_group_alloc_decrement(spa, - DVA_GET_VDEV(&dva[d]), zio, flags); + DVA_GET_VDEV(&dva[d]), zio, flags, + allocator, B_FALSE); bzero(&dva[d], sizeof (dva_t)); } spa_config_exit(spa, SCL_ALLOC, FTAG); @@ -3752,7 +4030,7 @@ metaslab_alloc(spa_t *spa, metaslab_class_t *mc, uint6 * based on the newly allocated dva. 
*/ metaslab_group_alloc_increment(spa, - DVA_GET_VDEV(&dva[d]), zio, flags); + DVA_GET_VDEV(&dva[d]), zio, flags, allocator); } } Modified: vendor-sys/illumos/dist/uts/common/fs/zfs/spa.c ============================================================================== --- vendor-sys/illumos/dist/uts/common/fs/zfs/spa.c Mon Jul 30 23:47:38 2018 (r336947) +++ vendor-sys/illumos/dist/uts/common/fs/zfs/spa.c Mon Jul 30 23:53:25 2018 (r336948) @@ -7398,9 +7398,11 @@ spa_sync(spa_t *spa, uint64_t txg) spa->spa_syncing_txg = txg; spa->spa_sync_pass = 0; - mutex_enter(&spa->spa_alloc_lock); - VERIFY0(avl_numnodes(&spa->spa_alloc_tree)); - mutex_exit(&spa->spa_alloc_lock); + for (int i = 0; i < spa->spa_alloc_count; i++) { + mutex_enter(&spa->spa_alloc_locks[i]); + VERIFY0(avl_numnodes(&spa->spa_alloc_trees[i])); + mutex_exit(&spa->spa_alloc_locks[i]); + } /* * If there are any pending vdev state changes, convert them @@ -7459,7 +7461,7 @@ spa_sync(spa_t *spa, uint64_t txg) * The max queue depth will not change in the middle of syncing * out this txg. */ - uint64_t queue_depth_total = 0; + uint64_t slots_per_allocator = 0; for (int c = 0; c < rvd->vdev_children; c++) { vdev_t *tvd = rvd->vdev_child[c]; metaslab_group_t *mg = tvd->vdev_mg; @@ -7473,18 +7475,23 @@ spa_sync(spa_t *spa, uint64_t txg) * allocations look at mg_max_alloc_queue_depth, and async * allocations all happen from spa_sync(). 
*/ - ASSERT0(refcount_count(&mg->mg_alloc_queue_depth)); + for (int i = 0; i < spa->spa_alloc_count; i++) + ASSERT0(refcount_count(&(mg->mg_alloc_queue_depth[i]))); mg->mg_max_alloc_queue_depth = max_queue_depth; - queue_depth_total += mg->mg_max_alloc_queue_depth; + + for (int i = 0; i < spa->spa_alloc_count; i++) { + mg->mg_cur_max_alloc_queue_depth[i] = + zfs_vdev_def_queue_depth; + } + slots_per_allocator += zfs_vdev_def_queue_depth; } metaslab_class_t *mc = spa_normal_class(spa); - ASSERT0(refcount_count(&mc->mc_alloc_slots)); - mc->mc_alloc_max_slots = queue_depth_total; + for (int i = 0; i < spa->spa_alloc_count; i++) { + ASSERT0(refcount_count(&mc->mc_alloc_slots[i])); + mc->mc_alloc_max_slots[i] = slots_per_allocator; + } mc->mc_alloc_throttle_enabled = zio_dva_throttle_enabled; - ASSERT3U(mc->mc_alloc_max_slots, <=, - max_queue_depth * rvd->vdev_children); - for (int c = 0; c < rvd->vdev_children; c++) { vdev_t *vd = rvd->vdev_child[c]; vdev_indirect_state_sync_verify(vd); @@ -7661,9 +7668,11 @@ spa_sync(spa_t *spa, uint64_t txg) dsl_pool_sync_done(dp, txg); - mutex_enter(&spa->spa_alloc_lock); - VERIFY0(avl_numnodes(&spa->spa_alloc_tree)); - mutex_exit(&spa->spa_alloc_lock); + for (int i = 0; i < spa->spa_alloc_count; i++) { *** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
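
Note on the metaslab_compare() hunk above: the comment there says inactive metaslabs sort first, then active primaries, then active secondaries, with heavier metaslabs first inside each tier. The following is a minimal userland sketch of that ordering, not the kernel code. The sketch_ms_t type and every sketch_* name are hypothetical stand-ins for metaslab_t and its ms_weight, ms_start, ms_allocator and ms_primary fields, qsort() stands in for the AVL tree, and the final tie-break on the start offset is an assumption, since that part of the function is not visible in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the metaslab_t fields the comparator reads. */
typedef struct sketch_ms {
	uint64_t weight;	/* plays the role of ms_weight */
	uint64_t start;		/* plays the role of ms_start */
	int allocator;		/* plays the role of ms_allocator; -1 = inactive */
	bool primary;		/* plays the role of ms_primary */
} sketch_ms_t;

/* Tier of a metaslab: 0 = inactive, 1 = active primary, 2 = active secondary. */
static int
sketch_sort_class(const sketch_ms_t *m)
{
	if (m->allocator == -1)
		return (0);
	return (m->primary ? 1 : 2);
}

/* Comparator mirroring the order described in the metaslab_compare() comment. */
static int
sketch_ms_compare(const void *x1, const void *x2)
{
	const sketch_ms_t *m1 = x1;
	const sketch_ms_t *m2 = x2;
	int s1 = sketch_sort_class(m1);
	int s2 = sketch_sort_class(m2);

	/* Lower tier sorts first: inactive, then primaries, then secondaries. */
	if (s1 != s2)
		return (s1 < s2 ? -1 : 1);

	/* Within a tier, the heavier metaslab sorts first. */
	if (m1->weight != m2->weight)
		return (m1->weight < m2->weight ? 1 : -1);

	/* Assumption: equal weights are ordered by ascending start offset. */
	if (m1->start != m2->start)
		return (m1->start < m2->start ? -1 : 1);
	return (0);
}

/* Sort an array of sketch metaslabs in place; qsort() replaces the AVL tree. */
static void
sketch_ms_sort(sketch_ms_t *v, size_t n)
{
	assert(v != NULL || n == 0);
	if (n > 1)
		qsort(v, n, sizeof (v[0]), sketch_ms_compare);
}
```

Calling sketch_ms_sort() on a mixed array places every entry whose allocator is -1 ahead of the active ones. That matches the search order the commit describes: an allocator that cannot use its own primary or secondary walks the inactive metaslabs first, and only reaches metaslabs owned by other allocators at the tail of the ordering.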