From owner-svn-src-head@freebsd.org Sat Sep 3 10:04:39 2016 Return-Path: Delivered-To: svn-src-head@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id D3DF9BCEC03; Sat, 3 Sep 2016 10:04:39 +0000 (UTC) (envelope-from mav@FreeBSD.org) Received: from repo.freebsd.org (repo.freebsd.org [IPv6:2610:1c1:1:6068::e6a:0]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 786E0128; Sat, 3 Sep 2016 10:04:39 +0000 (UTC) (envelope-from mav@FreeBSD.org) Received: from repo.freebsd.org ([127.0.1.37]) by repo.freebsd.org (8.15.2/8.15.2) with ESMTP id u83A4cwT097116; Sat, 3 Sep 2016 10:04:38 GMT (envelope-from mav@FreeBSD.org) Received: (from mav@localhost) by repo.freebsd.org (8.15.2/8.15.2/Submit) id u83A4bI0097105; Sat, 3 Sep 2016 10:04:37 GMT (envelope-from mav@FreeBSD.org) Message-Id: <201609031004.u83A4bI0097105@repo.freebsd.org> X-Authentication-Warning: repo.freebsd.org: mav set sender to mav@FreeBSD.org using -f From: Alexander Motin Date: Sat, 3 Sep 2016 10:04:37 +0000 (UTC) To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: svn commit: r305331 - in head/sys/cddl/contrib/opensolaris/uts/common: fs/zfs fs/zfs/sys sys/fs X-SVN-Group: head MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-BeenThere: svn-src-head@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: SVN commit messages for the src tree for head/-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 03 Sep 2016 10:04:40 -0000 Author: mav Date: Sat Sep 3 10:04:37 2016 New Revision: 305331 URL: https://svnweb.freebsd.org/changeset/base/305331 Log: MFV r304155: 7090 zfs should improve allocation order and throttle allocations illumos/illumos-gate@0f7643c7376dd69a08acbfc9d1d7d548b10c846a https://github.com/illumos/illumos-gate/commit/0f7643c7376dd69a08acbfc9d1d7d548b 10c846a https://www.illumos.org/issues/7090 When write I/Os are issued, they are issued in block order but the ZIO pipelin e will drive them asynchronously through the allocation stage which can result i n blocks being allocated out-of-order. It would be nice to preserve as much of the logical order as possible. In addition, the allocations are equally scattered across all top-level VDEVs but not all top-level VDEVs are created equally. The pipeline should be able t o detect devices that are more capable of handling allocations and should allocate more blocks to those devices. This allows for dynamic allocation distribution when devices are imbalanced as fuller devices will tend to be slower than empty devices. The change includes a new pool-wide allocation queue which would throttle and order allocations in the ZIO pipeline. The queue would be ordered by issued time and offset and would provide an initial amount of allocation of work to each top-level vdev. The allocation logic utilizes a reservation system to reserve allocations that will be performed by the allocator. Once an allocatio n is successfully completed it's scheduled on a given top-level vdev. Each top- level vdev maintains a maximum number of allocations that it can handle (mg_alloc_queue_depth). The pool-wide reserved allocations (top-levels * mg_alloc_queue_depth) are distributed across the top-level vdevs metaslab groups and round robin across all eligible metaslab groups to distribute the work. As top-levels complete their work, they receive additional work from the pool-wide allocation queue until the allocation queue is emptied. Reviewed by: Adam Leventhal Reviewed by: Alex Reece Reviewed by: Christopher Siden Reviewed by: Dan Kimmel Reviewed by: Matthew Ahrens Reviewed by: Paul Dagnelie Reviewed by: Prakash Surya Reviewed by: Sebastien Roy Approved by: Robert Mustacchi Author: George Wilson Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/refcount.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab.h head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/refcount.h head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa_impl.h head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_impl.h head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_cache.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c head/sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h Directory Properties: head/sys/cddl/contrib/opensolaris/ (props changed) Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c ============================================================================== --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c Sat Sep 3 09:24:33 2016 (r305330) +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c Sat Sep 3 10:04:37 2016 (r305331) @@ -38,17 +38,8 @@ SYSCTL_DECL(_vfs_zfs); SYSCTL_NODE(_vfs_zfs, OID_AUTO, metaslab, CTLFLAG_RW, 0, "ZFS metaslab"); -/* - * Allow allocations to switch to gang blocks quickly. We do this to - * avoid having to load lots of space_maps in a given txg. There are, - * however, some cases where we want to avoid "fast" ganging and instead - * we want to do an exhaustive search of all metaslabs on this device. - * Currently we don't allow any gang, slog, or dump device related allocations - * to "fast" gang. - */ -#define CAN_FASTGANG(flags) \ - (!((flags) & (METASLAB_GANG_CHILD | METASLAB_GANG_HEADER | \ - METASLAB_GANG_AVOID))) +#define GANG_ALLOCATION(flags) \ + ((flags) & (METASLAB_GANG_CHILD | METASLAB_GANG_HEADER)) #define METASLAB_WEIGHT_PRIMARY (1ULL << 63) #define METASLAB_WEIGHT_SECONDARY (1ULL << 62) @@ -256,6 +247,8 @@ metaslab_class_create(spa_t *spa, metasl mc->mc_spa = spa; mc->mc_rotor = NULL; mc->mc_ops = ops; + mutex_init(&mc->mc_lock, NULL, MUTEX_DEFAULT, NULL); + refcount_create_tracked(&mc->mc_alloc_slots); return (mc); } @@ -269,6 +262,8 @@ metaslab_class_destroy(metaslab_class_t ASSERT(mc->mc_space == 0); ASSERT(mc->mc_dspace == 0); + refcount_destroy(&mc->mc_alloc_slots); + mutex_destroy(&mc->mc_lock); kmem_free(mc, sizeof (metaslab_class_t)); } @@ -512,9 +507,10 @@ metaslab_compare(const void *x1, const v /* * Update the allocatable flag and the metaslab group's capacity. * The allocatable flag is set to true if the capacity is below - * the zfs_mg_noalloc_threshold. If a metaslab group transitions - * from allocatable to non-allocatable or vice versa then the metaslab - * group's class is updated to reflect the transition. + * the zfs_mg_noalloc_threshold or has a fragmentation value that is + * greater than zfs_mg_fragmentation_threshold. If a metaslab group + * transitions from allocatable to non-allocatable or vice versa then the + * metaslab group's class is updated to reflect the transition. */ static void metaslab_group_alloc_update(metaslab_group_t *mg) @@ -523,22 +519,45 @@ metaslab_group_alloc_update(metaslab_gro metaslab_class_t *mc = mg->mg_class; vdev_stat_t *vs = &vd->vdev_stat; boolean_t was_allocatable; + boolean_t was_initialized; ASSERT(vd == vd->vdev_top); mutex_enter(&mg->mg_lock); was_allocatable = mg->mg_allocatable; + was_initialized = mg->mg_initialized; mg->mg_free_capacity = ((vs->vs_space - vs->vs_alloc) * 100) / (vs->vs_space + 1); + mutex_enter(&mc->mc_lock); + + /* + * If the metaslab group was just added then it won't + * have any space until we finish syncing out this txg. + * At that point we will consider it initialized and available + * for allocations. We also don't consider non-activated + * metaslab groups (e.g. vdevs that are in the middle of being removed) + * to be initialized, because they can't be used for allocation. + */ + mg->mg_initialized = metaslab_group_initialized(mg); + if (!was_initialized && mg->mg_initialized) { + mc->mc_groups++; + } else if (was_initialized && !mg->mg_initialized) { + ASSERT3U(mc->mc_groups, >, 0); + mc->mc_groups--; + } + if (mg->mg_initialized) + mg->mg_no_free_space = B_FALSE; + /* * A metaslab group is considered allocatable if it has plenty * of free space or is not heavily fragmented. We only take * fragmentation into account if the metaslab group has a valid * fragmentation metric (i.e. a value between 0 and 100). */ - mg->mg_allocatable = (mg->mg_free_capacity > zfs_mg_noalloc_threshold && + mg->mg_allocatable = (mg->mg_activation_count > 0 && + mg->mg_free_capacity > zfs_mg_noalloc_threshold && (mg->mg_fragmentation == ZFS_FRAG_INVALID || mg->mg_fragmentation <= zfs_mg_fragmentation_threshold)); @@ -561,6 +580,7 @@ metaslab_group_alloc_update(metaslab_gro mc->mc_alloc_groups--; else if (!was_allocatable && mg->mg_allocatable) mc->mc_alloc_groups++; + mutex_exit(&mc->mc_lock); mutex_exit(&mg->mg_lock); } @@ -577,6 +597,9 @@ metaslab_group_create(metaslab_class_t * mg->mg_vd = vd; mg->mg_class = mc; mg->mg_activation_count = 0; + mg->mg_initialized = B_FALSE; + mg->mg_no_free_space = B_TRUE; + refcount_create_tracked(&mg->mg_alloc_queue_depth); mg->mg_taskq = taskq_create("metaslab_group_taskq", metaslab_load_pct, minclsyspri, 10, INT_MAX, TASKQ_THREADS_CPU_PCT); @@ -599,6 +622,7 @@ metaslab_group_destroy(metaslab_group_t taskq_destroy(mg->mg_taskq); avl_destroy(&mg->mg_metaslab_tree); mutex_destroy(&mg->mg_lock); + refcount_destroy(&mg->mg_alloc_queue_depth); kmem_free(mg, sizeof (metaslab_group_t)); } @@ -670,6 +694,15 @@ metaslab_group_passivate(metaslab_group_ metaslab_class_minblocksize_update(mc); } +boolean_t +metaslab_group_initialized(metaslab_group_t *mg) +{ + vdev_t *vd = mg->mg_vd; + vdev_stat_t *vs = &vd->vdev_stat; + + return (vs->vs_space != 0 && mg->mg_activation_count > 0); +} + uint64_t metaslab_group_get_space(metaslab_group_t *mg) { @@ -839,30 +872,97 @@ metaslab_group_fragmentation(metaslab_gr * group should avoid allocations if its free capacity is less than the * zfs_mg_noalloc_threshold or its fragmentation metric is greater than * zfs_mg_fragmentation_threshold and there is at least one metaslab group - * that can still handle allocations. + * that can still handle allocations. If the allocation throttle is enabled + * then we skip allocations to devices that have reached their maximum + * allocation queue depth unless the selected metaslab group is the only + * eligible group remaining. */ static boolean_t -metaslab_group_allocatable(metaslab_group_t *mg) +metaslab_group_allocatable(metaslab_group_t *mg, metaslab_group_t *rotor, + uint64_t psize) { - vdev_t *vd = mg->mg_vd; - spa_t *spa = vd->vdev_spa; + spa_t *spa = mg->mg_vd->vdev_spa; metaslab_class_t *mc = mg->mg_class; /* - * We use two key metrics to determine if a metaslab group is - * considered allocatable -- free space and fragmentation. If - * the free space is greater than the free space threshold and - * the fragmentation is less than the fragmentation threshold then - * consider the group allocatable. There are two case when we will - * not consider these key metrics. The first is if the group is - * associated with a slog device and the second is if all groups - * in this metaslab class have already been consider ineligible + * We can only consider skipping this metaslab group if it's + * in the normal metaslab class and there are other metaslab + * groups to select from. Otherwise, we always consider it eligible * for allocations. */ - return ((mg->mg_free_capacity > zfs_mg_noalloc_threshold && - (mg->mg_fragmentation == ZFS_FRAG_INVALID || - mg->mg_fragmentation <= zfs_mg_fragmentation_threshold)) || - mc != spa_normal_class(spa) || mc->mc_alloc_groups == 0); + if (mc != spa_normal_class(spa) || mc->mc_groups <= 1) + return (B_TRUE); + + /* + * If the metaslab group's mg_allocatable flag is set (see comments + * in metaslab_group_alloc_update() for more information) and + * the allocation throttle is disabled then allow allocations to this + * device. However, if the allocation throttle is enabled then + * check if we have reached our allocation limit (mg_alloc_queue_depth) + * to determine if we should allow allocations to this metaslab group. + * If all metaslab groups are no longer considered allocatable + * (mc_alloc_groups == 0) or we're trying to allocate the smallest + * gang block size then we allow allocations on this metaslab group + * regardless of the mg_allocatable or throttle settings. + */ + if (mg->mg_allocatable) { + metaslab_group_t *mgp; + int64_t qdepth; + uint64_t qmax = mg->mg_max_alloc_queue_depth; + + if (!mc->mc_alloc_throttle_enabled) + return (B_TRUE); + + /* + * If this metaslab group does not have any free space, then + * there is no point in looking further. + */ + if (mg->mg_no_free_space) + return (B_FALSE); + + qdepth = refcount_count(&mg->mg_alloc_queue_depth); + + /* + * If this metaslab group is below its qmax or it's + * the only allocatable metasable group, then attempt + * to allocate from it. + */ + if (qdepth < qmax || mc->mc_alloc_groups == 1) + return (B_TRUE); + ASSERT3U(mc->mc_alloc_groups, >, 1); + + /* + * Since this metaslab group is at or over its qmax, we + * need to determine if there are metaslab groups after this + * one that might be able to handle this allocation. This is + * racy since we can't hold the locks for all metaslab + * groups at the same time when we make this check. + */ + for (mgp = mg->mg_next; mgp != rotor; mgp = mgp->mg_next) { + qmax = mgp->mg_max_alloc_queue_depth; + + qdepth = refcount_count(&mgp->mg_alloc_queue_depth); + + /* + * If there is another metaslab group that + * might be able to handle the allocation, then + * we return false so that we skip this group. + */ + if (qdepth < qmax && !mgp->mg_no_free_space) + return (B_FALSE); + } + + /* + * We didn't find another group to handle the allocation + * so we can't skip this metaslab group even though + * we are at or over our qmax. + */ + return (B_TRUE); + + } else if (mc->mc_alloc_groups == 0 || psize == SPA_MINBLOCKSIZE) { + return (B_TRUE); + } + return (B_FALSE); } /* @@ -2130,8 +2230,57 @@ metaslab_distance(metaslab_t *msp, dva_t return (0); } +/* + * ========================================================================== + * Metaslab block operations + * ========================================================================== + */ + +static void +metaslab_group_alloc_increment(spa_t *spa, uint64_t vdev, void *tag, int flags) +{ + if (!(flags & METASLAB_ASYNC_ALLOC) || + flags & METASLAB_DONT_THROTTLE) + return; + + metaslab_group_t *mg = vdev_lookup_top(spa, vdev)->vdev_mg; + if (!mg->mg_class->mc_alloc_throttle_enabled) + return; + + (void) refcount_add(&mg->mg_alloc_queue_depth, tag); +} + +void +metaslab_group_alloc_decrement(spa_t *spa, uint64_t vdev, void *tag, int flags) +{ + if (!(flags & METASLAB_ASYNC_ALLOC) || + flags & METASLAB_DONT_THROTTLE) + return; + + metaslab_group_t *mg = vdev_lookup_top(spa, vdev)->vdev_mg; + if (!mg->mg_class->mc_alloc_throttle_enabled) + return; + + (void) refcount_remove(&mg->mg_alloc_queue_depth, tag); +} + +void +metaslab_group_alloc_verify(spa_t *spa, const blkptr_t *bp, void *tag) +{ +#ifdef ZFS_DEBUG + const dva_t *dva = bp->blk_dva; + int ndvas = BP_GET_NDVAS(bp); + + for (int d = 0; d < ndvas; d++) { + uint64_t vdev = DVA_GET_VDEV(&dva[d]); + metaslab_group_t *mg = vdev_lookup_top(spa, vdev)->vdev_mg; + VERIFY(refcount_not_held(&mg->mg_alloc_queue_depth, tag)); + } +#endif +} + static uint64_t -metaslab_group_alloc(metaslab_group_t *mg, uint64_t psize, uint64_t asize, +metaslab_group_alloc(metaslab_group_t *mg, uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d) { spa_t *spa = mg->mg_vd->vdev_spa; @@ -2158,10 +2307,10 @@ metaslab_group_alloc(metaslab_group_t *m if (msp->ms_weight < asize) { spa_dbgmsg(spa, "%s: failed to meet weight " "requirement: vdev %llu, txg %llu, mg %p, " - "msp %p, psize %llu, asize %llu, " + "msp %p, asize %llu, " "weight %llu", spa_name(spa), mg->mg_vd->vdev_id, txg, - mg, msp, psize, asize, msp->ms_weight); + mg, msp, asize, msp->ms_weight); mutex_exit(&mg->mg_lock); return (-1ULL); } @@ -2243,7 +2392,6 @@ metaslab_group_alloc(metaslab_group_t *m msp->ms_access_txg = txg + metaslab_unload_delay; mutex_exit(&msp->ms_lock); - return (offset); } @@ -2260,7 +2408,6 @@ metaslab_alloc_dva(spa_t *spa, metaslab_ int all_zero; int zio_lock = B_FALSE; boolean_t allocatable; - uint64_t offset = -1ULL; uint64_t asize; uint64_t distance; @@ -2330,7 +2477,6 @@ top: all_zero = B_TRUE; do { ASSERT(mg->mg_activation_count == 1); - vd = mg->mg_vd; /* @@ -2346,24 +2492,23 @@ top: /* * Determine if the selected metaslab group is eligible - * for allocations. If we're ganging or have requested - * an allocation for the smallest gang block size - * then we don't want to avoid allocating to the this - * metaslab group. If we're in this condition we should - * try to allocate from any device possible so that we - * don't inadvertently return ENOSPC and suspend the pool + * for allocations. If we're ganging then don't allow + * this metaslab group to skip allocations since that would + * inadvertently return ENOSPC and suspend the pool * even though space is still available. */ - if (allocatable && CAN_FASTGANG(flags) && - psize > SPA_GANGBLOCKSIZE) - allocatable = metaslab_group_allocatable(mg); + if (allocatable && !GANG_ALLOCATION(flags) && !zio_lock) { + allocatable = metaslab_group_allocatable(mg, rotor, + psize); + } if (!allocatable) goto next; + ASSERT(mg->mg_initialized); + /* - * Avoid writing single-copy data to a failing vdev - * unless the user instructs us that it is okay. + * Avoid writing single-copy data to a failing vdev. */ if ((vd->vdev_stat.vs_write_errors > 0 || vd->vdev_state < VDEV_STATE_HEALTHY) && @@ -2383,8 +2528,32 @@ top: asize = vdev_psize_to_asize(vd, psize); ASSERT(P2PHASE(asize, 1ULL << vd->vdev_ashift) == 0); - offset = metaslab_group_alloc(mg, psize, asize, txg, distance, - dva, d); + uint64_t offset = metaslab_group_alloc(mg, asize, txg, + distance, dva, d); + + mutex_enter(&mg->mg_lock); + if (offset == -1ULL) { + mg->mg_failed_allocations++; + if (asize == SPA_GANGBLOCKSIZE) { + /* + * This metaslab group was unable to allocate + * the minimum gang block size so it must be + * out of space. We must notify the allocation + * throttle to start skipping allocation + * attempts to this metaslab group until more + * space becomes available. + * + * Note: this failure cannot be caused by the + * allocation throttle since the allocation + * throttle is only responsible for skipping + * devices and not failing block allocations. + */ + mg->mg_no_free_space = B_TRUE; + } + } + mg->mg_allocations++; + mutex_exit(&mg->mg_lock); + if (offset != -1ULL) { /* * If we've just selected this metaslab group, @@ -2565,9 +2734,57 @@ metaslab_claim_dva(spa_t *spa, const dva return (0); } +/* + * Reserve some allocation slots. The reservation system must be called + * before we call into the allocator. If there aren't any available slots + * then the I/O will be throttled until an I/O completes and its slots are + * freed up. The function returns true if it was successful in placing + * the reservation. + */ +boolean_t +metaslab_class_throttle_reserve(metaslab_class_t *mc, int slots, zio_t *zio, + int flags) +{ + uint64_t available_slots = 0; + boolean_t slot_reserved = B_FALSE; + + ASSERT(mc->mc_alloc_throttle_enabled); + mutex_enter(&mc->mc_lock); + + uint64_t reserved_slots = refcount_count(&mc->mc_alloc_slots); + if (reserved_slots < mc->mc_alloc_max_slots) + available_slots = mc->mc_alloc_max_slots - reserved_slots; + + if (slots <= available_slots || GANG_ALLOCATION(flags)) { + /* + * We reserve the slots individually so that we can unreserve + * them individually when an I/O completes. + */ + for (int d = 0; d < slots; d++) { + reserved_slots = refcount_add(&mc->mc_alloc_slots, zio); + } + zio->io_flags |= ZIO_FLAG_IO_ALLOCATING; + slot_reserved = B_TRUE; + } + + mutex_exit(&mc->mc_lock); + return (slot_reserved); +} + +void +metaslab_class_throttle_unreserve(metaslab_class_t *mc, int slots, zio_t *zio) +{ + ASSERT(mc->mc_alloc_throttle_enabled); + mutex_enter(&mc->mc_lock); + for (int d = 0; d < slots; d++) { + (void) refcount_remove(&mc->mc_alloc_slots, zio); + } + mutex_exit(&mc->mc_lock); +} + int metaslab_alloc(spa_t *spa, metaslab_class_t *mc, uint64_t psize, blkptr_t *bp, - int ndvas, uint64_t txg, blkptr_t *hintbp, int flags) + int ndvas, uint64_t txg, blkptr_t *hintbp, int flags, zio_t *zio) { dva_t *dva = bp->blk_dva; dva_t *hintdva = hintbp->blk_dva; @@ -2593,11 +2810,21 @@ metaslab_alloc(spa_t *spa, metaslab_clas if (error != 0) { for (d--; d >= 0; d--) { metaslab_free_dva(spa, &dva[d], txg, B_TRUE); + metaslab_group_alloc_decrement(spa, + DVA_GET_VDEV(&dva[d]), zio, flags); bzero(&dva[d], sizeof (dva_t)); } spa_config_exit(spa, SCL_ALLOC, FTAG); return (error); + } else { + /* + * Update the metaslab group's queue depth + * based on the newly allocated dva. + */ + metaslab_group_alloc_increment(spa, + DVA_GET_VDEV(&dva[d]), zio, flags); } + } ASSERT(error == 0); ASSERT(BP_GET_NDVAS(bp) == ndvas); Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/refcount.c ============================================================================== --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/refcount.c Sat Sep 3 09:24:33 2016 (r305330) +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/refcount.c Sat Sep 3 10:04:37 2016 (r305331) @@ -20,7 +20,7 @@ */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. - * Copyright (c) 2012 by Delphix. All rights reserved. + * Copyright (c) 2012, 2015 by Delphix. All rights reserved. */ #include @@ -73,6 +73,13 @@ refcount_create(refcount_t *rc) } void +refcount_create_tracked(refcount_t *rc) +{ + refcount_create(rc); + rc->rc_tracked = B_TRUE; +} + +void refcount_create_untracked(refcount_t *rc) { refcount_create(rc); @@ -255,4 +262,60 @@ refcount_transfer_ownership(refcount_t * ASSERT(found); mutex_exit(&rc->rc_mtx); } + +/* + * If tracking is enabled, return true if a reference exists that matches + * the "holder" tag. If tracking is disabled, then return true if a reference + * might be held. + */ +boolean_t +refcount_held(refcount_t *rc, void *holder) +{ + reference_t *ref; + + mutex_enter(&rc->rc_mtx); + + if (!rc->rc_tracked) { + mutex_exit(&rc->rc_mtx); + return (rc->rc_count > 0); + } + + for (ref = list_head(&rc->rc_list); ref; + ref = list_next(&rc->rc_list, ref)) { + if (ref->ref_holder == holder) { + mutex_exit(&rc->rc_mtx); + return (B_TRUE); + } + } + mutex_exit(&rc->rc_mtx); + return (B_FALSE); +} + +/* + * If tracking is enabled, return true if a reference does not exist that + * matches the "holder" tag. If tracking is disabled, always return true + * since the reference might not be held. + */ +boolean_t +refcount_not_held(refcount_t *rc, void *holder) +{ + reference_t *ref; + + mutex_enter(&rc->rc_mtx); + + if (!rc->rc_tracked) { + mutex_exit(&rc->rc_mtx); + return (B_TRUE); + } + + for (ref = list_head(&rc->rc_list); ref; + ref = list_next(&rc->rc_list, ref)) { + if (ref->ref_holder == holder) { + mutex_exit(&rc->rc_mtx); + return (B_FALSE); + } + } + mutex_exit(&rc->rc_mtx); + return (B_TRUE); +} #endif /* ZFS_DEBUG */ Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c ============================================================================== --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c Sat Sep 3 09:24:33 2016 (r305330) +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c Sat Sep 3 10:04:37 2016 (r305331) @@ -1332,7 +1332,6 @@ spa_unload(spa_t *spa) ddt_unload(spa); - /* * Drop and purge level 2 cache */ @@ -3720,6 +3719,7 @@ spa_create(const char *pool, nvlist_t *n spa->spa_uberblock.ub_txg = txg - 1; spa->spa_uberblock.ub_version = version; spa->spa_ubsync = spa->spa_uberblock; + spa->spa_load_state = SPA_LOAD_CREATE; /* * Create "The Godfather" zio to hold all async IOs @@ -3905,6 +3905,7 @@ spa_create(const char *pool, nvlist_t *n */ spa_evicting_os_wait(spa); spa->spa_minref = refcount_count(&spa->spa_refcount); + spa->spa_load_state = SPA_LOAD_NONE; mutex_exit(&spa_namespace_lock); @@ -5615,7 +5616,7 @@ spa_nvlist_lookup_by_guid(nvlist_t **nvp static void spa_vdev_remove_aux(nvlist_t *config, char *name, nvlist_t **dev, int count, - nvlist_t *dev_to_remove) + nvlist_t *dev_to_remove) { nvlist_t **newdev = NULL; @@ -6830,6 +6831,8 @@ spa_sync(spa_t *spa, uint64_t txg) vdev_t *vd; dmu_tx_t *tx; int error; + uint32_t max_queue_depth = zfs_vdev_async_write_max_active * + zfs_vdev_queue_depth_pct / 100; VERIFY(spa_writeable(spa)); @@ -6841,6 +6844,10 @@ spa_sync(spa_t *spa, uint64_t txg) spa->spa_syncing_txg = txg; spa->spa_sync_pass = 0; + mutex_enter(&spa->spa_alloc_lock); + VERIFY0(avl_numnodes(&spa->spa_alloc_tree)); + mutex_exit(&spa->spa_alloc_lock); + /* * If there are any pending vdev state changes, convert them * into config changes that go out with this transaction group. @@ -6900,6 +6907,38 @@ spa_sync(spa_t *spa, uint64_t txg) } /* + * Set the top-level vdev's max queue depth. Evaluate each + * top-level's async write queue depth in case it changed. + * The max queue depth will not change in the middle of syncing + * out this txg. + */ + uint64_t queue_depth_total = 0; + for (int c = 0; c < rvd->vdev_children; c++) { + vdev_t *tvd = rvd->vdev_child[c]; + metaslab_group_t *mg = tvd->vdev_mg; + + if (mg == NULL || mg->mg_class != spa_normal_class(spa) || + !metaslab_group_initialized(mg)) + continue; + + /* + * It is safe to do a lock-free check here because only async + * allocations look at mg_max_alloc_queue_depth, and async + * allocations all happen from spa_sync(). + */ + ASSERT0(refcount_count(&mg->mg_alloc_queue_depth)); + mg->mg_max_alloc_queue_depth = max_queue_depth; + queue_depth_total += mg->mg_max_alloc_queue_depth; + } + metaslab_class_t *mc = spa_normal_class(spa); + ASSERT0(refcount_count(&mc->mc_alloc_slots)); + mc->mc_alloc_max_slots = queue_depth_total; + mc->mc_alloc_throttle_enabled = zio_dva_throttle_enabled; + + ASSERT3U(mc->mc_alloc_max_slots, <=, + max_queue_depth * rvd->vdev_children); + + /* * Iterate to convergence. */ do { @@ -7056,6 +7095,10 @@ spa_sync(spa_t *spa, uint64_t txg) dsl_pool_sync_done(dp, txg); + mutex_enter(&spa->spa_alloc_lock); + VERIFY0(avl_numnodes(&spa->spa_alloc_tree)); + mutex_exit(&spa->spa_alloc_lock); + /* * Update usable space statistics. */ Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c ============================================================================== --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c Sat Sep 3 09:24:33 2016 (r305330) +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c Sat Sep 3 10:04:37 2016 (r305331) @@ -657,6 +657,7 @@ spa_add(const char *name, nvlist_t *conf mutex_init(&spa->spa_scrub_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&spa->spa_suspend_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&spa->spa_vdev_top_lock, NULL, MUTEX_DEFAULT, NULL); + mutex_init(&spa->spa_alloc_lock, NULL, MUTEX_DEFAULT, NULL); cv_init(&spa->spa_async_cv, NULL, CV_DEFAULT, NULL); cv_init(&spa->spa_evicting_os_cv, NULL, CV_DEFAULT, NULL); @@ -713,6 +714,9 @@ spa_add(const char *name, nvlist_t *conf spa_active_count++; } + avl_create(&spa->spa_alloc_tree, zio_timestamp_compare, + sizeof (zio_t), offsetof(zio_t, io_alloc_node)); + /* * Every pool starts with the default cachefile */ @@ -791,6 +795,7 @@ spa_remove(spa_t *spa) kmem_free(dp, sizeof (spa_config_dirent_t)); } + avl_destroy(&spa->spa_alloc_tree); list_destroy(&spa->spa_config_list); nvlist_free(spa->spa_label_features); @@ -824,6 +829,7 @@ spa_remove(spa_t *spa) cv_destroy(&spa->spa_scrub_io_cv); cv_destroy(&spa->spa_suspend_cv); + mutex_destroy(&spa->spa_alloc_lock); mutex_destroy(&spa->spa_async_lock); mutex_destroy(&spa->spa_errlist_lock); mutex_destroy(&spa->spa_errlog_lock); Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab.h ============================================================================== --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab.h Sat Sep 3 09:24:33 2016 (r305330) +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab.h Sat Sep 3 10:04:37 2016 (r305331) @@ -20,7 +20,7 @@ */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. - * Copyright (c) 2011, 2014 by Delphix. All rights reserved. + * Copyright (c) 2011, 2015 by Delphix. All rights reserved. */ #ifndef _SYS_METASLAB_H @@ -55,14 +55,15 @@ void metaslab_sync_done(metaslab_t *, ui void metaslab_sync_reassess(metaslab_group_t *); uint64_t metaslab_block_maxsize(metaslab_t *); -#define METASLAB_HINTBP_FAVOR 0x0 -#define METASLAB_HINTBP_AVOID 0x1 -#define METASLAB_GANG_HEADER 0x2 -#define METASLAB_GANG_CHILD 0x4 -#define METASLAB_GANG_AVOID 0x8 +#define METASLAB_HINTBP_FAVOR 0x0 +#define METASLAB_HINTBP_AVOID 0x1 +#define METASLAB_GANG_HEADER 0x2 +#define METASLAB_GANG_CHILD 0x4 +#define METASLAB_ASYNC_ALLOC 0x8 +#define METASLAB_DONT_THROTTLE 0x10 int metaslab_alloc(spa_t *, metaslab_class_t *, uint64_t, - blkptr_t *, int, uint64_t, blkptr_t *, int); + blkptr_t *, int, uint64_t, blkptr_t *, int, zio_t *); void metaslab_free(spa_t *, const blkptr_t *, uint64_t, boolean_t); int metaslab_claim(spa_t *, const blkptr_t *, uint64_t); void metaslab_check_free(spa_t *, const blkptr_t *); @@ -73,6 +74,9 @@ int metaslab_class_validate(metaslab_cla void metaslab_class_histogram_verify(metaslab_class_t *); uint64_t metaslab_class_fragmentation(metaslab_class_t *); uint64_t metaslab_class_expandable_space(metaslab_class_t *); +boolean_t metaslab_class_throttle_reserve(metaslab_class_t *, int, + zio_t *, int); +void metaslab_class_throttle_unreserve(metaslab_class_t *, int, zio_t *); void metaslab_class_space_update(metaslab_class_t *, int64_t, int64_t, int64_t, int64_t); @@ -86,10 +90,13 @@ metaslab_group_t *metaslab_group_create( void metaslab_group_destroy(metaslab_group_t *); void metaslab_group_activate(metaslab_group_t *); void metaslab_group_passivate(metaslab_group_t *); +boolean_t metaslab_group_initialized(metaslab_group_t *); uint64_t metaslab_group_get_space(metaslab_group_t *); void metaslab_group_histogram_verify(metaslab_group_t *); uint64_t metaslab_group_fragmentation(metaslab_group_t *); void metaslab_group_histogram_remove(metaslab_group_t *, metaslab_t *); +void metaslab_group_alloc_decrement(spa_t *, uint64_t, void *, int); +void metaslab_group_alloc_verify(spa_t *, const blkptr_t *, void *); #ifdef __cplusplus } Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h ============================================================================== --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h Sat Sep 3 09:24:33 2016 (r305330) +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h Sat Sep 3 10:04:37 2016 (r305331) @@ -24,7 +24,7 @@ */ /* - * Copyright (c) 2011, 2014 by Delphix. All rights reserved. + * Copyright (c) 2011, 2015 by Delphix. All rights reserved. */ #ifndef _SYS_METASLAB_IMPL_H @@ -59,11 +59,42 @@ extern "C" { * to use a block allocator that best suits that class. */ struct metaslab_class { + kmutex_t mc_lock; spa_t *mc_spa; metaslab_group_t *mc_rotor; metaslab_ops_t *mc_ops; uint64_t mc_aliquot; + + /* + * Track the number of metaslab groups that have been initialized + * and can accept allocations. An initialized metaslab group is + * one has been completely added to the config (i.e. we have + * updated the MOS config and the space has been added to the pool). + */ + uint64_t mc_groups; + + /* + * Toggle to enable/disable the allocation throttle. + */ + boolean_t mc_alloc_throttle_enabled; + + /* + * The allocation throttle works on a reservation system. Whenever + * an asynchronous zio wants to perform an allocation it must + * first reserve the number of blocks that it wants to allocate. + * If there aren't sufficient slots available for the pending zio + * then that I/O is throttled until more slots free up. The current + * number of reserved allocations is maintained by the mc_alloc_slots + * refcount. The mc_alloc_max_slots value determines the maximum + * number of allocations that the system allows. Gang blocks are + * allowed to reserve slots even if we've reached the maximum + * number of allocations allowed. + */ + uint64_t mc_alloc_max_slots; + refcount_t mc_alloc_slots; + uint64_t mc_alloc_groups; /* # of allocatable groups */ + uint64_t mc_alloc; /* total allocated space */ uint64_t mc_deferred; /* total deferred frees */ uint64_t mc_space; /* total space (alloc + free) */ @@ -86,6 +117,15 @@ struct metaslab_group { avl_tree_t mg_metaslab_tree; uint64_t mg_aliquot; boolean_t mg_allocatable; /* can we allocate? */ + + /* + * A metaslab group is considered to be initialized only after + * we have updated the MOS config and added the space to the pool. + * We only allow allocation attempts to a metaslab group if it + * has been initialized. + */ + boolean_t mg_initialized; + uint64_t mg_free_capacity; /* percentage free */ int64_t mg_bias; int64_t mg_activation_count; @@ -94,6 +134,27 @@ struct metaslab_group { taskq_t *mg_taskq; metaslab_group_t *mg_prev; metaslab_group_t *mg_next; + + /* + * Each metaslab group can handle mg_max_alloc_queue_depth allocations + * which are tracked by mg_alloc_queue_depth. It's possible for a + * metaslab group to handle more allocations than its max. This + * can occur when gang blocks are required or when other groups + * are unable to handle their share of allocations. + */ + uint64_t mg_max_alloc_queue_depth; + refcount_t mg_alloc_queue_depth; + + /* + * A metalab group that can no longer allocate the minimum block + * size will set mg_no_free_space. Once a metaslab group is out + * of space then its share of work must be distributed to other + * groups. + */ + boolean_t mg_no_free_space; + + uint64_t mg_allocations; + uint64_t mg_failed_allocations; uint64_t mg_fragmentation; uint64_t mg_histogram[RANGE_TREE_HISTOGRAM_SIZE]; }; Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/refcount.h ============================================================================== --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/refcount.h Sat Sep 3 09:24:33 2016 (r305330) +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/refcount.h Sat Sep 3 10:04:37 2016 (r305331) @@ -20,7 +20,7 @@ */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. - * Copyright (c) 2012 by Delphix. All rights reserved. + * Copyright (c) 2012, 2015 by Delphix. All rights reserved. */ #ifndef _SYS_REFCOUNT_H @@ -64,6 +64,7 @@ typedef struct refcount { void refcount_create(refcount_t *rc); void refcount_create_untracked(refcount_t *rc); +void refcount_create_tracked(refcount_t *rc); void refcount_destroy(refcount_t *rc); void refcount_destroy_many(refcount_t *rc, uint64_t number); int refcount_is_zero(refcount_t *rc); @@ -74,6 +75,8 @@ int64_t refcount_add_many(refcount_t *rc int64_t refcount_remove_many(refcount_t *rc, uint64_t number, void *holder_tag); void refcount_transfer(refcount_t *dst, refcount_t *src); void refcount_transfer_ownership(refcount_t *, void *, void *); +boolean_t refcount_held(refcount_t *, void *); +boolean_t refcount_not_held(refcount_t *, void *); void refcount_sysinit(void); void refcount_fini(void); @@ -86,6 +89,7 @@ typedef struct refcount { #define refcount_create(rc) ((rc)->rc_count = 0) #define refcount_create_untracked(rc) ((rc)->rc_count = 0) +#define refcount_create_tracked(rc) ((rc)->rc_count = 0) #define refcount_destroy(rc) ((rc)->rc_count = 0) #define refcount_destroy_many(rc, number) ((rc)->rc_count = 0) #define refcount_is_zero(rc) ((rc)->rc_count == 0) @@ -102,6 +106,8 @@ typedef struct refcount { atomic_add_64(&(dst)->rc_count, __tmp); \ } #define refcount_transfer_ownership(rc, current_holder, new_holder) +#define refcount_held(rc, holder) ((rc)->rc_count > 0) +#define refcount_not_held(rc, holder) (B_TRUE) #define refcount_sysinit() #define refcount_fini() Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa_impl.h ============================================================================== --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa_impl.h Sat Sep 3 09:24:33 2016 (r305330) +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa_impl.h Sat Sep 3 10:04:37 2016 (r305331) @@ -165,6 +165,8 @@ struct spa { uint64_t spa_last_synced_guid; /* last synced guid */ list_t spa_config_dirty_list; /* vdevs with dirty config */ list_t spa_state_dirty_list; /* vdevs with dirty state */ + kmutex_t spa_alloc_lock; + avl_tree_t spa_alloc_tree; spa_aux_vdev_t spa_spares; /* hot spares */ spa_aux_vdev_t spa_l2cache; /* L2ARC cache devices */ nvlist_t *spa_label_features; /* Features for reading MOS */ Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h ============================================================================== --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h Sat Sep 3 09:24:33 2016 (r305330) +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h Sat Sep 3 10:04:37 2016 (r305331) @@ -53,6 +53,9 @@ typedef struct vdev_queue vdev_queue_t; typedef struct vdev_cache vdev_cache_t; typedef struct vdev_cache_entry vdev_cache_entry_t; +extern int zfs_vdev_queue_depth_pct; +extern uint32_t zfs_vdev_async_write_max_active; + /* * Virtual device operations */ @@ -190,10 +193,21 @@ struct vdev { uint64_t vdev_deflate_ratio; /* deflation ratio (x512) */ uint64_t vdev_islog; /* is an intent log device */ uint64_t vdev_removing; /* device is being removed? */ - boolean_t vdev_ishole; /* is a hole in the namespace */ + boolean_t vdev_ishole; /* is a hole in the namespace */ + kmutex_t vdev_queue_lock; /* protects vdev_queue_depth */ uint64_t vdev_top_zap; /* + * The queue depth parameters determine how many async writes are + * still pending (i.e. allocated by net yet issued to disk) per + * top-level (vdev_async_write_queue_depth) and the maximum allowed + * (vdev_max_async_write_queue_depth). These values only apply to + * top-level vdevs. + */ + uint64_t vdev_async_write_queue_depth; + uint64_t vdev_max_async_write_queue_depth; + + /* * Leaf vdev state. */ range_tree_t *vdev_dtl[DTL_TYPES]; /* dirty time logs */ Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h ============================================================================== --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h Sat Sep 3 09:24:33 2016 (r305330) +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h Sat Sep 3 10:04:37 2016 (r305331) @@ -175,6 +175,7 @@ enum zio_flag { ZIO_FLAG_DONT_CACHE = 1 << 11, ZIO_FLAG_NODATA = 1 << 12, ZIO_FLAG_INDUCE_DAMAGE = 1 << 13, + ZIO_FLAG_IO_ALLOCATING = 1 << 14, #define ZIO_FLAG_DDT_INHERIT (ZIO_FLAG_IO_RETRY - 1) #define ZIO_FLAG_GANG_INHERIT (ZIO_FLAG_IO_RETRY - 1) @@ -182,27 +183,27 @@ enum zio_flag { /* * Flags inherited by vdev children. */ - ZIO_FLAG_IO_RETRY = 1 << 14, /* must be first for INHERIT */ - ZIO_FLAG_PROBE = 1 << 15, - ZIO_FLAG_TRYHARD = 1 << 16, - ZIO_FLAG_OPTIONAL = 1 << 17, + ZIO_FLAG_IO_RETRY = 1 << 15, /* must be first for INHERIT */ + ZIO_FLAG_PROBE = 1 << 16, + ZIO_FLAG_TRYHARD = 1 << 17, + ZIO_FLAG_OPTIONAL = 1 << 18, *** DIFF OUTPUT TRUNCATED AT 1000 LINES ***