Date: Mon, 16 Apr 2018 03:52:54 +0000 (UTC) From: Alexander Motin <mav@FreeBSD.org> To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-stable@freebsd.org, svn-src-stable-11@freebsd.org Subject: svn commit: r332540 - in stable/11: cddl/contrib/opensolaris/lib/libzpool/common/sys sys/cddl/contrib/opensolaris/uts/common sys/cddl/contrib/opensolaris/uts/common/fs/zfs sys/cddl/contrib/opensola... Message-ID: <201804160352.w3G3qsqA020776@repo.freebsd.org>
next in thread | raw e-mail | index | archive | help
Author: mav Date: Mon Apr 16 03:52:54 2018 New Revision: 332540 URL: https://svnweb.freebsd.org/changeset/base/332540 Log: MFC r331404: MFV r331400: 8484 Implement aggregate sum and use for arc counters In pursuit of improving performance on multi-core systems, we should implements fanned out counters and use them to improve the performance of some of the arc statistics. These stats are updated extremely frequently, and can consume a significant amount of CPU time. Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Paul Dagnelie <pcd@delphix.com> Added: stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/THIRDPARTYLICENSE.cityhash - copied unchanged from r331404, head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/THIRDPARTYLICENSE.cityhash stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/THIRDPARTYLICENSE.cityhash.descrip - copied unchanged from r331404, head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/THIRDPARTYLICENSE.cityhash.descrip stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/aggsum.c - copied unchanged from r331404, head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/aggsum.c stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/cityhash.c - copied unchanged from r331404, head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/cityhash.c stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/aggsum.h - copied unchanged from r331404, head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/aggsum.h stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/cityhash.h - copied unchanged from r331404, head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/cityhash.h Modified: stable/11/cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h stable/11/sys/cddl/contrib/opensolaris/uts/common/Makefile.files stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_context.h stable/11/sys/conf/files Directory Properties: stable/11/ (props changed) Modified: stable/11/cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h ============================================================================== --- stable/11/cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h Mon Apr 16 03:49:27 2018 (r332539) +++ stable/11/cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h Mon Apr 16 03:52:54 2018 (r332540) @@ -559,6 +559,7 @@ extern void delay(clock_t ticks); } while (0); #define max_ncpus 64 +#define boot_ncpus (sysconf(_SC_NPROCESSORS_ONLN)) #define minclsyspri 60 #define maxclsyspri 99 Modified: stable/11/sys/cddl/contrib/opensolaris/uts/common/Makefile.files ============================================================================== --- stable/11/sys/cddl/contrib/opensolaris/uts/common/Makefile.files Mon Apr 16 03:49:27 2018 (r332539) +++ stable/11/sys/cddl/contrib/opensolaris/uts/common/Makefile.files Mon Apr 16 03:52:54 2018 (r332540) @@ -63,12 +63,14 @@ LUA_OBJS += \ ZFS_COMMON_OBJS += \ abd.o \ + aggsum.o \ arc.o \ bplist.o \ blkptr.o \ bpobj.o \ bptree.o \ bqueue.o \ + cityhash.o \ dbuf.o \ ddt.o \ ddt_zap.o \ Copied: stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/THIRDPARTYLICENSE.cityhash (from r331404, head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/THIRDPARTYLICENSE.cityhash) ============================================================================== --- /dev/null 00:00:00 1970 (empty, because file is newly added) +++ stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/THIRDPARTYLICENSE.cityhash Mon Apr 16 03:52:54 2018 (r332540, copy of r331404, head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/THIRDPARTYLICENSE.cityhash) @@ -0,0 +1,19 @@ +Copyright (c) 2011 Google, Inc. + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in +all copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +THE SOFTWARE. Copied: stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/THIRDPARTYLICENSE.cityhash.descrip (from r331404, head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/THIRDPARTYLICENSE.cityhash.descrip) ============================================================================== --- /dev/null 00:00:00 1970 (empty, because file is newly added) +++ stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/THIRDPARTYLICENSE.cityhash.descrip Mon Apr 16 03:52:54 2018 (r332540, copy of r331404, head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/THIRDPARTYLICENSE.cityhash.descrip) @@ -0,0 +1 @@ +CITYHASH CHECKSUM FUNCTIONALITY IN ZFS Copied: stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/aggsum.c (from r331404, head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/aggsum.c) ============================================================================== --- /dev/null 00:00:00 1970 (empty, because file is newly added) +++ stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/aggsum.c Mon Apr 16 03:52:54 2018 (r332540, copy of r331404, head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/aggsum.c) @@ -0,0 +1,232 @@ +/* + * CDDL HEADER START + * + * This file and its contents are supplied under the terms of the + * Common Development and Distribution License ("CDDL"), version 1.0. + * You may only use this file in accordance with the terms of version + * 1.0 of the CDDL. + * + * A full copy of the text of the CDDL should have accompanied this + * source. A copy of the CDDL is also available via the Internet at + * http://www.illumos.org/license/CDDL. + * + * CDDL HEADER END + */ +/* + * Copyright (c) 2017 by Delphix. All rights reserved. + */ + +#include <sys/zfs_context.h> +#include <sys/aggsum.h> + +/* + * Aggregate-sum counters are a form of fanned-out counter, used when atomic + * instructions on a single field cause enough CPU cache line contention to + * slow system performance. Due to their increased overhead and the expense + * involved with precisely reading from them, they should only be used in cases + * where the write rate (increment/decrement) is much higher than the read rate + * (get value). + * + * Aggregate sum counters are comprised of two basic parts, the core and the + * buckets. The core counter contains a lock for the entire counter, as well + * as the current upper and lower bounds on the value of the counter. The + * aggsum_bucket structure contains a per-bucket lock to protect the contents of + * the bucket, the current amount that this bucket has changed from the global + * counter (called the delta), and the amount of increment and decrement we have + * "borrowed" from the core counter. + * + * The basic operation of an aggsum is simple. Threads that wish to modify the + * counter will modify one bucket's counter (determined by their current CPU, to + * help minimize lock and cache contention). If the bucket already has + * sufficient capacity borrowed from the core structure to handle their request, + * they simply modify the delta and return. If the bucket does not, we clear + * the bucket's current state (to prevent the borrowed amounts from getting too + * large), and borrow more from the core counter. Borrowing is done by adding to + * the upper bound (or subtracting from the lower bound) of the core counter, + * and setting the borrow value for the bucket to the amount added (or + * subtracted). Clearing the bucket is the opposite; we add the current delta + * to both the lower and upper bounds of the core counter, subtract the borrowed + * incremental from the upper bound, and add the borrowed decrement from the + * lower bound. Note that only borrowing and clearing require access to the + * core counter; since all other operations access CPU-local resources, + * performance can be much higher than a traditional counter. + * + * Threads that wish to read from the counter have a slightly more challenging + * task. It is fast to determine the upper and lower bounds of the aggum; this + * does not require grabbing any locks. This suffices for cases where an + * approximation of the aggsum's value is acceptable. However, if one needs to + * know whether some specific value is above or below the current value in the + * aggsum, they invoke aggsum_compare(). This function operates by repeatedly + * comparing the target value to the upper and lower bounds of the aggsum, and + * then clearing a bucket. This proceeds until the target is outside of the + * upper and lower bounds and we return a response, or the last bucket has been + * cleared and we know that the target is equal to the aggsum's value. Finally, + * the most expensive operation is determining the precise value of the aggsum. + * To do this, we clear every bucket and then return the upper bound (which must + * be equal to the lower bound). What makes aggsum_compare() and aggsum_value() + * expensive is clearing buckets. This involves grabbing the global lock + * (serializing against themselves and borrow operations), grabbing a bucket's + * lock (preventing threads on those CPUs from modifying their delta), and + * zeroing out the borrowed value (forcing that thread to borrow on its next + * request, which will also be expensive). This is what makes aggsums well + * suited for write-many read-rarely operations. + */ + +/* + * We will borrow aggsum_borrow_multiplier times the current request, so we will + * have to get the as_lock approximately every aggsum_borrow_multiplier calls to + * aggsum_delta(). + */ +static uint_t aggsum_borrow_multiplier = 10; + +void +aggsum_init(aggsum_t *as, uint64_t value) +{ + bzero(as, sizeof (*as)); + as->as_lower_bound = as->as_upper_bound = value; + mutex_init(&as->as_lock, NULL, MUTEX_DEFAULT, NULL); + as->as_numbuckets = boot_ncpus; + as->as_buckets = kmem_zalloc(boot_ncpus * sizeof (aggsum_bucket_t), + KM_SLEEP); + for (int i = 0; i < as->as_numbuckets; i++) { + mutex_init(&as->as_buckets[i].asc_lock, + NULL, MUTEX_DEFAULT, NULL); + } +} + +void +aggsum_fini(aggsum_t *as) +{ + for (int i = 0; i < as->as_numbuckets; i++) + mutex_destroy(&as->as_buckets[i].asc_lock); + mutex_destroy(&as->as_lock); +} + +int64_t +aggsum_lower_bound(aggsum_t *as) +{ + return (as->as_lower_bound); +} + +int64_t +aggsum_upper_bound(aggsum_t *as) +{ + return (as->as_upper_bound); +} + +static void +aggsum_flush_bucket(aggsum_t *as, struct aggsum_bucket *asb) +{ + ASSERT(MUTEX_HELD(&as->as_lock)); + ASSERT(MUTEX_HELD(&asb->asc_lock)); + + /* + * We use atomic instructions for this because we read the upper and + * lower bounds without the lock, so we need stores to be atomic. + */ + atomic_add_64((volatile uint64_t *)&as->as_lower_bound, asb->asc_delta); + atomic_add_64((volatile uint64_t *)&as->as_upper_bound, asb->asc_delta); + asb->asc_delta = 0; + atomic_add_64((volatile uint64_t *)&as->as_upper_bound, + -asb->asc_borrowed); + atomic_add_64((volatile uint64_t *)&as->as_lower_bound, + asb->asc_borrowed); + asb->asc_borrowed = 0; +} + +uint64_t +aggsum_value(aggsum_t *as) +{ + int64_t rv; + + mutex_enter(&as->as_lock); + if (as->as_lower_bound == as->as_upper_bound) { + rv = as->as_lower_bound; + for (int i = 0; i < as->as_numbuckets; i++) { + ASSERT0(as->as_buckets[i].asc_delta); + ASSERT0(as->as_buckets[i].asc_borrowed); + } + mutex_exit(&as->as_lock); + return (rv); + } + for (int i = 0; i < as->as_numbuckets; i++) { + struct aggsum_bucket *asb = &as->as_buckets[i]; + mutex_enter(&asb->asc_lock); + aggsum_flush_bucket(as, asb); + mutex_exit(&asb->asc_lock); + } + VERIFY3U(as->as_lower_bound, ==, as->as_upper_bound); + rv = as->as_lower_bound; + mutex_exit(&as->as_lock); + + return (rv); +} + +static void +aggsum_borrow(aggsum_t *as, int64_t delta, struct aggsum_bucket *asb) +{ + int64_t abs_delta = (delta < 0 ? -delta : delta); + mutex_enter(&as->as_lock); + mutex_enter(&asb->asc_lock); + + aggsum_flush_bucket(as, asb); + + atomic_add_64((volatile uint64_t *)&as->as_upper_bound, abs_delta); + atomic_add_64((volatile uint64_t *)&as->as_lower_bound, -abs_delta); + asb->asc_borrowed = abs_delta; + + mutex_exit(&asb->asc_lock); + mutex_exit(&as->as_lock); +} + +void +aggsum_add(aggsum_t *as, int64_t delta) +{ + struct aggsum_bucket *asb = + &as->as_buckets[CPU_SEQID % as->as_numbuckets]; + + for (;;) { + mutex_enter(&asb->asc_lock); + if (asb->asc_delta + delta <= (int64_t)asb->asc_borrowed && + asb->asc_delta + delta >= -(int64_t)asb->asc_borrowed) { + asb->asc_delta += delta; + mutex_exit(&asb->asc_lock); + return; + } + mutex_exit(&asb->asc_lock); + aggsum_borrow(as, delta * aggsum_borrow_multiplier, asb); + } +} + +/* + * Compare the aggsum value to target efficiently. Returns -1 if the value + * represented by the aggsum is less than target, 1 if it's greater, and 0 if + * they are equal. + */ +int +aggsum_compare(aggsum_t *as, uint64_t target) +{ + if (as->as_upper_bound < target) + return (-1); + if (as->as_lower_bound > target) + return (1); + mutex_enter(&as->as_lock); + for (int i = 0; i < as->as_numbuckets; i++) { + struct aggsum_bucket *asb = &as->as_buckets[i]; + mutex_enter(&asb->asc_lock); + aggsum_flush_bucket(as, asb); + mutex_exit(&asb->asc_lock); + if (as->as_upper_bound < target) { + mutex_exit(&as->as_lock); + return (-1); + } + if (as->as_lower_bound > target) { + mutex_exit(&as->as_lock); + return (1); + } + } + VERIFY3U(as->as_lower_bound, ==, as->as_upper_bound); + ASSERT3U(as->as_lower_bound, ==, target); + mutex_exit(&as->as_lock); + return (0); +} Modified: stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c ============================================================================== --- stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c Mon Apr 16 03:49:27 2018 (r332539) +++ stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c Mon Apr 16 03:52:54 2018 (r332540) @@ -275,6 +275,8 @@ #include <sys/trim_map.h> #include <zfs_fletcher.h> #include <sys/sdt.h> +#include <sys/aggsum.h> +#include <sys/cityhash.h> #include <machine/vmparam.h> @@ -561,6 +563,7 @@ typedef struct arc_stats { kstat_named_t arcstat_c; kstat_named_t arcstat_c_min; kstat_named_t arcstat_c_max; + /* Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_size; /* * Number of compressed bytes stored in the arc_buf_hdr_t's b_pabd. @@ -589,12 +592,14 @@ typedef struct arc_stats { * (allocated via arc_buf_hdr_t_full and arc_buf_hdr_t_l2only * caches), and arc_buf_t structures (allocated via arc_buf_t * cache). + * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_hdr_size; /* * Number of bytes consumed by ARC buffers of type equal to * ARC_BUFC_DATA. This is generally consumed by buffers backing * on disk user data (e.g. plain file contents). + * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_data_size; /* @@ -602,6 +607,7 @@ typedef struct arc_stats { * ARC_BUFC_METADATA. This is generally consumed by buffers * backing on disk data that is used for internal ZFS * structures (e.g. ZAP, dnode, indirect blocks, etc). + * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_metadata_size; /* @@ -610,6 +616,7 @@ typedef struct arc_stats { * buffers (allocated directly via zio_buf_* functions), * dmu_buf_impl_t structures (allocated via dmu_buf_impl_t * cache), and dnode_t structures (allocated via dnode_t cache). + * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_other_size; /* @@ -617,6 +624,7 @@ typedef struct arc_stats { * arc_anon state. This includes *all* buffers in the arc_anon * state; e.g. data, metadata, evictable, and unevictable buffers * are all included in this value. + * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_anon_size; /* @@ -624,6 +632,7 @@ typedef struct arc_stats { * following criteria: backing buffers of type ARC_BUFC_DATA, * residing in the arc_anon state, and are eligible for eviction * (e.g. have no outstanding holds on the buffer). + * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_anon_evictable_data; /* @@ -631,6 +640,7 @@ typedef struct arc_stats { * following criteria: backing buffers of type ARC_BUFC_METADATA, * residing in the arc_anon state, and are eligible for eviction * (e.g. have no outstanding holds on the buffer). + * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_anon_evictable_metadata; /* @@ -638,6 +648,7 @@ typedef struct arc_stats { * arc_mru state. This includes *all* buffers in the arc_mru * state; e.g. data, metadata, evictable, and unevictable buffers * are all included in this value. + * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_mru_size; /* @@ -645,6 +656,7 @@ typedef struct arc_stats { * following criteria: backing buffers of type ARC_BUFC_DATA, * residing in the arc_mru state, and are eligible for eviction * (e.g. have no outstanding holds on the buffer). + * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_mru_evictable_data; /* @@ -652,6 +664,7 @@ typedef struct arc_stats { * following criteria: backing buffers of type ARC_BUFC_METADATA, * residing in the arc_mru state, and are eligible for eviction * (e.g. have no outstanding holds on the buffer). + * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_mru_evictable_metadata; /* @@ -662,18 +675,21 @@ typedef struct arc_stats { * don't actually have ARC buffers linked off of these headers. * Thus, *if* the headers had associated ARC buffers, these * buffers *would have* consumed this number of bytes. + * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_mru_ghost_size; /* * Number of bytes that *would have been* consumed by ARC * buffers that are eligible for eviction, of type * ARC_BUFC_DATA, and linked off the arc_mru_ghost state. + * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_mru_ghost_evictable_data; /* * Number of bytes that *would have been* consumed by ARC * buffers that are eligible for eviction, of type * ARC_BUFC_METADATA, and linked off the arc_mru_ghost state. + * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_mru_ghost_evictable_metadata; /* @@ -681,36 +697,42 @@ typedef struct arc_stats { * arc_mfu state. This includes *all* buffers in the arc_mfu * state; e.g. data, metadata, evictable, and unevictable buffers * are all included in this value. + * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_mfu_size; /* * Number of bytes consumed by ARC buffers that are eligible for * eviction, of type ARC_BUFC_DATA, and reside in the arc_mfu * state. + * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_mfu_evictable_data; /* * Number of bytes consumed by ARC buffers that are eligible for * eviction, of type ARC_BUFC_METADATA, and reside in the * arc_mfu state. + * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_mfu_evictable_metadata; /* * Total number of bytes that *would have been* consumed by ARC * buffers in the arc_mfu_ghost state. See the comment above * arcstat_mru_ghost_size for more details. + * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_mfu_ghost_size; /* * Number of bytes that *would have been* consumed by ARC * buffers that are eligible for eviction, of type * ARC_BUFC_DATA, and linked off the arc_mfu_ghost state. + * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_mfu_ghost_evictable_data; /* * Number of bytes that *would have been* consumed by ARC * buffers that are eligible for eviction, of type * ARC_BUFC_METADATA, and linked off the arc_mru_ghost state. + * Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_mfu_ghost_evictable_metadata; kstat_named_t arcstat_l2_hits; @@ -732,6 +754,7 @@ typedef struct arc_stats { kstat_named_t arcstat_l2_io_error; kstat_named_t arcstat_l2_lsize; kstat_named_t arcstat_l2_psize; + /* Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_l2_hdr_size; kstat_named_t arcstat_l2_write_trylock_fail; kstat_named_t arcstat_l2_write_passed_headroom; @@ -746,6 +769,7 @@ typedef struct arc_stats { kstat_named_t arcstat_l2_write_buffer_list_iter; kstat_named_t arcstat_l2_write_buffer_list_null_iter; kstat_named_t arcstat_memory_throttle_count; + /* Not updated directly; only synced in arc_kstat_update. */ kstat_named_t arcstat_meta_used; kstat_named_t arcstat_meta_limit; kstat_named_t arcstat_meta_max; @@ -905,14 +929,12 @@ static arc_state_t *arc_l2c_only; * the possibility of inconsistency by having shadow copies of the variables, * while still allowing the code to be readable. */ -#define arc_size ARCSTAT(arcstat_size) /* actual total arc size */ #define arc_p ARCSTAT(arcstat_p) /* target size of MRU */ #define arc_c ARCSTAT(arcstat_c) /* target size of cache */ #define arc_c_min ARCSTAT(arcstat_c_min) /* min target cache size */ #define arc_c_max ARCSTAT(arcstat_c_max) /* max target cache size */ #define arc_meta_limit ARCSTAT(arcstat_meta_limit) /* max size for metadata */ #define arc_meta_min ARCSTAT(arcstat_meta_min) /* min size for metadata */ -#define arc_meta_used ARCSTAT(arcstat_meta_used) /* size of metadata */ #define arc_meta_max ARCSTAT(arcstat_meta_max) /* max size of metadata */ /* compressed size of entire arc */ @@ -922,6 +944,22 @@ static arc_state_t *arc_l2c_only; /* number of bytes in the arc from arc_buf_t's */ #define arc_overhead_size ARCSTAT(arcstat_overhead_size) +/* + * There are also some ARC variables that we want to export, but that are + * updated so often that having the canonical representation be the statistic + * variable causes a performance bottleneck. We want to use aggsum_t's for these + * instead, but still be able to export the kstat in the same way as before. + * The solution is to always use the aggsum version, except in the kstat update + * callback. + */ +aggsum_t arc_size; +aggsum_t arc_meta_used; +aggsum_t astat_data_size; +aggsum_t astat_metadata_size; +aggsum_t astat_hdr_size; +aggsum_t astat_other_size; +aggsum_t astat_l2_hdr_size; + static int arc_no_grow; /* Don't try to grow cache size */ static uint64_t arc_tempreserve; static uint64_t arc_loaned_bytes; @@ -1433,21 +1471,14 @@ l2arc_trim(const arc_buf_hdr_t *hdr) } } +/* + * We use Cityhash for this. It's fast, and has good hash properties without + * requiring any large static buffers. + */ static uint64_t buf_hash(uint64_t spa, const dva_t *dva, uint64_t birth) { - uint8_t *vdva = (uint8_t *)dva; - uint64_t crc = -1ULL; - int i; - - ASSERT(zfs_crc64_table[128] == ZFS_CRC64_POLY); - - for (i = 0; i < sizeof (dva_t); i++) - crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ vdva[i]) & 0xFF]; - - crc ^= (spa>>8) ^ birth; - - return (crc); + return (cityhash4(spa, dva->dva_word[0], dva->dva_word[1], birth)); } #define HDR_EMPTY(hdr) \ @@ -2643,26 +2674,26 @@ arc_space_consume(uint64_t space, arc_space_type_t typ switch (type) { case ARC_SPACE_DATA: - ARCSTAT_INCR(arcstat_data_size, space); + aggsum_add(&astat_data_size, space); break; case ARC_SPACE_META: - ARCSTAT_INCR(arcstat_metadata_size, space); + aggsum_add(&astat_metadata_size, space); break; case ARC_SPACE_OTHER: - ARCSTAT_INCR(arcstat_other_size, space); + aggsum_add(&astat_other_size, space); break; case ARC_SPACE_HDRS: - ARCSTAT_INCR(arcstat_hdr_size, space); + aggsum_add(&astat_hdr_size, space); break; case ARC_SPACE_L2HDRS: - ARCSTAT_INCR(arcstat_l2_hdr_size, space); + aggsum_add(&astat_l2_hdr_size, space); break; } if (type != ARC_SPACE_DATA) - ARCSTAT_INCR(arcstat_meta_used, space); + aggsum_add(&arc_meta_used, space); - atomic_add_64(&arc_size, space); + aggsum_add(&arc_size, space); } void @@ -2672,31 +2703,36 @@ arc_space_return(uint64_t space, arc_space_type_t type switch (type) { case ARC_SPACE_DATA: - ARCSTAT_INCR(arcstat_data_size, -space); + aggsum_add(&astat_data_size, -space); break; case ARC_SPACE_META: - ARCSTAT_INCR(arcstat_metadata_size, -space); + aggsum_add(&astat_metadata_size, -space); break; case ARC_SPACE_OTHER: - ARCSTAT_INCR(arcstat_other_size, -space); + aggsum_add(&astat_other_size, -space); break; case ARC_SPACE_HDRS: - ARCSTAT_INCR(arcstat_hdr_size, -space); + aggsum_add(&astat_hdr_size, -space); break; case ARC_SPACE_L2HDRS: - ARCSTAT_INCR(arcstat_l2_hdr_size, -space); + aggsum_add(&astat_l2_hdr_size, -space); break; } if (type != ARC_SPACE_DATA) { - ASSERT(arc_meta_used >= space); - if (arc_meta_max < arc_meta_used) - arc_meta_max = arc_meta_used; - ARCSTAT_INCR(arcstat_meta_used, -space); + ASSERT(aggsum_compare(&arc_meta_used, space) >= 0); + /* + * We use the upper bound here rather than the precise value + * because the arc_meta_max value doesn't need to be + * precise. It's only consumed by humans via arcstats. + */ + if (arc_meta_max < aggsum_upper_bound(&arc_meta_used)) + arc_meta_max = aggsum_upper_bound(&arc_meta_used); + aggsum_add(&arc_meta_used, -space); } - ASSERT(arc_size >= space); - atomic_add_64(&arc_size, -space); + ASSERT(aggsum_compare(&arc_size, space) >= 0); + aggsum_add(&arc_size, -space); } /* @@ -3898,7 +3934,7 @@ arc_adjust_impl(arc_state_t *state, uint64_t spa, int6 * capped by the arc_meta_limit tunable. */ static uint64_t -arc_adjust_meta(void) +arc_adjust_meta(uint64_t meta_used) { uint64_t total_evicted = 0; int64_t target; @@ -3910,7 +3946,7 @@ arc_adjust_meta(void) * we're over the meta limit more than we're over arc_p, we * evict some from the MRU here, and some from the MFU below. */ - target = MIN((int64_t)(arc_meta_used - arc_meta_limit), + target = MIN((int64_t)(meta_used - arc_meta_limit), (int64_t)(refcount_count(&arc_anon->arcs_size) + refcount_count(&arc_mru->arcs_size) - arc_p)); @@ -3921,8 +3957,9 @@ arc_adjust_meta(void) * below the meta limit, but not so much as to drop us below the * space allotted to the MFU (which is defined as arc_c - arc_p). */ - target = MIN((int64_t)(arc_meta_used - arc_meta_limit), - (int64_t)(refcount_count(&arc_mfu->arcs_size) - (arc_c - arc_p))); + target = MIN((int64_t)(meta_used - arc_meta_limit), + (int64_t)(refcount_count(&arc_mfu->arcs_size) - + (arc_c - arc_p))); total_evicted += arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA); @@ -4013,12 +4050,14 @@ arc_adjust(void) uint64_t total_evicted = 0; uint64_t bytes; int64_t target; + uint64_t asize = aggsum_value(&arc_size); + uint64_t ameta = aggsum_value(&arc_meta_used); /* * If we're over arc_meta_limit, we want to correct that before * potentially evicting data buffers below. */ - total_evicted += arc_adjust_meta(); + total_evicted += arc_adjust_meta(ameta); /* * Adjust MRU size @@ -4030,9 +4069,9 @@ arc_adjust(void) * the MRU is over arc_p, we'll evict enough to get back to * arc_p here, and then evict more from the MFU below. */ - target = MIN((int64_t)(arc_size - arc_c), + target = MIN((int64_t)(asize - arc_c), (int64_t)(refcount_count(&arc_anon->arcs_size) + - refcount_count(&arc_mru->arcs_size) + arc_meta_used - arc_p)); + refcount_count(&arc_mru->arcs_size) + ameta - arc_p)); /* * If we're below arc_meta_min, always prefer to evict data. @@ -4043,7 +4082,7 @@ arc_adjust(void) * type, spill over into the next type. */ if (arc_adjust_type(arc_mru) == ARC_BUFC_METADATA && - arc_meta_used > arc_meta_min) { + ameta > arc_meta_min) { bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA); total_evicted += bytes; @@ -4076,10 +4115,10 @@ arc_adjust(void) * size back to arc_p, if we're still above the target cache * size, we evict the rest from the MFU. */ - target = arc_size - arc_c; + target = asize - arc_c; if (arc_adjust_type(arc_mfu) == ARC_BUFC_METADATA && - arc_meta_used > arc_meta_min) { + ameta > arc_meta_min) { bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA); total_evicted += bytes; @@ -4180,6 +4219,7 @@ arc_flush(spa_t *spa, boolean_t retry) void arc_shrink(int64_t to_free) { + uint64_t asize = aggsum_value(&arc_size); if (arc_c > arc_c_min) { DTRACE_PROBE4(arc__shrink, uint64_t, arc_c, uint64_t, arc_c_min, uint64_t, arc_p, uint64_t, to_free); @@ -4189,8 +4229,8 @@ arc_shrink(int64_t to_free) arc_c = arc_c_min; atomic_add_64(&arc_p, -(arc_p >> arc_shrink_shift)); - if (arc_c > arc_size) - arc_c = MAX(arc_size, arc_c_min); + if (asize < arc_c) + arc_c = MAX(asize, arc_c_min); if (arc_p > arc_c) arc_p = (arc_c >> 1); @@ -4201,8 +4241,8 @@ arc_shrink(int64_t to_free) ASSERT((int64_t)arc_p >= 0); } - if (arc_size > arc_c) { - DTRACE_PROBE2(arc__shrink_adjust, uint64_t, arc_size, + if (asize > arc_c) { + DTRACE_PROBE2(arc__shrink_adjust, uint64_t, asize, uint64_t, arc_c); (void) arc_adjust(); } @@ -4402,7 +4442,7 @@ arc_kmem_reap_now(void) DTRACE_PROBE(arc__kmem_reap_start); #ifdef _KERNEL - if (arc_meta_used >= arc_meta_limit) { + if (aggsum_compare(&arc_meta_used, arc_meta_limit) >= 0) { /* * We are exceeding our meta-data cache limit. * Purge some DNLC entries to release holds on meta-data. @@ -4566,7 +4606,7 @@ arc_reclaim_thread(void *unused __unused) * be helpful and could potentially cause us to enter an * infinite loop. */ - if (arc_size <= arc_c || evicted == 0) { + if (aggsum_compare(&arc_size, arc_c) <= 0|| evicted == 0) { /* * We're either no longer overflowing, or we * can't evict anything more, so we should wake @@ -4699,7 +4739,8 @@ arc_adapt(int bytes, arc_state_t *state) * If we're within (2 * maxblocksize) bytes of the target * cache size, increment the target cache size */ - if (arc_size > arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) { + if (aggsum_compare(&arc_size, arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) > + 0) { DTRACE_PROBE1(arc__inc_adapt, int, bytes); atomic_add_64(&arc_c, (int64_t)bytes); if (arc_c > arc_c_max) @@ -4723,7 +4764,16 @@ arc_is_overflowing(void) uint64_t overflow = MAX(SPA_MAXBLOCKSIZE, arc_c >> zfs_arc_overflow_shift); - return (arc_size >= arc_c + overflow); + /* + * We just compare the lower bound here for performance reasons. Our + * primary goals are to make sure that the arc never grows without + * bound, and that it can reach its maximum size. This check + * accomplishes both goals. The maximum amount we could run over by is + * 2 * aggsum_borrow_multiplier * NUM_CPUS * the average size of a block + * in the ARC. In practice, that's in the tens of MB, which is low + * enough to be safe. + */ + return (aggsum_lower_bound(&arc_size) >= arc_c + overflow); } static abd_t * @@ -4838,7 +4888,8 @@ arc_get_data_impl(arc_buf_hdr_t *hdr, uint64_t size, v * If we are growing the cache, and we are adding anonymous * data, and we have outgrown arc_p, update arc_p */ - if (arc_size < arc_c && hdr->b_l1hdr.b_state == arc_anon && + if (aggsum_compare(&arc_size, arc_c) < 0 && + hdr->b_l1hdr.b_state == arc_anon && (refcount_count(&arc_anon->arcs_size) + refcount_count(&arc_mru->arcs_size) > arc_p)) arc_p = MIN(arc_c, arc_p + size); @@ -6293,6 +6344,15 @@ arc_kstat_update(kstat_t *ksp, int rw) &as->arcstat_mfu_ghost_size, &as->arcstat_mfu_ghost_evictable_data, &as->arcstat_mfu_ghost_evictable_metadata); + + ARCSTAT(arcstat_size) = aggsum_value(&arc_size); + ARCSTAT(arcstat_meta_used) = aggsum_value(&arc_meta_used); + ARCSTAT(arcstat_data_size) = aggsum_value(&astat_data_size); + ARCSTAT(arcstat_metadata_size) = + aggsum_value(&astat_metadata_size); + ARCSTAT(arcstat_hdr_size) = aggsum_value(&astat_hdr_size); + ARCSTAT(arcstat_other_size) = aggsum_value(&astat_other_size); + ARCSTAT(arcstat_l2_hdr_size) = aggsum_value(&astat_l2_hdr_size); } return (0); @@ -6425,6 +6485,14 @@ arc_state_init(void) refcount_create(&arc_mfu->arcs_size); refcount_create(&arc_mfu_ghost->arcs_size); refcount_create(&arc_l2c_only->arcs_size); + + aggsum_init(&arc_meta_used, 0); + aggsum_init(&arc_size, 0); + aggsum_init(&astat_data_size, 0); + aggsum_init(&astat_metadata_size, 0); + aggsum_init(&astat_hdr_size, 0); + aggsum_init(&astat_other_size, 0); + aggsum_init(&astat_l2_hdr_size, 0); } static void @@ -6529,7 +6597,6 @@ arc_init(void) arc_c = arc_c_max; arc_p = (arc_c >> 1); - arc_size = 0; /* limit meta-data to 1/4 of the arc capacity */ arc_meta_limit = arc_c_max / 4; Copied: stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/cityhash.c (from r331404, head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/cityhash.c) ============================================================================== --- /dev/null 00:00:00 1970 (empty, because file is newly added) +++ stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/cityhash.c Mon Apr 16 03:52:54 2018 (r332540, copy of r331404, head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/cityhash.c) @@ -0,0 +1,63 @@ +// Copyright (c) 2011 Google, Inc. +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in +// all copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +// THE SOFTWARE. + +/* + * Copyright (c) 2017 by Delphix. All rights reserved. + */ + +#include <sys/cityhash.h> + +#define HASH_K1 0xb492b66fbe98f273ULL +#define HASH_K2 0x9ae16a3b2f90404fULL + +/* + * Bitwise right rotate. Normally this will compile to a single + * instruction. + */ +static inline uint64_t +rotate(uint64_t val, int shift) +{ + // Avoid shifting by 64: doing so yields an undefined result. + return (shift == 0 ? val : (val >> shift) | (val << (64 - shift))); +} + +static inline uint64_t +cityhash_helper(uint64_t u, uint64_t v, uint64_t mul) +{ + uint64_t a = (u ^ v) * mul; + a ^= (a >> 47); + uint64_t b = (v ^ a) * mul; + b ^= (b >> 47); + b *= mul; + return (b); +} + +uint64_t +cityhash4(uint64_t w1, uint64_t w2, uint64_t w3, uint64_t w4) +{ + uint64_t mul = HASH_K2 + 64; + uint64_t a = w1 * HASH_K1; + uint64_t b = w2; + uint64_t c = w4 * mul; + uint64_t d = w3 * HASH_K2; + return (cityhash_helper(rotate(a + b, 43) + rotate(c, 30) + d, + a + rotate(b + HASH_K2, 18) + c, mul)); + +} Modified: stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c ============================================================================== --- stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c Mon Apr 16 03:49:27 2018 (r332539) +++ stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c Mon Apr 16 03:52:54 2018 (r332540) @@ -48,6 +48,7 @@ #include <sys/callb.h> #include <sys/abd.h> #include <sys/vdev.h> +#include <sys/cityhash.h> uint_t zfs_dbuf_evict_key; @@ -177,23 +178,14 @@ static dbuf_hash_table_t dbuf_hash_table; static uint64_t dbuf_hash_count; +/* + * We use Cityhash for this. It's fast, and has good hash properties without + * requiring any large static buffers. + */ static uint64_t dbuf_hash(void *os, uint64_t obj, uint8_t lvl, uint64_t blkid) { - uintptr_t osv = (uintptr_t)os; - uint64_t crc = -1ULL; - - ASSERT(zfs_crc64_table[128] == ZFS_CRC64_POLY); - crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (lvl)) & 0xFF]; - crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (osv >> 6)) & 0xFF]; - crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (obj >> 0)) & 0xFF]; - crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (obj >> 8)) & 0xFF]; - crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (blkid >> 0)) & 0xFF]; - crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (blkid >> 8)) & 0xFF]; - - crc ^= (osv>>14) ^ (obj>>16) ^ (blkid>>16); - - return (crc); + return (cityhash4((uintptr_t)os, obj, (uint64_t)lvl, blkid)); } #define DBUF_EQUAL(dbuf, os, obj, level, blkid) \ Copied: stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/aggsum.h (from r331404, head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/aggsum.h) ============================================================================== --- /dev/null 00:00:00 1970 (empty, because file is newly added) +++ stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/aggsum.h Mon Apr 16 03:52:54 2018 (r332540, copy of r331404, head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/aggsum.h) @@ -0,0 +1,58 @@ +/* + * CDDL HEADER START + * + * This file and its contents are supplied under the terms of the + * Common Development and Distribution License ("CDDL"), version 1.0. + * You may only use this file in accordance with the terms of version + * 1.0 of the CDDL. + * + * A full copy of the text of the CDDL should have accompanied this + * source. A copy of the CDDL is also available via the Internet at + * http://www.illumos.org/license/CDDL. + * + * CDDL HEADER END + */ +/* + * Copyright (c) 2017 by Delphix. All rights reserved. + */ + +#ifndef _SYS_AGGSUM_H +#define _SYS_AGGSUM_H + +#include <sys/zfs_context.h> + +#ifdef __cplusplus +extern "C" { +#endif + +typedef struct aggsum_bucket { + kmutex_t asc_lock; + int64_t asc_delta; + uint64_t asc_borrowed; + uint64_t asc_pad[4]; /* pad out to cache line (64 bytes) */ +} aggsum_bucket_t __aligned(CACHE_LINE_SIZE); + +/* + * Fan out over FANOUT cpus. + */ +typedef struct aggsum { + kmutex_t as_lock; + int64_t as_lower_bound; + int64_t as_upper_bound; + uint64_t as_numbuckets; + aggsum_bucket_t *as_buckets; +} aggsum_t; + +void aggsum_init(aggsum_t *, uint64_t); +void aggsum_fini(aggsum_t *); +int64_t aggsum_lower_bound(aggsum_t *); +int64_t aggsum_upper_bound(aggsum_t *); +int aggsum_compare(aggsum_t *, uint64_t); +uint64_t aggsum_value(aggsum_t *); +void aggsum_add(aggsum_t *, int64_t); + +#ifdef __cplusplus +} +#endif + +#endif /* _SYS_AGGSUM_H */ Copied: stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/cityhash.h (from r331404, head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/cityhash.h) ============================================================================== --- /dev/null 00:00:00 1970 (empty, because file is newly added) +++ stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/cityhash.h Mon Apr 16 03:52:54 2018 (r332540, copy of r331404, head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/cityhash.h) @@ -0,0 +1,41 @@ +// Copyright (c) 2011 Google, Inc. +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in +// all copies or substantial portions of the Software. +// *** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?201804160352.w3G3qsqA020776>