From owner-svn-src-all@FreeBSD.ORG Sun Apr 18 21:36:34 2010 Return-Path: Delivered-To: svn-src-all@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D9E0A1065680; Sun, 18 Apr 2010 21:36:34 +0000 (UTC) (envelope-from pjd@FreeBSD.org) Received: from svn.freebsd.org (svn.freebsd.org [IPv6:2001:4f8:fff6::2c]) by mx1.freebsd.org (Postfix) with ESMTP id C65578FC16; Sun, 18 Apr 2010 21:36:34 +0000 (UTC) Received: from svn.freebsd.org (localhost [127.0.0.1]) by svn.freebsd.org (8.14.3/8.14.3) with ESMTP id o3ILaYBX000555; Sun, 18 Apr 2010 21:36:34 GMT (envelope-from pjd@svn.freebsd.org) Received: (from pjd@localhost) by svn.freebsd.org (8.14.3/8.14.3/Submit) id o3ILaYHV000549; Sun, 18 Apr 2010 21:36:34 GMT (envelope-from pjd@svn.freebsd.org) Message-Id: <201004182136.o3ILaYHV000549@svn.freebsd.org> From: Pawel Jakub Dawidek Date: Sun, 18 Apr 2010 21:36:34 +0000 (UTC) To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-stable@freebsd.org, svn-src-stable-8@freebsd.org X-SVN-Group: stable-8 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Cc: Subject: svn commit: r206815 - in stable/8/sys: boot/zfs cddl/contrib/opensolaris/uts/common/fs/zfs modules/zfs sys X-BeenThere: svn-src-all@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "SVN commit messages for the entire src tree \(except for " user" and " projects" \)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 18 Apr 2010 21:36:35 -0000 Author: pjd Date: Sun Apr 18 21:36:34 2010 New Revision: 206815 URL: http://svn.freebsd.org/changeset/base/206815 Log: MFC r203504,r204067,r204073,r204101,r204804,r205079,r205080,r205132,r205133, r205134,r205231,r205253,r205264,r205346,r206051,r206667,r206792,r206793, r206794,r206795,r206796,r206797: r203504: Open provider for writting when we find the right one. Opening too much providers for writing provokes huge traffic related to taste events send by GEOM on close. This can lead to various problems with opening GEOM providers that are created on top of other GEOM providers. Reorted by: Kurt Touet , mr Tested by: mr, Baginski Darren r204067: Update comment. We also look for GPT partitions. r204073: Add tunable and sysctl to skip hostid check on pool import. r204101: Don't set f_bsize to recordsize. It might confuse some software (like squid). Submitted by: Alexander Zagrebin r204804: Remove racy assertion. Reported by: Attila Nagy Obtained from: OpenSolaris, Bug ID 6827260 r205079: Remove bogus assertion. Reported by: Johan Ström Obtained from: OpenSolaris, Bug ID 6920880 r205080: Force commit to correct Bug ID: Obtained from: OpenSolaris, Bug ID 6920880 r205132: Don't bottleneck on acquiring the stream locks - this avoids a massive drop off in throughput with large numbers of simultaneous reads r205133: fix compilation under ZIO_USE_UMA r205134: make UMA the default allocator for ZFS buffers - this avoids a great deal of contention in kmem_alloc r205231: - reduce contention by breaking up ARC state locks in to 16 for data and 16 for metadata - export L2ARC tunables as sysctls - add several kstats to track L2ARC state more precisely - avoid holding a contended lock when atomically incrementing a contended counter (no lock protection needed for atomics) r205253: use CACHE_LINE_SIZE instead of hardcoding 128 for lock pad pointed out by Marius Nuennerich and jhb@ r205264: - cache line align arcs_lock array (h/t Marius Nuennerich) - fix ARCS_LOCK_PAD to use architecture defined CACHE_LINE_SIZE - cache line align buf_hash_table ht_locks array r205346: The same code is used to import and to create pool. The order of operations is the following: 1. Try to open vdev by remembered path and guid. 2. If 1 failed, try to find vdev which guid matches and ignore the path. 3. If 2 failed this means either that the vdev we're looking for is gone or that pool is being created and vdev doesn't contain proper guid yet. To be able to handle pool creation we open vdev by path anyway. Because of 3 it is possible that we open wrong vdev on import which can lead to confusions. The solution for this is to check spa_load_state. On pool creation it will be equal to SPA_LOAD_NONE and we can open vdev only by path immediately and if it is not equal to SPA_LOAD_NONE we first open by path+guid and when that fails, we open by guid. We no longer open wrong vdev on import. r206051: IOCPARM_MAX defines maximum size of a structure that can be passed directly to ioctl(2). Because of how ioctl command is build using _IO*() macros we have only 13 bits to encode structure size. So the structure can be up to 8kB-1. Currently we define IOCPARM_MAX as PAGE_SIZE. This is IMHO wrong for three main reasons: 1. It is confusing on archs with page size larger than 8kB (not really sure if we support such archs (sparc64?)), as even if PAGE_SIZE is bigger than 8kB, we won't be able to encode anything larger in ioctl command. 2. It is a waste. Why the structure can be only 4kB on most archs if we have 13 bits dedicated for that, not 12? 3. It shouldn't depend on architecture and page size. My ioctl command can work on one arch, but can't on the other? Increase IOCPARM_MAX to 8kB and make it independed of PAGE_SIZE and architecture it is compiled for. This allows to use all the bits on all the archs for size. Note that this doesn't mean we will copy more on every ioctl(2) call. No. We still copyin(9)/copyout(9) only exact number of bytes encoded in ioctl command. Practical use for this change is ZFS. zfs_cmd_t structure used for ZFS ioctls is larger than 4kB. Silence on: arch@ r206667: Fix 3-way deadlock that can happen because of ZFS and vnode lock order reversal. thread0 (vfs_fhtovp) thread1 (vop_getattr) thread2 (zfs_recv) -------------------- --------------------- ------------------ vn_lock rrw_enter_read rrw_enter_write (hangs) rrw_enter_read (hangs) vn_lock (hangs) Reported by: Attila Nagy r206792: Set ARC_L2_WRITING on L2ARC header creation. Obtained from: OpenSolaris r206793: Remove racy assertion. Obtained from: OpenSolaris r206794: Extend locks scope to match OpenSolaris. r206795: Add missing list and lock destruction. r206796: Style fixes. r206797: Restore previous order. Modified: stable/8/sys/boot/zfs/zfs.c stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_zfetch.c stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c stable/8/sys/modules/zfs/Makefile stable/8/sys/sys/ioccom.h Directory Properties: stable/8/sys/ (props changed) stable/8/sys/amd64/include/xen/ (props changed) stable/8/sys/cddl/contrib/opensolaris/ (props changed) stable/8/sys/contrib/dev/acpica/ (props changed) stable/8/sys/contrib/dev/uath/ (props changed) stable/8/sys/contrib/pf/ (props changed) stable/8/sys/dev/xen/xenpci/ (props changed) Modified: stable/8/sys/boot/zfs/zfs.c ============================================================================== --- stable/8/sys/boot/zfs/zfs.c Sun Apr 18 21:29:28 2010 (r206814) +++ stable/8/sys/boot/zfs/zfs.c Sun Apr 18 21:36:34 2010 (r206815) @@ -397,7 +397,7 @@ zfs_dev_init(void) /* * Open all the disks we can find and see if we can reconstruct * ZFS pools from them. Bogusly assumes that the disks are named - * diskN or diskNsM. + * diskN, diskNpM or diskNsM. */ zfs_init(); for (unit = 0; unit < 32 /* XXX */; unit++) { Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c ============================================================================== --- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c Sun Apr 18 21:29:28 2010 (r206814) +++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c Sun Apr 18 21:36:34 2010 (r206815) @@ -186,6 +186,11 @@ SYSCTL_QUAD(_vfs_zfs, OID_AUTO, arc_min, SYSCTL_INT(_vfs_zfs, OID_AUTO, mdcomp_disable, CTLFLAG_RDTUN, &zfs_mdcomp_disable, 0, "Disable metadata compression"); +#ifdef ZIO_USE_UMA +extern kmem_cache_t *zio_buf_cache[]; +extern kmem_cache_t *zio_data_buf_cache[]; +#endif + /* * Note that buffers can be in one of 6 states: * ARC_anon - anonymous (discussed below) @@ -218,13 +223,31 @@ SYSCTL_INT(_vfs_zfs, OID_AUTO, mdcomp_di * second level ARC benefit from these fast lookups. */ +#define ARCS_LOCK_PAD CACHE_LINE_SIZE +struct arcs_lock { + kmutex_t arcs_lock; +#ifdef _KERNEL + unsigned char pad[(ARCS_LOCK_PAD - sizeof (kmutex_t))]; +#endif +}; + +/* + * must be power of two for mask use to work + * + */ +#define ARC_BUFC_NUMDATALISTS 16 +#define ARC_BUFC_NUMMETADATALISTS 16 +#define ARC_BUFC_NUMLISTS (ARC_BUFC_NUMMETADATALISTS + ARC_BUFC_NUMDATALISTS) + typedef struct arc_state { - list_t arcs_list[ARC_BUFC_NUMTYPES]; /* list of evictable buffers */ uint64_t arcs_lsize[ARC_BUFC_NUMTYPES]; /* amount of evictable data */ uint64_t arcs_size; /* total amount of data in this state */ - kmutex_t arcs_mtx; + list_t arcs_lists[ARC_BUFC_NUMLISTS]; /* list of evictable buffers */ + struct arcs_lock arcs_locks[ARC_BUFC_NUMLISTS] __aligned(CACHE_LINE_SIZE); } arc_state_t; +#define ARCS_LOCK(s, i) (&((s)->arcs_locks[(i)].arcs_lock)) + /* The 6 states: */ static arc_state_t ARC_anon; static arc_state_t ARC_mru; @@ -248,7 +271,9 @@ typedef struct arc_stats { kstat_named_t arcstat_mru_ghost_hits; kstat_named_t arcstat_mfu_hits; kstat_named_t arcstat_mfu_ghost_hits; + kstat_named_t arcstat_allocated; kstat_named_t arcstat_deleted; + kstat_named_t arcstat_stolen; kstat_named_t arcstat_recycle_miss; kstat_named_t arcstat_mutex_miss; kstat_named_t arcstat_evict_skip; @@ -280,6 +305,19 @@ typedef struct arc_stats { kstat_named_t arcstat_l2_size; kstat_named_t arcstat_l2_hdr_size; kstat_named_t arcstat_memory_throttle_count; + kstat_named_t arcstat_l2_write_trylock_fail; + kstat_named_t arcstat_l2_write_passed_headroom; + kstat_named_t arcstat_l2_write_spa_mismatch; + kstat_named_t arcstat_l2_write_in_l2; + kstat_named_t arcstat_l2_write_hdr_io_in_progress; + kstat_named_t arcstat_l2_write_not_cacheable; + kstat_named_t arcstat_l2_write_full; + kstat_named_t arcstat_l2_write_buffer_iter; + kstat_named_t arcstat_l2_write_pios; + kstat_named_t arcstat_l2_write_bytes_written; + kstat_named_t arcstat_l2_write_buffer_bytes_scanned; + kstat_named_t arcstat_l2_write_buffer_list_iter; + kstat_named_t arcstat_l2_write_buffer_list_null_iter; } arc_stats_t; static arc_stats_t arc_stats = { @@ -297,7 +335,9 @@ static arc_stats_t arc_stats = { { "mru_ghost_hits", KSTAT_DATA_UINT64 }, { "mfu_hits", KSTAT_DATA_UINT64 }, { "mfu_ghost_hits", KSTAT_DATA_UINT64 }, + { "allocated", KSTAT_DATA_UINT64 }, { "deleted", KSTAT_DATA_UINT64 }, + { "stolen", KSTAT_DATA_UINT64 }, { "recycle_miss", KSTAT_DATA_UINT64 }, { "mutex_miss", KSTAT_DATA_UINT64 }, { "evict_skip", KSTAT_DATA_UINT64 }, @@ -328,7 +368,20 @@ static arc_stats_t arc_stats = { { "l2_io_error", KSTAT_DATA_UINT64 }, { "l2_size", KSTAT_DATA_UINT64 }, { "l2_hdr_size", KSTAT_DATA_UINT64 }, - { "memory_throttle_count", KSTAT_DATA_UINT64 } + { "memory_throttle_count", KSTAT_DATA_UINT64 }, + { "l2_write_trylock_fail", KSTAT_DATA_UINT64 }, + { "l2_write_passed_headroom", KSTAT_DATA_UINT64 }, + { "l2_write_spa_mismatch", KSTAT_DATA_UINT64 }, + { "l2_write_in_l2", KSTAT_DATA_UINT64 }, + { "l2_write_io_in_progress", KSTAT_DATA_UINT64 }, + { "l2_write_not_cacheable", KSTAT_DATA_UINT64 }, + { "l2_write_full", KSTAT_DATA_UINT64 }, + { "l2_write_buffer_iter", KSTAT_DATA_UINT64 }, + { "l2_write_pios", KSTAT_DATA_UINT64 }, + { "l2_write_bytes_written", KSTAT_DATA_UINT64 }, + { "l2_write_buffer_bytes_scanned", KSTAT_DATA_UINT64 }, + { "l2_write_buffer_list_iter", KSTAT_DATA_UINT64 }, + { "l2_write_buffer_list_null_iter", KSTAT_DATA_UINT64 } }; #define ARCSTAT(stat) (arc_stats.stat.value.ui64) @@ -336,7 +389,7 @@ static arc_stats_t arc_stats = { #define ARCSTAT_INCR(stat, val) \ atomic_add_64(&arc_stats.stat.value.ui64, (val)); -#define ARCSTAT_BUMP(stat) ARCSTAT_INCR(stat, 1) +#define ARCSTAT_BUMP(stat) ARCSTAT_INCR(stat, 1) #define ARCSTAT_BUMPDOWN(stat) ARCSTAT_INCR(stat, -1) #define ARCSTAT_MAX(stat, val) { \ @@ -370,7 +423,7 @@ static arc_stats_t arc_stats = { } kstat_t *arc_ksp; -static arc_state_t *arc_anon; +static arc_state_t *arc_anon; static arc_state_t *arc_mru; static arc_state_t *arc_mru_ghost; static arc_state_t *arc_mfu; @@ -514,7 +567,7 @@ static void arc_evict_ghost(arc_state_t * Hash table routines */ -#define HT_LOCK_PAD 128 +#define HT_LOCK_PAD CACHE_LINE_SIZE struct ht_lock { kmutex_t ht_lock; @@ -527,7 +580,7 @@ struct ht_lock { typedef struct buf_hash_table { uint64_t ht_mask; arc_buf_hdr_t **ht_table; - struct ht_lock ht_locks[BUF_LOCKS]; + struct ht_lock ht_locks[BUF_LOCKS] __aligned(CACHE_LINE_SIZE); } buf_hash_table_t; static buf_hash_table_t buf_hash_table; @@ -541,13 +594,19 @@ static buf_hash_table_t buf_hash_table; uint64_t zfs_crc64_table[256]; +#ifdef ZIO_USE_UMA +extern kmem_cache_t *zio_buf_cache[]; +extern kmem_cache_t *zio_data_buf_cache[]; +#endif + /* * Level 2 ARC */ -#define L2ARC_WRITE_SIZE (8 * 1024 * 1024) /* initial write max */ -#define L2ARC_HEADROOM 4 /* num of writes */ +#define L2ARC_WRITE_SIZE (64 * 1024 * 1024) /* initial write max */ +#define L2ARC_HEADROOM 128 /* num of writes */ #define L2ARC_FEED_SECS 1 /* caching interval */ +#define L2ARC_FEED_SECS_SHIFT 1 /* caching interval shift */ #define l2arc_writes_sent ARCSTAT(arcstat_l2_writes_sent) #define l2arc_writes_done ARCSTAT(arcstat_l2_writes_done) @@ -559,7 +618,66 @@ uint64_t l2arc_write_max = L2ARC_WRITE_S uint64_t l2arc_write_boost = L2ARC_WRITE_SIZE; /* extra write during warmup */ uint64_t l2arc_headroom = L2ARC_HEADROOM; /* number of dev writes */ uint64_t l2arc_feed_secs = L2ARC_FEED_SECS; /* interval seconds */ -boolean_t l2arc_noprefetch = B_TRUE; /* don't cache prefetch bufs */ +uint64_t l2arc_feed_secs_shift = L2ARC_FEED_SECS_SHIFT; /* interval seconds shift */ +boolean_t l2arc_noprefetch = B_FALSE; /* don't cache prefetch bufs */ + + +SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2arc_write_max, CTLFLAG_RW, + &l2arc_write_max, 0, "max write size"); +SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2arc_write_boost, CTLFLAG_RW, + &l2arc_write_boost, 0, "extra write during warmup"); +SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2arc_headroom, CTLFLAG_RW, + &l2arc_headroom, 0, "number of dev writes"); +SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2arc_feed_secs, CTLFLAG_RW, + &l2arc_feed_secs, 0, "interval seconds"); +SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2arc_feed_secs_shift, CTLFLAG_RW, + &l2arc_feed_secs_shift, 0, "power of 2 division of feed seconds"); + +SYSCTL_INT(_vfs_zfs, OID_AUTO, l2arc_noprefetch, CTLFLAG_RW, + &l2arc_noprefetch, 0, "don't cache prefetch bufs"); + + +SYSCTL_QUAD(_vfs_zfs, OID_AUTO, anon_size, CTLFLAG_RD, + &ARC_anon.arcs_size, 0, "size of anonymous state"); +SYSCTL_QUAD(_vfs_zfs, OID_AUTO, anon_metadata_lsize, CTLFLAG_RD, + &ARC_anon.arcs_lsize[ARC_BUFC_METADATA], 0, "size of anonymous state"); +SYSCTL_QUAD(_vfs_zfs, OID_AUTO, anon_data_lsize, CTLFLAG_RD, + &ARC_anon.arcs_lsize[ARC_BUFC_DATA], 0, "size of anonymous state"); + +SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_size, CTLFLAG_RD, + &ARC_mru.arcs_size, 0, "size of mru state"); +SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_metadata_lsize, CTLFLAG_RD, + &ARC_mru.arcs_lsize[ARC_BUFC_METADATA], 0, "size of metadata in mru state"); +SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_data_lsize, CTLFLAG_RD, + &ARC_mru.arcs_lsize[ARC_BUFC_DATA], 0, "size of data in mru state"); + +SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_ghost_size, CTLFLAG_RD, + &ARC_mru_ghost.arcs_size, 0, "size of mru ghost state"); +SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_ghost_metadata_lsize, CTLFLAG_RD, + &ARC_mru_ghost.arcs_lsize[ARC_BUFC_METADATA], 0, + "size of metadata in mru ghost state"); +SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_ghost_data_lsize, CTLFLAG_RD, + &ARC_mru_ghost.arcs_lsize[ARC_BUFC_DATA], 0, + "size of data in mru ghost state"); + +SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_size, CTLFLAG_RD, + &ARC_mfu.arcs_size, 0, "size of mfu state"); +SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_metadata_lsize, CTLFLAG_RD, + &ARC_mfu.arcs_lsize[ARC_BUFC_METADATA], 0, "size of metadata in mfu state"); +SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_data_lsize, CTLFLAG_RD, + &ARC_mfu.arcs_lsize[ARC_BUFC_DATA], 0, "size of data in mfu state"); + +SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_ghost_size, CTLFLAG_RD, + &ARC_mfu_ghost.arcs_size, 0, "size of mfu ghost state"); +SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_ghost_metadata_lsize, CTLFLAG_RD, + &ARC_mfu_ghost.arcs_lsize[ARC_BUFC_METADATA], 0, + "size of metadata in mfu ghost state"); +SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_ghost_data_lsize, CTLFLAG_RD, + &ARC_mfu_ghost.arcs_lsize[ARC_BUFC_DATA], 0, + "size of data in mfu ghost state"); + +SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2c_only_size, CTLFLAG_RD, + &ARC_l2c_only.arcs_size, 0, "size of mru state"); /* * L2ARC Internals @@ -953,18 +1071,38 @@ arc_buf_freeze(arc_buf_t *buf) } static void +get_buf_info(arc_buf_hdr_t *ab, arc_state_t *state, list_t **list, kmutex_t **lock) +{ + uint64_t buf_hashid = buf_hash(ab->b_spa, &ab->b_dva, ab->b_birth); + + if (ab->b_type == ARC_BUFC_METADATA) + buf_hashid &= (ARC_BUFC_NUMMETADATALISTS - 1); + else { + buf_hashid &= (ARC_BUFC_NUMDATALISTS - 1); + buf_hashid += ARC_BUFC_NUMMETADATALISTS; + } + + *list = &state->arcs_lists[buf_hashid]; + *lock = ARCS_LOCK(state, buf_hashid); +} + + +static void add_reference(arc_buf_hdr_t *ab, kmutex_t *hash_lock, void *tag) { + ASSERT(MUTEX_HELD(hash_lock)); if ((refcount_add(&ab->b_refcnt, tag) == 1) && (ab->b_state != arc_anon)) { uint64_t delta = ab->b_size * ab->b_datacnt; - list_t *list = &ab->b_state->arcs_list[ab->b_type]; uint64_t *size = &ab->b_state->arcs_lsize[ab->b_type]; + list_t *list; + kmutex_t *lock; - ASSERT(!MUTEX_HELD(&ab->b_state->arcs_mtx)); - mutex_enter(&ab->b_state->arcs_mtx); + get_buf_info(ab, ab->b_state, &list, &lock); + ASSERT(!MUTEX_HELD(lock)); + mutex_enter(lock); ASSERT(list_link_active(&ab->b_arc_node)); list_remove(list, ab); if (GHOST_STATE(ab->b_state)) { @@ -975,7 +1113,7 @@ add_reference(arc_buf_hdr_t *ab, kmutex_ ASSERT(delta > 0); ASSERT3U(*size, >=, delta); atomic_add_64(size, -delta); - mutex_exit(&ab->b_state->arcs_mtx); + mutex_exit(lock); /* remove the prefetch flag if we get a reference */ if (ab->b_flags & ARC_PREFETCH) ab->b_flags &= ~ARC_PREFETCH; @@ -994,14 +1132,17 @@ remove_reference(arc_buf_hdr_t *ab, kmut if (((cnt = refcount_remove(&ab->b_refcnt, tag)) == 0) && (state != arc_anon)) { uint64_t *size = &state->arcs_lsize[ab->b_type]; + list_t *list; + kmutex_t *lock; - ASSERT(!MUTEX_HELD(&state->arcs_mtx)); - mutex_enter(&state->arcs_mtx); + get_buf_info(ab, state, &list, &lock); + ASSERT(!MUTEX_HELD(lock)); + mutex_enter(lock); ASSERT(!list_link_active(&ab->b_arc_node)); - list_insert_head(&state->arcs_list[ab->b_type], ab); + list_insert_head(list, ab); ASSERT(ab->b_datacnt > 0); atomic_add_64(size, ab->b_size * ab->b_datacnt); - mutex_exit(&state->arcs_mtx); + mutex_exit(lock); } return (cnt); } @@ -1016,6 +1157,8 @@ arc_change_state(arc_state_t *new_state, arc_state_t *old_state = ab->b_state; int64_t refcnt = refcount_count(&ab->b_refcnt); uint64_t from_delta, to_delta; + list_t *list; + kmutex_t *lock; ASSERT(MUTEX_HELD(hash_lock)); ASSERT(new_state != old_state); @@ -1030,14 +1173,16 @@ arc_change_state(arc_state_t *new_state, */ if (refcnt == 0) { if (old_state != arc_anon) { - int use_mutex = !MUTEX_HELD(&old_state->arcs_mtx); + int use_mutex; uint64_t *size = &old_state->arcs_lsize[ab->b_type]; + get_buf_info(ab, old_state, &list, &lock); + use_mutex = !MUTEX_HELD(lock); if (use_mutex) - mutex_enter(&old_state->arcs_mtx); + mutex_enter(lock); ASSERT(list_link_active(&ab->b_arc_node)); - list_remove(&old_state->arcs_list[ab->b_type], ab); + list_remove(list, ab); /* * If prefetching out of the ghost cache, @@ -1052,16 +1197,18 @@ arc_change_state(arc_state_t *new_state, atomic_add_64(size, -from_delta); if (use_mutex) - mutex_exit(&old_state->arcs_mtx); + mutex_exit(lock); } if (new_state != arc_anon) { - int use_mutex = !MUTEX_HELD(&new_state->arcs_mtx); + int use_mutex; uint64_t *size = &new_state->arcs_lsize[ab->b_type]; + get_buf_info(ab, new_state, &list, &lock); + use_mutex = !MUTEX_HELD(lock); if (use_mutex) - mutex_enter(&new_state->arcs_mtx); + mutex_enter(lock); - list_insert_head(&new_state->arcs_list[ab->b_type], ab); + list_insert_head(list, ab); /* ghost elements have a ghost size */ if (GHOST_STATE(new_state)) { @@ -1072,7 +1219,7 @@ arc_change_state(arc_state_t *new_state, atomic_add_64(size, to_delta); if (use_mutex) - mutex_exit(&new_state->arcs_mtx); + mutex_exit(lock); } } @@ -1462,21 +1609,48 @@ arc_evict(arc_state_t *state, spa_t *spa { arc_state_t *evicted_state; uint64_t bytes_evicted = 0, skipped = 0, missed = 0; + int64_t bytes_remaining; arc_buf_hdr_t *ab, *ab_prev = NULL; - list_t *list = &state->arcs_list[type]; + list_t *evicted_list, *list, *evicted_list_start, *list_start; + kmutex_t *lock, *evicted_lock; kmutex_t *hash_lock; boolean_t have_lock; void *stolen = NULL; + static int evict_metadata_offset, evict_data_offset; + int i, idx, offset, list_count, count; ASSERT(state == arc_mru || state == arc_mfu); evicted_state = (state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost; - mutex_enter(&state->arcs_mtx); - mutex_enter(&evicted_state->arcs_mtx); + if (type == ARC_BUFC_METADATA) { + offset = 0; + list_count = ARC_BUFC_NUMMETADATALISTS; + list_start = &state->arcs_lists[0]; + evicted_list_start = &evicted_state->arcs_lists[0]; + idx = evict_metadata_offset; + } else { + offset = ARC_BUFC_NUMMETADATALISTS; + list_start = &state->arcs_lists[offset]; + evicted_list_start = &evicted_state->arcs_lists[offset]; + list_count = ARC_BUFC_NUMDATALISTS; + idx = evict_data_offset; + } + bytes_remaining = evicted_state->arcs_lsize[type]; + count = 0; + +evict_start: + list = &list_start[idx]; + evicted_list = &evicted_list_start[idx]; + lock = ARCS_LOCK(state, (offset + idx)); + evicted_lock = ARCS_LOCK(evicted_state, (offset + idx)); + + mutex_enter(lock); + mutex_enter(evicted_lock); for (ab = list_tail(list); ab; ab = ab_prev) { ab_prev = list_prev(list, ab); + bytes_remaining -= (ab->b_size * ab->b_datacnt); /* prefetch buffers have a minimum lifespan */ if (HDR_IO_IN_PROGRESS(ab) || (spa && ab->b_spa != spa) || @@ -1536,17 +1710,35 @@ arc_evict(arc_state_t *state, spa_t *spa mutex_exit(hash_lock); if (bytes >= 0 && bytes_evicted >= bytes) break; + if (bytes_remaining > 0) { + mutex_exit(evicted_lock); + mutex_exit(lock); + idx = ((idx + 1) & (list_count - 1)); + count++; + goto evict_start; + } } else { missed += 1; } } - mutex_exit(&evicted_state->arcs_mtx); - mutex_exit(&state->arcs_mtx); + mutex_exit(evicted_lock); + mutex_exit(lock); + + idx = ((idx + 1) & (list_count - 1)); + count++; - if (bytes_evicted < bytes) - dprintf("only evicted %lld bytes from %x", - (longlong_t)bytes_evicted, state); + if (bytes_evicted < bytes) { + if (count < list_count) + goto evict_start; + else + dprintf("only evicted %lld bytes from %x", + (longlong_t)bytes_evicted, state); + } + if (type == ARC_BUFC_METADATA) + evict_metadata_offset = idx; + else + evict_data_offset = idx; if (skipped) ARCSTAT_INCR(arcstat_evict_skip, skipped); @@ -1574,6 +1766,8 @@ arc_evict(arc_state_t *state, spa_t *spa arc_evict_ghost(arc_mfu_ghost, NULL, todelete); } } + if (stolen) + ARCSTAT_BUMP(arcstat_stolen); return (stolen); } @@ -1586,14 +1780,28 @@ static void arc_evict_ghost(arc_state_t *state, spa_t *spa, int64_t bytes) { arc_buf_hdr_t *ab, *ab_prev; - list_t *list = &state->arcs_list[ARC_BUFC_DATA]; - kmutex_t *hash_lock; + list_t *list, *list_start; + kmutex_t *hash_lock, *lock; uint64_t bytes_deleted = 0; uint64_t bufs_skipped = 0; + static int evict_offset; + int list_count, idx = evict_offset; + int offset, count = 0; ASSERT(GHOST_STATE(state)); -top: - mutex_enter(&state->arcs_mtx); + + /* + * data lists come after metadata lists + */ + list_start = &state->arcs_lists[ARC_BUFC_NUMMETADATALISTS]; + list_count = ARC_BUFC_NUMDATALISTS; + offset = ARC_BUFC_NUMMETADATALISTS; + +evict_start: + list = &list_start[idx]; + lock = ARCS_LOCK(state, idx + offset); + + mutex_enter(lock); for (ab = list_tail(list); ab; ab = ab_prev) { ab_prev = list_prev(list, ab); if (spa && ab->b_spa != spa) @@ -1623,20 +1831,31 @@ top: break; } else { if (bytes < 0) { - mutex_exit(&state->arcs_mtx); + /* + * we're draining the ARC, retry + */ + mutex_exit(lock); mutex_enter(hash_lock); mutex_exit(hash_lock); - goto top; + goto evict_start; } bufs_skipped += 1; } } - mutex_exit(&state->arcs_mtx); + mutex_exit(lock); + idx = ((idx + 1) & (ARC_BUFC_NUMDATALISTS - 1)); + count++; + + if (count < list_count) + goto evict_start; - if (list == &state->arcs_list[ARC_BUFC_DATA] && + evict_offset = idx; + if ((uintptr_t)list > (uintptr_t)&state->arcs_lists[ARC_BUFC_NUMMETADATALISTS] && (bytes < 0 || bytes_deleted < bytes)) { - list = &state->arcs_list[ARC_BUFC_METADATA]; - goto top; + list_start = &state->arcs_lists[0]; + list_count = ARC_BUFC_NUMMETADATALISTS; + offset = count = 0; + goto evict_start; } if (bufs_skipped) { @@ -1718,7 +1937,7 @@ arc_do_user_evicts(void) /* * Move list over to avoid LOR */ -restart: +restart: mutex_enter(&arc_eviction_mtx); tmp_arc_eviction_list = arc_eviction_list; arc_eviction_list = NULL; @@ -1750,22 +1969,22 @@ restart: void arc_flush(spa_t *spa) { - while (list_head(&arc_mru->arcs_list[ARC_BUFC_DATA])) { + while (arc_mru->arcs_lsize[ARC_BUFC_DATA]) { (void) arc_evict(arc_mru, spa, -1, FALSE, ARC_BUFC_DATA); if (spa) break; } - while (list_head(&arc_mru->arcs_list[ARC_BUFC_METADATA])) { + while (arc_mru->arcs_lsize[ARC_BUFC_METADATA]) { (void) arc_evict(arc_mru, spa, -1, FALSE, ARC_BUFC_METADATA); if (spa) break; } - while (list_head(&arc_mfu->arcs_list[ARC_BUFC_DATA])) { + while (arc_mfu->arcs_lsize[ARC_BUFC_DATA]) { (void) arc_evict(arc_mfu, spa, -1, FALSE, ARC_BUFC_DATA); if (spa) break; } - while (list_head(&arc_mfu->arcs_list[ARC_BUFC_METADATA])) { + while (arc_mfu->arcs_lsize[ARC_BUFC_METADATA]) { (void) arc_evict(arc_mfu, spa, -1, FALSE, ARC_BUFC_METADATA); if (spa) break; @@ -1829,7 +2048,7 @@ arc_reclaim_needed(void) return (0); /* - * If pages are needed or we're within 2048 pages + * If pages are needed or we're within 2048 pages * of needing to page need to reclaim */ if (vm_pages_needed || (vm_paging_target() > -2048)) @@ -1896,8 +2115,6 @@ arc_kmem_reap_now(arc_reclaim_strategy_t size_t i; kmem_cache_t *prev_cache = NULL; kmem_cache_t *prev_data_cache = NULL; - extern kmem_cache_t *zio_buf_cache[]; - extern kmem_cache_t *zio_data_buf_cache[]; #endif #ifdef _KERNEL @@ -2203,6 +2420,7 @@ out: arc_anon->arcs_size + arc_mru->arcs_size > arc_p) arc_p = MIN(arc_c, arc_p + size); } + ARCSTAT_BUMP(arcstat_allocated); } /* @@ -2502,7 +2720,6 @@ arc_read(zio_t *pio, spa_t *spa, blkptr_ uint32_t *arc_flags, const zbookmark_t *zb) { int err; - arc_buf_hdr_t *hdr = pbuf->b_hdr; ASSERT(!refcount_is_zero(&pbuf->b_hdr->b_refcnt)); ASSERT3U((char *)bp - (char *)pbuf->b_data, <, pbuf->b_hdr->b_size); @@ -2510,8 +2727,6 @@ arc_read(zio_t *pio, spa_t *spa, blkptr_ err = arc_read_nolock(pio, spa, bp, done, private, priority, zio_flags, arc_flags, zb); - - ASSERT3P(hdr, ==, pbuf->b_hdr); rw_exit(&pbuf->b_lock); return (err); } @@ -2728,7 +2943,7 @@ top: * released by l2arc_read_done(). */ rzio = zio_read_phys(pio, vd, addr, size, - buf->b_data, ZIO_CHECKSUM_OFF, + buf->b_data, ZIO_CHECKSUM_OFF, l2arc_read_done, cb, priority, zio_flags | ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | @@ -2823,6 +3038,8 @@ arc_buf_evict(arc_buf_t *buf) arc_buf_hdr_t *hdr; kmutex_t *hash_lock; arc_buf_t **bufp; + list_t *list, *evicted_list; + kmutex_t *lock, *evicted_lock; rw_enter(&buf->b_lock, RW_WRITER); hdr = buf->b_hdr; @@ -2871,16 +3088,18 @@ arc_buf_evict(arc_buf_t *buf) evicted_state = (old_state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost; - mutex_enter(&old_state->arcs_mtx); - mutex_enter(&evicted_state->arcs_mtx); + get_buf_info(hdr, old_state, &list, &lock); + get_buf_info(hdr, evicted_state, &evicted_list, &evicted_lock); + mutex_enter(lock); + mutex_enter(evicted_lock); arc_change_state(evicted_state, hdr, hash_lock); ASSERT(HDR_IN_HASH_TABLE(hdr)); hdr->b_flags |= ARC_IN_HASH_TABLE; hdr->b_flags &= ~ARC_BUF_AVAILABLE; - mutex_exit(&evicted_state->arcs_mtx); - mutex_exit(&old_state->arcs_mtx); + mutex_exit(evicted_lock); + mutex_exit(lock); } mutex_exit(hash_lock); rw_exit(&buf->b_lock); @@ -3426,7 +3645,8 @@ void arc_init(void) { int prefetch_tunable_set = 0; - + int i; + mutex_init(&arc_reclaim_thr_lock, NULL, MUTEX_DEFAULT, NULL); cv_init(&arc_reclaim_thr_cv, NULL, CV_DEFAULT, NULL); mutex_init(&arc_lowmem_lock, NULL, MUTEX_DEFAULT, NULL); @@ -3494,33 +3714,33 @@ arc_init(void) arc_l2c_only = &ARC_l2c_only; arc_size = 0; - mutex_init(&arc_anon->arcs_mtx, NULL, MUTEX_DEFAULT, NULL); - mutex_init(&arc_mru->arcs_mtx, NULL, MUTEX_DEFAULT, NULL); - mutex_init(&arc_mru_ghost->arcs_mtx, NULL, MUTEX_DEFAULT, NULL); - mutex_init(&arc_mfu->arcs_mtx, NULL, MUTEX_DEFAULT, NULL); - mutex_init(&arc_mfu_ghost->arcs_mtx, NULL, MUTEX_DEFAULT, NULL); - mutex_init(&arc_l2c_only->arcs_mtx, NULL, MUTEX_DEFAULT, NULL); - - list_create(&arc_mru->arcs_list[ARC_BUFC_METADATA], - sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node)); - list_create(&arc_mru->arcs_list[ARC_BUFC_DATA], - sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node)); - list_create(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA], - sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node)); - list_create(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA], - sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node)); - list_create(&arc_mfu->arcs_list[ARC_BUFC_METADATA], - sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node)); - list_create(&arc_mfu->arcs_list[ARC_BUFC_DATA], - sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node)); - list_create(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA], - sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node)); - list_create(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA], - sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node)); - list_create(&arc_l2c_only->arcs_list[ARC_BUFC_METADATA], - sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node)); - list_create(&arc_l2c_only->arcs_list[ARC_BUFC_DATA], - sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node)); + for (i = 0; i < ARC_BUFC_NUMLISTS; i++) { + mutex_init(&arc_anon->arcs_locks[i].arcs_lock, + NULL, MUTEX_DEFAULT, NULL); + mutex_init(&arc_mru->arcs_locks[i].arcs_lock, + NULL, MUTEX_DEFAULT, NULL); + mutex_init(&arc_mru_ghost->arcs_locks[i].arcs_lock, + NULL, MUTEX_DEFAULT, NULL); + mutex_init(&arc_mfu->arcs_locks[i].arcs_lock, + NULL, MUTEX_DEFAULT, NULL); + mutex_init(&arc_mfu_ghost->arcs_locks[i].arcs_lock, + NULL, MUTEX_DEFAULT, NULL); + mutex_init(&arc_l2c_only->arcs_locks[i].arcs_lock, + NULL, MUTEX_DEFAULT, NULL); + + list_create(&arc_mru->arcs_lists[i], + sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node)); + list_create(&arc_mru_ghost->arcs_lists[i], + sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node)); + list_create(&arc_mfu->arcs_lists[i], + sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node)); + list_create(&arc_mfu_ghost->arcs_lists[i], + sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node)); + list_create(&arc_mfu_ghost->arcs_lists[i], + sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node)); + list_create(&arc_l2c_only->arcs_lists[i], + sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node)); + } buf_init(); @@ -3557,7 +3777,7 @@ arc_init(void) #ifdef _KERNEL if (TUNABLE_INT_FETCH("vfs.zfs.prefetch_disable", &zfs_prefetch_disable)) prefetch_tunable_set = 1; - + #ifdef __i386__ if (prefetch_tunable_set == 0) { printf("ZFS NOTICE: Prefetch is disabled by default on i386 " @@ -3566,7 +3786,7 @@ arc_init(void) "to /boot/loader.conf.\n"); zfs_prefetch_disable=1; } -#else +#else if ((((uint64_t)physmem * PAGESIZE) < (1ULL << 32)) && prefetch_tunable_set == 0) { printf("ZFS NOTICE: Prefetch is disabled by default if less " @@ -3575,7 +3795,7 @@ arc_init(void) "to /boot/loader.conf.\n"); zfs_prefetch_disable=1; } -#endif +#endif /* Warn about ZFS memory and address space requirements. */ if (((uint64_t)physmem * PAGESIZE) < (256 + 128 + 64) * (1 << 20)) { printf("ZFS WARNING: Recommended minimum RAM size is 512MB; " @@ -3594,6 +3814,7 @@ arc_init(void) void arc_fini(void) { + int i; mutex_enter(&arc_reclaim_thr_lock); arc_thread_exit = 1; @@ -3615,20 +3836,20 @@ arc_fini(void) mutex_destroy(&arc_reclaim_thr_lock); cv_destroy(&arc_reclaim_thr_cv); - list_destroy(&arc_mru->arcs_list[ARC_BUFC_METADATA]); - list_destroy(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA]); - list_destroy(&arc_mfu->arcs_list[ARC_BUFC_METADATA]); - list_destroy(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA]); - list_destroy(&arc_mru->arcs_list[ARC_BUFC_DATA]); - list_destroy(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA]); - list_destroy(&arc_mfu->arcs_list[ARC_BUFC_DATA]); - list_destroy(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA]); - - mutex_destroy(&arc_anon->arcs_mtx); - mutex_destroy(&arc_mru->arcs_mtx); - mutex_destroy(&arc_mru_ghost->arcs_mtx); - mutex_destroy(&arc_mfu->arcs_mtx); - mutex_destroy(&arc_mfu_ghost->arcs_mtx); + for (i = 0; i < ARC_BUFC_NUMLISTS; i++) { + list_destroy(&arc_mru->arcs_lists[i]); + list_destroy(&arc_mru_ghost->arcs_lists[i]); + list_destroy(&arc_mfu->arcs_lists[i]); + list_destroy(&arc_mfu_ghost->arcs_lists[i]); + list_destroy(&arc_l2c_only->arcs_lists[i]); + + mutex_destroy(&arc_anon->arcs_locks[i].arcs_lock); + mutex_destroy(&arc_mru->arcs_locks[i].arcs_lock); + mutex_destroy(&arc_mru_ghost->arcs_locks[i].arcs_lock); + mutex_destroy(&arc_mfu->arcs_locks[i].arcs_lock); + mutex_destroy(&arc_mfu_ghost->arcs_locks[i].arcs_lock); + mutex_destroy(&arc_l2c_only->arcs_locks[i].arcs_lock); + } mutex_destroy(&zfs_write_limit_lock); @@ -4024,26 +4245,27 @@ static list_t * l2arc_list_locked(int list_num, kmutex_t **lock) { list_t *list; + int idx; - ASSERT(list_num >= 0 && list_num <= 3); + ASSERT(list_num >= 0 && list_num < 2 * ARC_BUFC_NUMLISTS); - switch (list_num) { - case 0: - list = &arc_mfu->arcs_list[ARC_BUFC_METADATA]; - *lock = &arc_mfu->arcs_mtx; - break; - case 1: - list = &arc_mru->arcs_list[ARC_BUFC_METADATA]; - *lock = &arc_mru->arcs_mtx; - break; - case 2: - list = &arc_mfu->arcs_list[ARC_BUFC_DATA]; - *lock = &arc_mfu->arcs_mtx; - break; - case 3: - list = &arc_mru->arcs_list[ARC_BUFC_DATA]; - *lock = &arc_mru->arcs_mtx; - break; + if (list_num < ARC_BUFC_NUMMETADATALISTS) { + idx = list_num; + list = &arc_mfu->arcs_lists[idx]; + *lock = ARCS_LOCK(arc_mfu, idx); + } else if (list_num < ARC_BUFC_NUMMETADATALISTS * 2) { + idx = list_num - ARC_BUFC_NUMMETADATALISTS; + list = &arc_mru->arcs_lists[idx]; + *lock = ARCS_LOCK(arc_mru, idx); + } else if (list_num < (ARC_BUFC_NUMMETADATALISTS * 2 + + ARC_BUFC_NUMDATALISTS)) { + idx = list_num - ARC_BUFC_NUMMETADATALISTS; + list = &arc_mfu->arcs_lists[idx]; + *lock = ARCS_LOCK(arc_mfu, idx); + } else { + idx = list_num - ARC_BUFC_NUMLISTS; + list = &arc_mru->arcs_lists[idx]; + *lock = ARCS_LOCK(arc_mru, idx); } ASSERT(!(MUTEX_HELD(*lock))); @@ -4210,13 +4432,15 @@ l2arc_write_buffers(spa_t *spa, l2arc_de head = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE); head->b_flags |= ARC_L2_WRITE_HEAD; + ARCSTAT_BUMP(arcstat_l2_write_buffer_iter); /* * Copy buffers for L2ARC writing. */ mutex_enter(&l2arc_buflist_mtx); - for (try = 0; try <= 3; try++) { + for (try = 0; try < 2 * ARC_BUFC_NUMLISTS; try++) { list = l2arc_list_locked(try, &list_lock); passed_sz = 0; + ARCSTAT_BUMP(arcstat_l2_write_buffer_list_iter); /* * L2ARC fast warmup. @@ -4229,52 +4453,65 @@ l2arc_write_buffers(spa_t *spa, l2arc_de ab = list_head(list); else ab = list_tail(list); + if (ab == NULL) + ARCSTAT_BUMP(arcstat_l2_write_buffer_list_null_iter); for (; ab; ab = ab_prev) { if (arc_warm == B_FALSE) ab_prev = list_next(list, ab); else ab_prev = list_prev(list, ab); + ARCSTAT_INCR(arcstat_l2_write_buffer_bytes_scanned, ab->b_size); hash_lock = HDR_LOCK(ab); have_lock = MUTEX_HELD(hash_lock); if (!have_lock && !mutex_tryenter(hash_lock)) { + ARCSTAT_BUMP(arcstat_l2_write_trylock_fail); /* * Skip this buffer rather than waiting. */ continue; } + if (ab->b_l2hdr != NULL) { + /* + * Already in L2ARC. + */ + mutex_exit(hash_lock); + ARCSTAT_BUMP(arcstat_l2_write_in_l2); + continue; + } + passed_sz += ab->b_size; if (passed_sz > headroom) { /* * Searched too far. */ mutex_exit(hash_lock); + ARCSTAT_BUMP(arcstat_l2_write_passed_headroom); break; } if (ab->b_spa != spa) { mutex_exit(hash_lock); + ARCSTAT_BUMP(arcstat_l2_write_spa_mismatch); continue; } - if (ab->b_l2hdr != NULL) { - /* - * Already in L2ARC. - */ + if (HDR_IO_IN_PROGRESS(ab)) { mutex_exit(hash_lock); + ARCSTAT_BUMP(arcstat_l2_write_hdr_io_in_progress); continue; } - - if (HDR_IO_IN_PROGRESS(ab) || !HDR_L2CACHE(ab)) { + if (!HDR_L2CACHE(ab)) { mutex_exit(hash_lock); + ARCSTAT_BUMP(arcstat_l2_write_not_cacheable); continue; } - if ((write_sz + ab->b_size) > target_sz) { full = B_TRUE; mutex_exit(hash_lock); + ARCSTAT_BUMP(arcstat_l2_write_full); break; } @@ -4298,8 +4535,10 @@ l2arc_write_buffers(spa_t *spa, l2arc_de cb->l2wcb_head = head; pio = zio_root(spa, l2arc_write_done, cb, ZIO_FLAG_CANFAIL); + ARCSTAT_BUMP(arcstat_l2_write_pios); } + ARCSTAT_INCR(arcstat_l2_write_bytes_written, ab->b_size); /* * Create and add a new L2ARC header. */ @@ -4395,7 +4634,7 @@ l2arc_feed_thread(void *dummy __unused) */ CALLB_CPR_SAFE_BEGIN(&cpr); (void) cv_timedwait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock, - hz * l2arc_feed_secs); + hz * l2arc_feed_secs >> l2arc_feed_secs_shift); CALLB_CPR_SAFE_END(&cpr, &l2arc_feed_thr_lock); /* Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c ============================================================================== --- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c Sun Apr 18 21:29:28 2010 (r206814) +++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c Sun Apr 18 21:36:34 2010 (r206815) @@ -2210,9 +2210,6 @@ dbuf_write_ready(zio_t *zio, arc_buf_t * for (i = db->db.db_size >> SPA_BLKPTRSHIFT; i > 0; i--, ibp++) { if (BP_IS_HOLE(ibp)) continue; - ASSERT3U(BP_GET_LSIZE(ibp), ==, - db->db_level == 1 ? dn->dn_datablksz : - (1<dn_phys->dn_indblkshift)); fill += ibp->blk_fill; } } Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_zfetch.c ============================================================================== --- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_zfetch.c Sun Apr 18 21:29:28 2010 (r206814) +++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_zfetch.c Sun Apr 18 21:36:34 2010 (r206815) @@ -49,11 +49,11 @@ uint32_t zfetch_block_cap = 256; uint64_t zfetch_array_rd_sz = 1024 * 1024; SYSCTL_DECL(_vfs_zfs); -SYSCTL_INT(_vfs_zfs, OID_AUTO, prefetch_disable, CTLFLAG_RDTUN, +SYSCTL_INT(_vfs_zfs, OID_AUTO, prefetch_disable, CTLFLAG_RW, &zfs_prefetch_disable, 0, "Disable prefetch"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, zfetch, CTLFLAG_RW, 0, "ZFS ZFETCH"); TUNABLE_INT("vfs.zfs.zfetch.max_streams", &zfetch_max_streams); -SYSCTL_UINT(_vfs_zfs_zfetch, OID_AUTO, max_streams, CTLFLAG_RDTUN, +SYSCTL_UINT(_vfs_zfs_zfetch, OID_AUTO, max_streams, CTLFLAG_RW, &zfetch_max_streams, 0, "Max # of streams per zfetch"); TUNABLE_INT("vfs.zfs.zfetch.min_sec_reap", &zfetch_min_sec_reap); SYSCTL_UINT(_vfs_zfs_zfetch, OID_AUTO, min_sec_reap, CTLFLAG_RDTUN, @@ -338,8 +338,10 @@ top: *** DIFF OUTPUT TRUNCATED AT 1000 LINES ***