From owner-svn-src-stable-11@freebsd.org Thu Apr 2 00:50:51 2020 Return-Path: Delivered-To: svn-src-stable-11@mailman.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.nyi.freebsd.org (Postfix) with ESMTP id 53DD52AA9AC; Thu, 2 Apr 2020 00:50:51 +0000 (UTC) (envelope-from mav@FreeBSD.org) Received: from mxrelay.nyi.freebsd.org (mxrelay.nyi.freebsd.org [IPv6:2610:1c1:1:606c::19:3]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) server-signature RSA-PSS (4096 bits) client-signature RSA-PSS (4096 bits) client-digest SHA256) (Client CN "mxrelay.nyi.freebsd.org", Issuer "Let's Encrypt Authority X3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 48t4HZ06Psz3xRs; Thu, 2 Apr 2020 00:50:50 +0000 (UTC) (envelope-from mav@FreeBSD.org) Received: from repo.freebsd.org (repo.freebsd.org [IPv6:2610:1c1:1:6068::e6a:0]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mxrelay.nyi.freebsd.org (Postfix) with ESMTPS id 86B546F93; Thu, 2 Apr 2020 00:30:37 +0000 (UTC) (envelope-from mav@FreeBSD.org) Received: from repo.freebsd.org ([127.0.1.37]) by repo.freebsd.org (8.15.2/8.15.2) with ESMTP id 0320UbnZ012096; Thu, 2 Apr 2020 00:30:37 GMT (envelope-from mav@FreeBSD.org) Received: (from mav@localhost) by repo.freebsd.org (8.15.2/8.15.2/Submit) id 0320UaO1012090; Thu, 2 Apr 2020 00:30:36 GMT (envelope-from mav@FreeBSD.org) Message-Id: <202004020030.0320UaO1012090@repo.freebsd.org> X-Authentication-Warning: repo.freebsd.org: mav set sender to mav@FreeBSD.org using -f From: Alexander Motin Date: Thu, 2 Apr 2020 00:30:36 +0000 (UTC) To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-stable@freebsd.org, svn-src-stable-11@freebsd.org Subject: svn commit: r359554 - in stable/11: cddl/contrib/opensolaris/cmd/ztest sys/cddl/contrib/opensolaris/uts/common/fs/zfs sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys X-SVN-Group: stable-11 X-SVN-Commit-Author: mav X-SVN-Commit-Paths: in stable/11: cddl/contrib/opensolaris/cmd/ztest sys/cddl/contrib/opensolaris/uts/common/fs/zfs sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys X-SVN-Commit-Revision: 359554 X-SVN-Commit-Repository: base MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-BeenThere: svn-src-stable-11@freebsd.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: SVN commit messages for only the 11-stable src tree List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 02 Apr 2020 00:50:51 -0000 Author: mav Date: Thu Apr 2 00:30:36 2020 New Revision: 359554 URL: https://svnweb.freebsd.org/changeset/base/359554 Log: MFC r359112: MFOpenZFS: make zil max block size tunable We've observed that on some highly fragmented pools, most metaslab allocations are small (~2-8KB), but there are some large, 128K allocations. The large allocations are for ZIL blocks. If there is a lot of fragmentation, the large allocations can be hard to satisfy. The most common impact of this is that we need to check (and thus load) lots of metaslabs from the ZIL allocation code path, causing sync writes to wait for metaslabs to load, which can take a second or more. In the worst case, we may not be able to satisfy the allocation, in which case the ZIL will resort to txg_wait_synced() to ensure the change is on disk. To provide a workaround for this, this change adds a tunable that can reduce the size of ZIL blocks. External-issue: DLPX-61719 Reviewed-by: George Wilson Reviewed-by: Paul Dagnelie Reviewed-by: Brian Behlendorf Signed-off-by: Matthew Ahrens Closes #8865 openzfs/zfs@b8738257c2607c73c731ce8e0fd73282b266d6ef Modified: stable/11/cddl/contrib/opensolaris/cmd/ztest/ztest.c stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil.h stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil_impl.h stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_log.c stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c Directory Properties: stable/11/ (props changed) Modified: stable/11/cddl/contrib/opensolaris/cmd/ztest/ztest.c ============================================================================== --- stable/11/cddl/contrib/opensolaris/cmd/ztest/ztest.c Thu Apr 2 00:30:01 2020 (r359553) +++ stable/11/cddl/contrib/opensolaris/cmd/ztest/ztest.c Thu Apr 2 00:30:36 2020 (r359554) @@ -1379,7 +1379,7 @@ ztest_log_write(ztest_ds_t *zd, dmu_tx_t *tx, lr_write if (zil_replaying(zd->zd_zilog, tx)) return; - if (lr->lr_length > ZIL_MAX_LOG_DATA) + if (lr->lr_length > zil_max_log_data(zd->zd_zilog)) write_state = WR_INDIRECT; itx = zil_itx_create(TX_WRITE, Modified: stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil.h ============================================================================== --- stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil.h Thu Apr 2 00:30:01 2020 (r359553) +++ stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil.h Thu Apr 2 00:30:36 2020 (r359554) @@ -20,7 +20,7 @@ */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. - * Copyright (c) 2012, 2017 by Delphix. All rights reserved. + * Copyright (c) 2012, 2018 by Delphix. All rights reserved. * Copyright (c) 2014 Integros [integros.com] */ @@ -438,6 +438,9 @@ extern int zil_bp_tree_add(zilog_t *zilog, const blkpt extern void zil_set_sync(zilog_t *zilog, uint64_t syncval); extern void zil_set_logbias(zilog_t *zilog, uint64_t slogval); + +extern uint64_t zil_max_copied_data(zilog_t *zilog); +extern uint64_t zil_max_log_data(zilog_t *zilog); extern int zil_replay_disable; Modified: stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil_impl.h ============================================================================== --- stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil_impl.h Thu Apr 2 00:30:01 2020 (r359553) +++ stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil_impl.h Thu Apr 2 00:30:36 2020 (r359554) @@ -20,7 +20,7 @@ */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. - * Copyright (c) 2012, 2017 by Delphix. All rights reserved. + * Copyright (c) 2012, 2018 by Delphix. All rights reserved. * Copyright (c) 2014 Integros [integros.com] */ @@ -206,34 +206,19 @@ struct zilog { uint_t zl_prev_rotor; /* rotor for zl_prev[] */ txg_node_t zl_dirty_link; /* protected by dp_dirty_zilogs list */ uint64_t zl_dirty_max_txg; /* highest txg used to dirty zilog */ + /* + * Max block size for this ZIL. Note that this can not be changed + * while the ZIL is in use because consumers (ZPL/zvol) need to take + * this into account when deciding between WR_COPIED and WR_NEED_COPY + * (see zil_max_copied_data()). + */ + uint64_t zl_max_block_size; }; typedef struct zil_bp_node { dva_t zn_dva; avl_node_t zn_node; } zil_bp_node_t; - -/* - * Maximum amount of write data that can be put into single log block. - */ -#define ZIL_MAX_LOG_DATA (SPA_OLD_MAXBLOCKSIZE - sizeof (zil_chain_t) - \ - sizeof (lr_write_t)) -#define ZIL_MAX_COPIED_DATA \ - ((SPA_OLD_MAXBLOCKSIZE - sizeof (zil_chain_t)) / 2 - sizeof (lr_write_t)) - -/* - * Maximum amount of log space we agree to waste to reduce number of - * WR_NEED_COPY chunks to reduce zl_get_data() overhead (~12%). - */ -#define ZIL_MAX_WASTE_SPACE (ZIL_MAX_LOG_DATA / 8) - -/* - * Maximum amount of write data for WR_COPIED. Fall back to WR_NEED_COPY - * as more space efficient if we can't fit at least two log records into - * maximum sized log block. - */ -#define ZIL_MAX_COPIED_DATA ((SPA_OLD_MAXBLOCKSIZE - \ - sizeof (zil_chain_t)) / 2 - sizeof (lr_write_t)) #ifdef __cplusplus } Modified: stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_log.c ============================================================================== --- stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_log.c Thu Apr 2 00:30:01 2020 (r359553) +++ stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_log.c Thu Apr 2 00:30:36 2020 (r359554) @@ -20,7 +20,7 @@ */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. - * Copyright (c) 2015 by Delphix. All rights reserved. + * Copyright (c) 2015, 2018 by Delphix. All rights reserved. * Copyright (c) 2014 Integros [integros.com] */ @@ -491,7 +491,14 @@ zfs_log_write(zilog_t *zilog, dmu_tx_t *tx, int txtype itx_wr_state_t wr_state = write_state; ssize_t len = resid; - if (wr_state == WR_COPIED && resid > ZIL_MAX_COPIED_DATA) + /* + * A WR_COPIED record must fit entirely in one log block. + * Large writes can use WR_NEED_COPY, which the ZIL will + * split into multiple records across several log blocks + * if necessary. + */ + if (wr_state == WR_COPIED && + resid > zil_max_copied_data(zilog)) wr_state = WR_NEED_COPY; else if (wr_state == WR_INDIRECT) len = MIN(blocksize - P2PHASE(off, blocksize), resid); Modified: stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c ============================================================================== --- stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c Thu Apr 2 00:30:01 2020 (r359553) +++ stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c Thu Apr 2 00:30:36 2020 (r359554) @@ -1252,6 +1252,15 @@ struct { }; /* + * Maximum block size used by the ZIL. This is picked up when the ZIL is + * initialized. Otherwise this should not be used directly; see + * zl_max_block_size instead. + */ +int zil_maxblocksize = SPA_OLD_MAXBLOCKSIZE; +SYSCTL_INT(_vfs_zfs, OID_AUTO, zil_maxblocksize, CTLFLAG_RWTUN, + &zil_maxblocksize, 0, "Limit in bytes of ZIL log block size"); + +/* * Start a log block write and advance to the next log block. * Calls are serialized. */ @@ -1327,7 +1336,7 @@ zil_lwb_write_issue(zilog_t *zilog, lwb_t *lwb) zil_blksz = zilog->zl_cur_used + sizeof (zil_chain_t); for (i = 0; zil_blksz > zil_block_buckets[i].limit; i++) continue; - zil_blksz = zil_block_buckets[i].blksz; + zil_blksz = MIN(zil_block_buckets[i].blksz, zilog->zl_max_block_size); zilog->zl_prev_blks[zilog->zl_prev_rotor] = zil_blksz; for (i = 0; i < ZIL_PREV_BLKS; i++) zil_blksz = MAX(zil_blksz, zilog->zl_prev_blks[i]); @@ -1384,13 +1393,47 @@ zil_lwb_write_issue(zilog_t *zilog, lwb_t *lwb) return (nlwb); } +/* + * Maximum amount of write data that can be put into single log block. + */ +uint64_t +zil_max_log_data(zilog_t *zilog) +{ + return (zilog->zl_max_block_size - + sizeof (zil_chain_t) - sizeof (lr_write_t)); +} + +/* + * Maximum amount of log space we agree to waste to reduce number of + * WR_NEED_COPY chunks to reduce zl_get_data() overhead (~12%). + */ +static inline uint64_t +zil_max_waste_space(zilog_t *zilog) +{ + return (zil_max_log_data(zilog) / 8); +} + +/* + * Maximum amount of write data for WR_COPIED. For correctness, consumers + * must fall back to WR_NEED_COPY if we can't fit the entire record into one + * maximum sized log block, because each WR_COPIED record must fit in a + * single log block. For space efficiency, we want to fit two records into a + * max-sized log block. + */ +uint64_t +zil_max_copied_data(zilog_t *zilog) +{ + return ((zilog->zl_max_block_size - sizeof (zil_chain_t)) / 2 - + sizeof (lr_write_t)); +} + static lwb_t * zil_lwb_commit(zilog_t *zilog, itx_t *itx, lwb_t *lwb) { lr_t *lrcb, *lrc; lr_write_t *lrwb, *lrw; char *lr_buf; - uint64_t dlen, dnow, lwb_sp, reclen, txg; + uint64_t dlen, dnow, lwb_sp, reclen, txg, max_log_data; ASSERT(MUTEX_HELD(&zilog->zl_issuer_lock)); ASSERT3P(lwb, !=, NULL); @@ -1439,15 +1482,27 @@ cont: * For WR_NEED_COPY optimize layout for minimal number of chunks. */ lwb_sp = lwb->lwb_sz - lwb->lwb_nused; + max_log_data = zil_max_log_data(zilog); if (reclen > lwb_sp || (reclen + dlen > lwb_sp && - lwb_sp < ZIL_MAX_WASTE_SPACE && (dlen % ZIL_MAX_LOG_DATA == 0 || - lwb_sp < reclen + dlen % ZIL_MAX_LOG_DATA))) { + lwb_sp < zil_max_waste_space(zilog) && + (dlen % max_log_data == 0 || + lwb_sp < reclen + dlen % max_log_data))) { lwb = zil_lwb_write_issue(zilog, lwb); if (lwb == NULL) return (NULL); zil_lwb_write_open(zilog, lwb); ASSERT(LWB_EMPTY(lwb)); lwb_sp = lwb->lwb_sz - lwb->lwb_nused; + + /* + * There must be enough space in the new, empty log block to + * hold reclen. For WR_COPIED, we need to fit the whole + * record in one block, and reclen is the header size + the + * data size. For WR_NEED_COPY, we can create multiple + * records, splitting the data into multiple blocks, so we + * only need to fit one word of data per block; in this case + * reclen is just the header size (no data). + */ ASSERT3U(reclen + MIN(dlen, sizeof (uint64_t)), <=, lwb_sp); } @@ -2872,6 +2927,7 @@ zil_alloc(objset_t *os, zil_header_t *zh_phys) zilog->zl_dirty_max_txg = 0; zilog->zl_last_lwb_opened = NULL; zilog->zl_last_lwb_latency = 0; + zilog->zl_max_block_size = zil_maxblocksize; mutex_init(&zilog->zl_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&zilog->zl_issuer_lock, NULL, MUTEX_DEFAULT, NULL); Modified: stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c ============================================================================== --- stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c Thu Apr 2 00:30:01 2020 (r359553) +++ stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c Thu Apr 2 00:30:36 2020 (r359554) @@ -1441,7 +1441,7 @@ zvol_log_write(zvol_state_t *zv, dmu_tx_t *tx, offset_ itx_wr_state_t wr_state = write_state; ssize_t len = resid; - if (wr_state == WR_COPIED && resid > ZIL_MAX_COPIED_DATA) + if (wr_state == WR_COPIED && resid > zil_max_copied_data(zilog)) wr_state = WR_NEED_COPY; else if (wr_state == WR_INDIRECT) len = MIN(blocksize - P2PHASE(off, blocksize), resid);