From owner-svn-src-all@freebsd.org Tue Aug 8 10:44:50 2017 Return-Path: Delivered-To: svn-src-all@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 32149DD29B5; Tue, 8 Aug 2017 10:44:50 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from repo.freebsd.org (repo.freebsd.org [IPv6:2610:1c1:1:6068::e6a:0]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 0C2907E519; Tue, 8 Aug 2017 10:44:49 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from repo.freebsd.org ([127.0.1.37]) by repo.freebsd.org (8.15.2/8.15.2) with ESMTP id v78AinUp030454; Tue, 8 Aug 2017 10:44:49 GMT (envelope-from avg@FreeBSD.org) Received: (from avg@localhost) by repo.freebsd.org (8.15.2/8.15.2/Submit) id v78Aimom030451; Tue, 8 Aug 2017 10:44:48 GMT (envelope-from avg@FreeBSD.org) Message-Id: <201708081044.v78Aimom030451@repo.freebsd.org> X-Authentication-Warning: repo.freebsd.org: avg set sender to avg@FreeBSD.org using -f From: Andriy Gapon Date: Tue, 8 Aug 2017 10:44:48 +0000 (UTC) To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-vendor@freebsd.org Subject: svn commit: r322223 - vendor-sys/illumos/dist/uts/common/fs/zfs vendor/illumos/dist/cmd/ztest X-SVN-Group: vendor-sys X-SVN-Commit-Author: avg X-SVN-Commit-Paths: vendor-sys/illumos/dist/uts/common/fs/zfs vendor/illumos/dist/cmd/ztest X-SVN-Commit-Revision: 322223 X-SVN-Commit-Repository: base MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-BeenThere: svn-src-all@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "SVN commit messages for the entire src tree \(except for " user" and " projects" \)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 08 Aug 2017 10:44:50 -0000 Author: avg Date: Tue Aug 8 10:44:48 2017 New Revision: 322223 URL: https://svnweb.freebsd.org/changeset/base/322223 Log: 8378 crash due to bp in-memory modification of nopwrite block illumos/illumos-gate@b7edcb940884114e61382937505433c4c38c0278 https://github.com/illumos/illumos-gate/commit/b7edcb940884114e61382937505433c4c38c0278 https://www.illumos.org/issues/8378 The problem is that zfs_get_data() supplies a stale zgd_bp to dmu_sync(), which we then nopwrite against. zfs_get_data() doesn't hold any DMU-related locks, so after it copies db_blkptr to zgd_bp, dbuf_write_ready() could change db_blkptr, and dbuf_write_done() could remove the dirty record. dmu_sync() then sees the stale BP and that the dbuf it not dirty, so it is eligible for nop-writing. The fix is for dmu_sync() to copy db_blkptr to zgd_bp after acquiring the db_mtx. We could still see a stale db_blkptr, but if it is stale then the dirty record will still exist and thus we won't attempt to nopwrite. Reviewed by: Prakash Surya Reviewed by: George Wilson Approved by: Robert Mustacchi Author: Matthew Ahrens Modified: vendor-sys/illumos/dist/uts/common/fs/zfs/dmu.c vendor-sys/illumos/dist/uts/common/fs/zfs/zfs_vnops.c vendor-sys/illumos/dist/uts/common/fs/zfs/zvol.c Changes in other areas also in this revision: Modified: vendor/illumos/dist/cmd/ztest/ztest.c Modified: vendor-sys/illumos/dist/uts/common/fs/zfs/dmu.c ============================================================================== --- vendor-sys/illumos/dist/uts/common/fs/zfs/dmu.c Tue Aug 8 10:43:41 2017 (r322222) +++ vendor-sys/illumos/dist/uts/common/fs/zfs/dmu.c Tue Aug 8 10:44:48 2017 (r322223) @@ -1574,6 +1574,7 @@ dmu_sync_done(zio_t *zio, arc_buf_t *buf, void *varg) uint8_t chksum = BP_GET_CHECKSUM(bp_orig); ASSERT(BP_EQUAL(bp, bp_orig)); + VERIFY(BP_EQUAL(bp, db->db_blkptr)); ASSERT(zio->io_prop.zp_compress != ZIO_COMPRESS_OFF); ASSERT(zio_checksum_table[chksum].ci_flags & ZCHECKSUM_FLAG_NOPWRITE); @@ -1614,19 +1615,11 @@ dmu_sync_late_arrival_done(zio_t *zio) blkptr_t *bp_orig = &zio->io_bp_orig; if (zio->io_error == 0 && !BP_IS_HOLE(bp)) { - /* - * If we didn't allocate a new block (i.e. ZIO_FLAG_NOPWRITE) - * then there is nothing to do here. Otherwise, free the - * newly allocated block in this txg. - */ - if (zio->io_flags & ZIO_FLAG_NOPWRITE) { - ASSERT(BP_EQUAL(bp, bp_orig)); - } else { - ASSERT(BP_IS_HOLE(bp_orig) || !BP_EQUAL(bp, bp_orig)); - ASSERT(zio->io_bp->blk_birth == zio->io_txg); - ASSERT(zio->io_txg > spa_syncing_txg(zio->io_spa)); - zio_free(zio->io_spa, zio->io_txg, zio->io_bp); - } + ASSERT(!(zio->io_flags & ZIO_FLAG_NOPWRITE)); + ASSERT(BP_IS_HOLE(bp_orig) || !BP_EQUAL(bp, bp_orig)); + ASSERT(zio->io_bp->blk_birth == zio->io_txg); + ASSERT(zio->io_txg > spa_syncing_txg(zio->io_spa)); + zio_free(zio->io_spa, zio->io_txg, zio->io_bp); } dmu_tx_commit(dsa->dsa_tx); @@ -1658,6 +1651,29 @@ dmu_sync_late_arrival(zio_t *pio, objset_t *os, dmu_sy dsa->dsa_zgd = zgd; dsa->dsa_tx = tx; + /* + * Since we are currently syncing this txg, it's nontrivial to + * determine what BP to nopwrite against, so we disable nopwrite. + * + * When syncing, the db_blkptr is initially the BP of the previous + * txg. We can not nopwrite against it because it will be changed + * (this is similar to the non-late-arrival case where the dbuf is + * dirty in a future txg). + * + * Then dbuf_write_ready() sets bp_blkptr to the location we will write. + * We can not nopwrite against it because although the BP will not + * (typically) be changed, the data has not yet been persisted to this + * location. + * + * Finally, when dbuf_write_done() is called, it is theoretically + * possible to always nopwrite, because the data that was written in + * this txg is the same data that we are trying to write. However we + * would need to check that this dbuf is not dirty in any future + * txg's (as we do in the normal dmu_sync() path). For simplicity, we + * don't nopwrite in this case. + */ + zp->zp_nopwrite = B_FALSE; + zio_nowait(zio_write(pio, os->os_spa, dmu_tx_get_txg(tx), zgd->zgd_bp, abd_get_from_buf(zgd->zgd_db->db_data, zgd->zgd_db->db_size), zgd->zgd_db->db_size, zgd->zgd_db->db_size, zp, @@ -1695,7 +1711,6 @@ dmu_sync_late_arrival(zio_t *pio, objset_t *os, dmu_sy int dmu_sync(zio_t *pio, uint64_t txg, dmu_sync_cb_t *done, zgd_t *zgd) { - blkptr_t *bp = zgd->zgd_bp; dmu_buf_impl_t *db = (dmu_buf_impl_t *)zgd->zgd_db; objset_t *os = db->db_objset; dsl_dataset_t *ds = os->os_dsl_dataset; @@ -1762,6 +1777,21 @@ dmu_sync(zio_t *pio, uint64_t txg, dmu_sync_cb_t *done ASSERT(dr->dr_next == NULL || dr->dr_next->dr_txg < txg); + if (db->db_blkptr != NULL) { + /* + * We need to fill in zgd_bp with the current blkptr so that + * the nopwrite code can check if we're writing the same + * data that's already on disk. We can only nopwrite if we + * are sure that after making the copy, db_blkptr will not + * change until our i/o completes. We ensure this by + * holding the db_mtx, and only allowing nopwrite if the + * block is not already dirty (see below). This is verified + * by dmu_sync_done(), which VERIFYs that the db_blkptr has + * not changed. + */ + *zgd->zgd_bp = *db->db_blkptr; + } + /* * Assume the on-disk data is X, the current syncing data (in * txg - 1) is Y, and the current in-memory data is Z (currently @@ -1813,7 +1843,7 @@ dmu_sync(zio_t *pio, uint64_t txg, dmu_sync_cb_t *done dsa->dsa_tx = NULL; zio_nowait(arc_write(pio, os->os_spa, txg, - bp, dr->dt.dl.dr_data, DBUF_IS_L2CACHEABLE(db), + zgd->zgd_bp, dr->dt.dl.dr_data, DBUF_IS_L2CACHEABLE(db), &zp, dmu_sync_ready, NULL, NULL, dmu_sync_done, dsa, ZIO_PRIORITY_SYNC_WRITE, ZIO_FLAG_CANFAIL, &zb)); Modified: vendor-sys/illumos/dist/uts/common/fs/zfs/zfs_vnops.c ============================================================================== --- vendor-sys/illumos/dist/uts/common/fs/zfs/zfs_vnops.c Tue Aug 8 10:43:41 2017 (r322222) +++ vendor-sys/illumos/dist/uts/common/fs/zfs/zfs_vnops.c Tue Aug 8 10:44:48 2017 (r322223) @@ -1053,7 +1053,6 @@ zfs_get_data(void *arg, lr_write_t *lr, char *buf, zio uint64_t object = lr->lr_foid; uint64_t offset = lr->lr_offset; uint64_t size = lr->lr_length; - blkptr_t *bp = &lr->lr_blkptr; dmu_buf_t *db; zgd_t *zgd; int error = 0; @@ -1130,11 +1129,7 @@ zfs_get_data(void *arg, lr_write_t *lr, char *buf, zio DMU_READ_NO_PREFETCH); if (error == 0) { - blkptr_t *obp = dmu_buf_get_blkptr(db); - if (obp) { - ASSERT(BP_IS_HOLE(bp)); - *bp = *obp; - } + blkptr_t *bp = &lr->lr_blkptr; zgd->zgd_db = db; zgd->zgd_bp = bp; Modified: vendor-sys/illumos/dist/uts/common/fs/zfs/zvol.c ============================================================================== --- vendor-sys/illumos/dist/uts/common/fs/zfs/zvol.c Tue Aug 8 10:43:41 2017 (r322222) +++ vendor-sys/illumos/dist/uts/common/fs/zfs/zvol.c Tue Aug 8 10:44:48 2017 (r322223) @@ -24,7 +24,7 @@ * Portions Copyright 2010 Robert Milkowski * * Copyright 2011 Nexenta Systems, Inc. All rights reserved. - * Copyright (c) 2012, 2016 by Delphix. All rights reserved. + * Copyright (c) 2012, 2017 by Delphix. All rights reserved. * Copyright (c) 2013, Joyent, Inc. All rights reserved. * Copyright (c) 2014 Integros [integros.com] */ @@ -991,7 +991,6 @@ zvol_get_data(void *arg, lr_write_t *lr, char *buf, zi uint64_t object = ZVOL_OBJ; uint64_t offset = lr->lr_offset; uint64_t size = lr->lr_length; /* length of user data */ - blkptr_t *bp = &lr->lr_blkptr; dmu_buf_t *db; zgd_t *zgd; int error; @@ -1019,11 +1018,7 @@ zvol_get_data(void *arg, lr_write_t *lr, char *buf, zi error = dmu_buf_hold(os, object, offset, zgd, &db, DMU_READ_NO_PREFETCH); if (error == 0) { - blkptr_t *obp = dmu_buf_get_blkptr(db); - if (obp) { - ASSERT(BP_IS_HOLE(bp)); - *bp = *obp; - } + blkptr_t *bp = &lr->lr_blkptr; zgd->zgd_db = db; zgd->zgd_bp = bp;