From owner-svn-src-head@freebsd.org Mon Nov 28 15:14:32 2016 Return-Path: Delivered-To: svn-src-head@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 9E18DC59BA7; Mon, 28 Nov 2016 15:14:32 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from repo.freebsd.org (repo.freebsd.org [IPv6:2610:1c1:1:6068::e6a:0]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 78E141CD6; Mon, 28 Nov 2016 15:14:32 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from repo.freebsd.org ([127.0.1.37]) by repo.freebsd.org (8.15.2/8.15.2) with ESMTP id uASFEVYJ028828; Mon, 28 Nov 2016 15:14:31 GMT (envelope-from avg@FreeBSD.org) Received: (from avg@localhost) by repo.freebsd.org (8.15.2/8.15.2/Submit) id uASFEVkH028825; Mon, 28 Nov 2016 15:14:31 GMT (envelope-from avg@FreeBSD.org) Message-Id: <201611281514.uASFEVkH028825@repo.freebsd.org> X-Authentication-Warning: repo.freebsd.org: avg set sender to avg@FreeBSD.org using -f From: Andriy Gapon Date: Mon, 28 Nov 2016 15:14:31 +0000 (UTC) To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: svn commit: r309250 - head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs X-SVN-Group: head MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-BeenThere: svn-src-head@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: SVN commit messages for the src tree for head/-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Nov 2016 15:14:32 -0000 Author: avg Date: Mon Nov 28 15:14:31 2016 New Revision: 309250 URL: https://svnweb.freebsd.org/changeset/base/309250 Log: MFV r309249: 3821 Race in rollback, zil close, and zil flush Note: there was a merge conflict resolved by me. illumos/illumos-gate@43297f973a3543e7403ac27076490ab958a94b15 https://github.com/illumos/illumos-gate/commit/43297f973a3543e7403ac27076490ab958a94b15 https://www.illumos.org/issues/3821 We recently had nodes with some of the latest zfs bits panic on us in a rollback-heavy environment. The following is from my preliminary analysis: Let's look at where we died: > $C ffffff01ea6b9a10 taskq_dispatch+0x3a(0, fffffffff7d20450, ffffff5551dea920, 1) ffffff01ea6b9a60 zil_clean+0xce(ffffff4b7106c080, 7e0f1) ffffff01ea6b9aa0 dsl_pool_sync_done+0x47(ffffff4313065680, 7e0f1) ffffff01ea6b9b70 spa_sync+0x55f(ffffff4310c1d040, 7e0f1) ffffff01ea6b9c20 txg_sync_thread+0x20f(ffffff4313065680) ffffff01ea6b9c30 thread_start+8() If we dig in we can find that this dataset corresponds to a zone: > ffffff4b7106c080::print zilog_t zl_os->os_dsl_dataset->ds_dir->dd_myname zl_os->os_dsl_dataset->ds_dir->dd_myname = [ "8ffce16a-13c2-4efa-a233- 9e378e89877b" ] Okay so we have a null taskq pointer. That only happens during the calls to zil_open and zil_close. If we poke around we can see that we're actually in midst of a rollback: > ::pgrep zfs | ::printf "0x%x %s\\n" proc_t . p_user.u_psargs 0xffffff43262800a0 zfs rollback zones/15714eb6-f5ea-469f-ac6d- 4b8ab06213c2@marlin_init 0xffffff54e22a1028 zfs rollback zones/8ffce16a-13c2-4efa-a233- 9e378e89877b@marlin_init 0xffffff4362f3a058 zfs rollback zones/0ddb8e49-ca7e-42e1-8fdc- 4ac4ba8fe9f8@marlin_init 0xffffff5748e8d020 zfs rollback zones/426357b5-832d-4430-953e- 10cd45ff8e9f@marlin_init 0xffffff436b867008 zfs rollback zones/8f36bf37-8a9c-4a44-995c- 6d1b2751e6f5@marlin_init 0xffffff4381ad4090 zfs rollback zones/6c8eca18-fbd6-46dd-ac24- 2ed45cd0da70@marlin_init Reviewed by: Matthew Ahrens Reviewed by: Dan Kimmel Reviewed by: Pavel Zakharov Reviewed by: Andriy Gapon Approved by: Richard Lowe Author: George Wilson MFC after: 3 weeks Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c Directory Properties: head/sys/cddl/contrib/opensolaris/ (props changed) Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c ============================================================================== --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c Mon Nov 28 15:09:58 2016 (r309249) +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c Mon Nov 28 15:14:31 2016 (r309250) @@ -20,7 +20,7 @@ */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. - * Copyright (c) 2011, 2015 by Delphix. All rights reserved. + * Copyright (c) 2011, 2016 by Delphix. All rights reserved. * Copyright (c) 2013 Steven Hartland. All rights reserved. * Copyright (c) 2014 Spectra Logic Corporation, All rights reserved. * Copyright (c) 2014 Integros [integros.com] @@ -681,9 +681,16 @@ dsl_pool_sync_done(dsl_pool_t *dp, uint6 { zilog_t *zilog; - while (zilog = txg_list_remove(&dp->dp_dirty_zilogs, txg)) { + while (zilog = txg_list_head(&dp->dp_dirty_zilogs, txg)) { dsl_dataset_t *ds = dmu_objset_ds(zilog->zl_os); + /* + * We don't remove the zilog from the dp_dirty_zilogs + * list until after we've cleaned it. This ensures that + * callers of zilog_is_dirty() receive an accurate + * answer when they are racing with the spa sync thread. + */ zil_clean(zilog, txg); + (void) txg_list_remove_this(&dp->dp_dirty_zilogs, zilog, txg); ASSERT(!dmu_objset_is_dirty(zilog->zl_os, txg)); dmu_buf_rele(ds->ds_dbuf, zilog); } Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c ============================================================================== --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c Mon Nov 28 15:09:58 2016 (r309249) +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c Mon Nov 28 15:14:31 2016 (r309250) @@ -7086,8 +7086,6 @@ spa_sync(spa_t *spa, uint64_t txg) spa->spa_config_syncing = NULL; } - spa->spa_ubsync = spa->spa_uberblock; - dsl_pool_sync_done(dp, txg); mutex_enter(&spa->spa_alloc_lock); @@ -7112,6 +7110,13 @@ spa_sync(spa_t *spa, uint64_t txg) spa->spa_sync_pass = 0; + /* + * Update the last synced uberblock here. We want to do this at + * the end of spa_sync() so that consumers of spa_last_synced_txg() + * will be guaranteed that all the processing associated with + * that txg has been completed. + */ + spa->spa_ubsync = spa->spa_uberblock; spa_config_exit(spa, SCL_CONFIG, FTAG); spa_handle_ignored_writes(spa); Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c ============================================================================== --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c Mon Nov 28 15:09:58 2016 (r309249) +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c Mon Nov 28 15:14:31 2016 (r309250) @@ -20,8 +20,7 @@ */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. - * Copyright (c) 2011, 2015 by Delphix. All rights reserved. - * Copyright (c) 2011, 2014 by Delphix. All rights reserved. + * Copyright (c) 2011, 2016 by Delphix. All rights reserved. * Copyright (c) 2014 Integros [integros.com] */ @@ -502,6 +501,27 @@ zilog_dirty(zilog_t *zilog, uint64_t txg } } +/* + * Determine if the zil is dirty in the specified txg. Callers wanting to + * ensure that the dirty state does not change must hold the itxg_lock for + * the specified txg. Holding the lock will ensure that the zil cannot be + * dirtied (zil_itx_assign) or cleaned (zil_clean) while we check its current + * state. + */ +boolean_t +zilog_is_dirty_in_txg(zilog_t *zilog, uint64_t txg) +{ + dsl_pool_t *dp = zilog->zl_dmu_pool; + + if (txg_list_member(&dp->dp_dirty_zilogs, zilog, txg & TXG_MASK)) + return (B_TRUE); + return (B_FALSE); +} + +/* + * Determine if the zil is dirty. The zil is considered dirty if it has + * any pending itx records that have not been cleaned by zil_clean(). + */ boolean_t zilog_is_dirty(zilog_t *zilog) { @@ -1062,8 +1082,6 @@ zil_lwb_commit(zilog_t *zilog, itx_t *it return (NULL); ASSERT(lwb->lwb_buf != NULL); - ASSERT(zilog_is_dirty(zilog) || - spa_freeze_txg(zilog->zl_spa) != UINT64_MAX); if (lrc->lrc_txtype == TX_WRITE && itx->itx_wr_state == WR_NEED_COPY) dlen = P2ROUNDUP_TYPED( @@ -1401,6 +1419,11 @@ zil_get_commit_list(zilog_t *zilog) else otxg = spa_last_synced_txg(zilog->zl_spa) + 1; + /* + * This is inherently racy, since there is nothing to prevent + * the last synced txg from changing. That's okay since we'll + * only commit things in the future. + */ for (txg = otxg; txg < (otxg + TXG_CONCURRENT_STATES); txg++) { itxg_t *itxg = &zilog->zl_itxg[txg & TXG_MASK]; @@ -1410,6 +1433,16 @@ zil_get_commit_list(zilog_t *zilog) continue; } + /* + * If we're adding itx records to the zl_itx_commit_list, + * then the zil better be dirty in this "txg". We can assert + * that here since we're holding the itxg_lock which will + * prevent spa_sync from cleaning it. Once we add the itxs + * to the zl_itx_commit_list we must commit it to disk even + * if it's unnecessary (i.e. the txg was synced). + */ + ASSERT(zilog_is_dirty_in_txg(zilog, txg) || + spa_freeze_txg(zilog->zl_spa) != UINT64_MAX); list_move_tail(commit_list, &itxg->itxg_itxs->i_sync_list); mutex_exit(&itxg->itxg_lock); @@ -1432,6 +1465,10 @@ zil_async_to_sync(zilog_t *zilog, uint64 else otxg = spa_last_synced_txg(zilog->zl_spa) + 1; + /* + * This is inherently racy, since there is nothing to prevent + * the last synced txg from changing. + */ for (txg = otxg; txg < (otxg + TXG_CONCURRENT_STATES); txg++) { itxg_t *itxg = &zilog->zl_itxg[txg & TXG_MASK]; @@ -1503,8 +1540,14 @@ zil_commit_writer(zilog_t *zilog) DTRACE_PROBE1(zil__cw1, zilog_t *, zilog); while (itx = list_head(&zilog->zl_itx_commit_list)) { txg = itx->itx_lr.lrc_txg; - ASSERT(txg); + ASSERT3U(txg, !=, 0); + /* + * This is inherently racy and may result in us writing + * out a log block for a txg that was just synced. This is + * ok since we'll end cleaning up that log block the next + * time we call zil_sync(). + */ if (txg > spa_last_synced_txg(spa) || txg > spa_freeze_txg(spa)) lwb = zil_lwb_commit(zilog, itx, lwb); list_remove(&zilog->zl_itx_commit_list, itx); @@ -1821,7 +1864,10 @@ zil_close(zilog_t *zilog) mutex_exit(&zilog->zl_lock); if (txg) txg_wait_synced(zilog->zl_dmu_pool, txg); - ASSERT(!zilog_is_dirty(zilog)); + + if (zilog_is_dirty(zilog)) + zfs_dbgmsg("zil (%p) is dirty, txg %llu", zilog, txg); + VERIFY(!zilog_is_dirty(zilog)); taskq_destroy(zilog->zl_clean_taskq); zilog->zl_clean_taskq = NULL;