From: bugzilla-noreply@freebsd.org
To: bugs@FreeBSD.org
Subject: [Bug 243814] ZFS deadlock when adding cache partition
Date: Sun, 02 Feb 2020 18:05:52 +0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=243814

            Bug ID: 243814
           Summary: ZFS deadlock when adding cache partition
           Product: Base System
           Version: 12.1-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: jfc@mit.edu

My system hung in ZFS, probably a deadlock. I doubt this is reproducible, but there have been scattered other reports, so I'll add more data.
The machine is an HPE ProLiant with an AMD CPU, 96 GB of RAM, a UFS root on NVMe, a ZFS mirrored pool on two spinning disks, and a ZFS raidz2 pool on five spinning disks encrypted with full-disk geli. Both pools have failmode=continue. The mirrored pool was idle; the raidz2 pool caused the hang.

zpool add ... cache ... was hung waiting on tx->tx_sync_done_cv. Meanwhile, a zpool iostat was hung waiting on spa_namespace_lock. shutdown -r failed to reboot, probably because of the same deadlock. After the 90-second timeout in rc.shutdown, init attempted to go into single-user mode but never started a shell. I had to power-cycle the system.

Now back in time to the setup. I was moving a lot of data into the pool and from filesystem to filesystem within the pool. I noticed the transfer tended to stop for minutes at a time, and ZFS administrative actions and the sync command also took several minutes, probably waiting for many gigabytes of dirty buffers to be written. I thought adding a cache drive might help, so I created a partition on the NVMe drive and ran

# zpool add private cache nvd0p4

("private" is the name of my pool.) And that's where everything went wrong. Instead of a 1-10 minute wait for buffers to drain, I/O to the pool was totally hung.
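As a rough illustration of why multi-minute stalls are plausible while dirty data drains (all numbers here are assumptions for the sake of arithmetic, not measurements from my system):

```python
# Back-of-the-envelope estimate of how long a sync-type operation could
# block while ZFS flushes dirty data. Assumed, illustrative numbers only.
dirty_bytes = 4 * 2**30      # assume ~4 GiB of accumulated dirty data
data_disks = 3               # a 5-disk raidz2 has roughly 3 data disks of bandwidth
disk_bytes_s = 20 * 10**6    # assume ~20 MB/s effective per geli-encrypted spinning disk

drain_seconds = dirty_bytes / (data_disks * disk_bytes_s)
print(f"~{drain_seconds:.0f} seconds to drain")
```

Under those assumptions a flush takes on the order of a minute; lower effective throughput (scattered writes, encryption overhead) stretches it further, which matches the minutes-long pauses I saw.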
Control-T reported:

load: 3.47 cmd: zpool 79137 [scl->scl_cv] 1.09r 0.00u 0.00s 0% 4284k
load: 3.47 cmd: zpool 79137 [scl->scl_cv] 2.32r 0.00u 0.00s 0% 4284k
load: 3.19 cmd: zpool 79137 [scl->scl_cv] 6.86r 0.00u 0.00s 0% 4284k
load: 1.61 cmd: zpool 79137 [tx->tx_sync_done_cv] 500.21r 0.00u 0.00s 0% 4284k
load: 0.96 cmd: zpool 79137 [tx->tx_sync_done_cv] 566.71r 0.00u 0.00s 0% 4284k
load: 0.75 cmd: zpool 79137 [tx->tx_sync_done_cv] 577.43r 0.00u 0.00s 0% 4284k
load: 0.61 cmd: zpool 79137 [tx->tx_sync_done_cv] 611.66r 0.00u 0.00s 0% 4284k
load: 0.14 cmd: zpool 79137 [tx->tx_sync_done_cv] 736.87r 0.00u 0.00s 0% 4284k
load: 0.05 cmd: zpool 79137 [tx->tx_sync_done_cv] 792.23r 0.00u 0.00s 0% 4284k
load: 0.18 cmd: zpool 79137 [tx->tx_sync_done_cv] 997.17r 0.00u 0.00s 0% 4284k
load: 0.37 cmd: zpool 79137 [tx->tx_sync_done_cv] 1198.80r 0.00u 0.00s 0% 4284k
load: 0.49 cmd: zpool 79137 [tx->tx_sync_done_cv] 1339.23r 0.00u 0.00s 0% 4284k

Meanwhile, in another shell, ^T to a zpool iostat reported:

load: 0.24 cmd: zpool 50732 [spa_namespace_lock] 179170.71r 0.01u 0.08s 0% 1068k

/var/log/messages has these lines that seem relevant. I created a partition intended for cache and then resized it before using it. That seems to create the error 6 condition; I've seen it elsewhere but without any side effects.

kernel: g_access(958): provider gptid/af44cd24-36ef-11ea-a744-48df37a69238 has error 6 set
syslogd: last message repeated 2 times
kernel: g_dev_taste: make_dev_p() failed (gp->name=gptid/af44cd24-36ef-11ea-a744-48df37a69238, error=17)
ZFS[79140]: vdev state changed, pool_guid=$706653905921838876 vdev_guid=$2416291949121178716
ZFS[79141]: vdev state changed, pool_guid=$706653905921838876 vdev_guid=$2416291949121178716

The partition UUID is for the cache partition and the pool UUID is for the pool I was using. (There is also a second pool that was mounted but idle.) But I have failmode=continue, so an I/O error should not hang the system.
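For anyone skimming the ^T output: each line names the wait channel in brackets and the elapsed real time before the "r". A small sketch (purely illustrative; the regex is an assumption based on the lines shown here) that pulls out those two fields, making it easy to see the process move from the spa config lock (scl->scl_cv) to waiting forever on txg sync (tx->tx_sync_done_cv):

```python
import re

# Match FreeBSD ^T (SIGINFO) status lines like the ones above and extract
# the wait channel and elapsed real time. Pattern is an assumption derived
# from the sample lines in this report, not an official format spec.
LINE = re.compile(
    r"load: (?P<load>[\d.]+) cmd: (?P<cmd>\S+) (?P<pid>\d+) "
    r"\[(?P<wchan>[^\]]+)\] (?P<real>[\d.]+)r"
)

def parse(line):
    """Return (wait_channel, elapsed_real_seconds), or None if no match."""
    m = LINE.match(line)
    return (m.group("wchan"), float(m.group("real"))) if m else None

sample = "load: 1.61 cmd: zpool 79137 [tx->tx_sync_done_cv] 500.21r 0.00u 0.00s 0% 4284k"
print(parse(sample))  # -> ('tx->tx_sync_done_cv', 500.21)
```

A steadily growing elapsed time on the same wait channel, as in the series above, is what distinguishes a true hang from a merely slow operation.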
-- 
You are receiving this mail because:
You are the assignee for the bug.