Date: Thu, 29 Jun 2017 14:43:34 +0200
From: Fabian Keil
To: Ben RUBSON
Cc: Freebsd fs
Subject: Re: I/O to pool appears to be hung, panic !
Message-ID: <20170629144334.1e283570@fabiankeil.de>

Ben RUBSON wrote:

> One of my servers had a kernel panic last night, giving the following message:
> panic: I/O to pool 'home' appears to be hung on vdev guid 122... at '/dev/label/G23iscsi'.
[...]
> Here are some numbers regarding this disk, taken from the server hosting the pool:
> (unfortunately not from the iSCSI target server)
> https://s23.postimg.org/zd8jy9xaj/busydisk.png
>
> We clearly see that the disk suddenly became 100% busy while the CPU stayed almost idle.
>
> No error message at all on either server.
[...]
> The only log I have is the following stack trace taken from the server console:
> panic: I/O to pool 'home' appears to be hung on vdev guid 122... at '/dev/label/G23iscsi'.
> cpuid = 0
> KDB: stack backtrace:
> #0 0xffffffff80b240f7 at kdb_backtrace+0x67
> #1 0xffffffff80ad9462 at vpanic+0x182
> #2 0xffffffff80ad92d3 at panic+0x43
> #3 0xffffffff82238fa7 at vdev_deadman+0x127
> #4 0xffffffff82238ec0 at vdev_deadman+0x40
> #5 0xffffffff82238ec0 at vdev_deadman+0x40
> #6 0xffffffff8222d0a6 at spa_deadman+0x86
> #7 0xffffffff80af32da at softclock_call_cc+0x18a
> #8 0xffffffff80af3854 at softclock+0x94
> #9 0xffffffff80a9348f at intr_event_execute_handlers+0x20f
> #10 0xffffffff80a936f6 at ithread_loop+0xc6
> #11 0xffffffff80a900d5 at fork_exit+0x85
> #12 0xffffffff80f846fe at fork_trampoline+0xe
> Uptime: 92d2h47m6s
>
> I would have been pleased to make a dump available.
> However, despite my (correct?) configuration, the server did not dump:
> (nevertheless, "sysctl debug.kdb.panic=1" does make it dump)
> # grep ^dump /boot/loader.conf /etc/rc.conf
> /boot/loader.conf:dumpdev="/dev/mirror/swap"
> /etc/rc.conf:dumpdev="AUTO"

You may want to look at the NOTES section in gmirror(8).
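If I remember those NOTES correctly, the gist is that a kernel dump only
goes to one component of the mirror and is only found again reliably when
the prefer balance algorithm is used. As an untested sketch (the mirror
name "swap" is simply taken from your dumpdev line; please re-read the
man page before changing anything):

  # Show the current balance algorithm and components of the swap mirror:
  gmirror list swap

  # Switch to the "prefer" balance algorithm so that reads, and therefore
  # savecore(8)'s search for the dump, go to the highest-priority component:
  gmirror configure -b prefer swap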
> I use the default kernel, with a rebuilt zfs module:
> # uname -v
> FreeBSD 11.0-RELEASE-p8 #0: Wed Feb 22 06:12:04 UTC 2017   root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC
>
> I use the following iSCSI configuration, which disconnects the disks "as soon as" they become unavailable:
> kern.iscsi.ping_timeout=5
> kern.iscsi.fail_on_disconnection=1
> kern.iscsi.iscsid_timeout=5
>
> I therefore think the disk was at least correctly reachable during these 20 busy minutes.
>
> So, any idea why I could have faced this issue?

Is it possible that the system was under memory pressure?

geli's use of malloc() is known to cause deadlocks under memory pressure:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209759

Given that gmirror uses malloc() as well, it probably has the same issue.

> I would have thought ZFS would have taken the busy device offline, instead of raising a panic.
> Perhaps it is already possible to make ZFS behave like this?

There's a tunable for this: vfs.zfs.deadman_enabled (see the P.S. below for a rough example of setting it).

If the panic is just a symptom of the deadlock, it's unlikely to help though.

Fabian
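P.S.: For completeness, roughly how the deadman behaviour could be tuned.
The values below are only placeholders, not recommendations, so check
"sysctl -d vfs.zfs.deadman_enabled" (and the related deadman sysctls) on
your system first:

  # At runtime, turn the deadman panic off entirely:
  sysctl vfs.zfs.deadman_enabled=0

  # Or set it at boot time via /boot/loader.conf, optionally together with
  # a larger timeout (in milliseconds) before the deadman fires:
  #   vfs.zfs.deadman_enabled="0"
  #   vfs.zfs.deadman_synctime_ms="1800000"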