Date: Thu, 29 Jun 2017 14:43:34 +0200
From: Fabian Keil
To: Ben RUBSON
Cc: Freebsd fs
Subject: Re: I/O to pool appears to be hung, panic !
Message-ID: <20170629144334.1e283570@fabiankeil.de>

Ben RUBSON wrote:

> One of my servers had a kernel panic last night, giving the following message:
> panic: I/O to pool 'home' appears to be hung on vdev guid 122... at '/dev/label/G23iscsi'.
[...]
> Here are some numbers regarding this disk, taken from the server hosting the pool:
> (unfortunately not from the iSCSI target server)
> https://s23.postimg.org/zd8jy9xaj/busydisk.png
>
> We clearly see that the disk suddenly became 100% busy while the CPU stayed almost idle.
>
> No error message at all on either server.
[...]
> The only log I have is the following stack trace taken from the server console:
> panic: I/O to pool 'home' appears to be hung on vdev guid 122... at '/dev/label/G23iscsi'.
> cpuid = 0
> KDB: stack backtrace:
> #0 0xffffffff80b240f7 at kdb_backtrace+0x67
> #1 0xffffffff80ad9462 at vpanic+0x182
> #2 0xffffffff80ad92d3 at panic+0x43
> #3 0xffffffff82238fa7 at vdev_deadman+0x127
> #4 0xffffffff82238ec0 at vdev_deadman+0x40
> #5 0xffffffff82238ec0 at vdev_deadman+0x40
> #6 0xffffffff8222d0a6 at spa_deadman+0x86
> #7 0xffffffff80af32da at softclock_call_cc+0x18a
> #8 0xffffffff80af3854 at softclock+0x94
> #9 0xffffffff80a9348f at intr_event_execute_handlers+0x20f
> #10 0xffffffff80a936f6 at ithread_loop+0xc6
> #11 0xffffffff80a900d5 at fork_exit+0x85
> #12 0xffffffff80f846fe at fork_trampoline+0xe
> Uptime: 92d2h47m6s
>
> I would have been pleased to make a dump available.
> However, despite my (correct?) configuration, the server did not dump:
> (nevertheless, "sysctl debug.kdb.panic=1" does make it dump)
> # grep ^dump /boot/loader.conf /etc/rc.conf
> /boot/loader.conf:dumpdev="/dev/mirror/swap"
> /etc/rc.conf:dumpdev="AUTO"

You may want to look at the NOTES section in gmirror(8).
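If I remember those NOTES correctly, the gist is that a kernel dump only
goes to one component of the mirror and is only found again reliably when
the prefer balance algorithm is used. As an untested sketch (the mirror
name "swap" is simply taken from your dumpdev line; please re-read the
man page before changing anything):

  # Show the current balance algorithm and components of the swap mirror:
  gmirror list swap

  # Switch to the "prefer" balance algorithm so that reads, and therefore
  # savecore(8)'s search for the dump, go to the highest-priority component:
  gmirror configure -b prefer swap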
> I use the default kernel, with a rebuilt zfs module:
> # uname -v
> FreeBSD 11.0-RELEASE-p8 #0: Wed Feb 22 06:12:04 UTC 2017   root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC
>
> I use the following iSCSI configuration, which disconnects the disks "as soon as" they become unavailable:
> kern.iscsi.ping_timeout=5
> kern.iscsi.fail_on_disconnection=1
> kern.iscsi.iscsid_timeout=5
>
> I therefore think the disk was at least correctly reachable during these 20 busy minutes.
>
> So, any idea why I could have faced this issue?

Is it possible that the system was under memory pressure?

geli's use of malloc() is known to cause deadlocks under memory pressure:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209759

Given that gmirror uses malloc() as well, it probably has the same issue.

> I would have thought ZFS would have taken the busy device offline, instead of raising a panic.
> Perhaps it is already possible to make ZFS behave like this?

There's a tunable for this: vfs.zfs.deadman_enabled (see the P.S. below for a rough example of setting it).

If the panic is just a symptom of the deadlock, it's unlikely to help though.

Fabian
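P.S.: For completeness, roughly how the deadman behaviour could be tuned.
The values below are only placeholders, not recommendations, so check
"sysctl -d vfs.zfs.deadman_enabled" (and the related deadman sysctls) on
your system first:

  # At runtime, turn the deadman panic off entirely:
  sysctl vfs.zfs.deadman_enabled=0

  # Or set it at boot time via /boot/loader.conf, optionally together with
  # a larger timeout (in milliseconds) before the deadman fires:
  #   vfs.zfs.deadman_enabled="0"
  #   vfs.zfs.deadman_synctime_ms="1800000"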