Date:      Thu, 29 Jun 2017 15:36:45 +0200
From:      Ben RUBSON <ben.rubson@gmail.com>
To:        Freebsd fs <freebsd-fs@freebsd.org>, freebsd-scsi@freebsd.org
Subject:   Re: I/O to pool appears to be hung, panic !
Message-ID:  <A1CC7D73-4196-4503-9716-52E84AA24FD3@gmail.com>
In-Reply-To: <20170629144334.1e283570@fabiankeil.de>
References:  <E8CC223E-3F41-4036-84A9-FBA693AC2CAA@gmail.com> <20170629144334.1e283570@fabiankeil.de>

> On 29 Jun 2017, at 14:43, Fabian Keil <freebsd-listen@fabiankeil.de> wrote:

Thank you for your feedback Fabian.

> Ben RUBSON <ben.rubson@gmail.com> wrote:
>
>> One of my servers did a kernel panic last night, giving the following message:
>> panic: I/O to pool 'home' appears to be hung on vdev guid 122... at '/dev/label/G23iscsi'.
> [...]
>> Here are some numbers regarding this disk, taken from the server hosting the pool:
>> (unfortunately not from the iSCSI target server)
>> https://s23.postimg.org/zd8jy9xaj/busydisk.png
>>
>> We clearly see that the disk suddenly became 100% busy, while the CPU was almost idle.
>>
>> No error message at all on both servers.
> [...]
>> The only log I have is the following stacktrace taken from the server console:
>> panic: I/O to pool 'home' appears to be hung on vdev guid 122... at '/dev/label/G23iscsi'.
>> cpuid = 0
>> KDB: stack backtrace:
>> #0 0xffffffff80b240f7 at kdb_backtrace+0x67
>> #1 0xffffffff80ad9462 at vpanic+0x182
>> #2 0xffffffff80ad92d3 at panic+0x43
>> #3 0xffffffff82238fa7 at vdev_deadman+0x127
>> #4 0xffffffff82238ec0 at vdev_deadman+0x40
>> #5 0xffffffff82238ec0 at vdev_deadman+0x40
>> #6 0xffffffff8222d0a6 at spa_deadman+0x86
>> #7 0xffffffff80af32da at softclock_call_cc+0x18a
>> #8 0xffffffff80af3854 at softclock+0x94
>> #9 0xffffffff80a9348f at intr_event_execute_handlers+0x20f
>> #10 0xffffffff80a936f6 at ithread_loop+0xc6
>> #11 0xffffffff80a900d5 at fork_exit+0x85
>> #12 0xffffffff80f846fe at fork_trampoline+0xe
>> Uptime: 92d2h47m6s
>>
>> I would have been pleased to make a dump available.
>> However, despite my (correct?) configuration, the server did not dump:
>> (nevertheless, "sysctl debug.kdb.panic=1" does make it dump)
>> # grep ^dump /boot/loader.conf /etc/rc.conf
>> /boot/loader.conf:dumpdev="/dev/mirror/swap"
>> /etc/rc.conf:dumpdev="AUTO"
>
> You may want to look at the NOTES section in gmirror(8).

Yes, I should already be OK (the prefer balance algorithm is set).
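For the record, here is how I check/set it (I believe these are the relevant
gmirror(8) knobs, so take this as a sketch rather than gospel):
# gmirror list swap | grep -i balance
# gmirror configure -b prefer swap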

>> I use the default kernel, with a rebuilt zfs module:
>> # uname -v
>> FreeBSD 11.0-RELEASE-p8 #0: Wed Feb 22 06:12:04 UTC 2017     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC
>>
>> I use the following iSCSI configuration, which disconnects the disks "as soon as" they are unavailable:
>> kern.iscsi.ping_timeout=5
>> kern.iscsi.fail_on_disconnection=1
>> kern.iscsi.iscsid_timeout=5
>>
>> I then think the disk was at least correctly reachable during these 20 busy minutes.
>>
>> So, any idea why I could have faced this issue?
>
> Is it possible that the system was under memory pressure?

No, I don't think it was:
https://s1.postimg.org/uvsebpyyn/busydisk2.png
More than 2GB of available memory.
Swap not used (624kB).
ARC behaviour seems correct (anon increases because ZFS can't actually write, I think).
Regarding the pool itself, it was receiving data at 6MB/s, sending around 30kB blocks to disks.
When the disk went busy, throughput fell to a few kB/s, with 128kB blocks.
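For what it's worth, the anon size can also be read directly from the ARC
kstats (assuming the usual kstat sysctl names):
# sysctl kstat.zfs.misc.arcstats.anon_size
# sysctl kstat.zfs.misc.arcstats.size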

> geli's use of malloc() is known to cause deadlocks under memory pressure:
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209759
>
> Given that gmirror uses malloc() as well it probably has the same issue.

I don't use geli so I should not face this issue.

>> I would have thought ZFS would have taken the busy device offline, instead of raising a panic.
>> Perhaps it is already possible to make ZFS behave like this?
>
> There's a tunable for this: vfs.zfs.deadman_enabled.
> If the panic is just a symptom of the deadlock it's unlikely
> to help though.

I think this tunable, if disabled, would have prevented the server from raising a panic:
# sysctl -d vfs.zfs.deadman_enabled
vfs.zfs.deadman_enabled: Kernel panic on stalled ZFS I/O
# sysctl vfs.zfs.deadman_enabled
vfs.zfs.deadman_enabled: 1

But I am not sure how it would have behaved then...
(busy disk miraculously back to normal status, memory pressure due to anon increasing...)
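If I understand the tunable correctly, disabling the panic would just be
(assuming it is writable at runtime, and it also seems to be a loader tunable):
# sysctl vfs.zfs.deadman_enabled=0
or vfs.zfs.deadman_enabled="0" in /boot/loader.conf, so that the deadman check
no longer panics the box.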

I also tried to look for some LSI SAS2008 error counters (on the target side),
but did not find anything interesting.
(sysctl -a | grep -i mps)
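In case it helps, the per-adapter counters live under dev.mps.<unit>, so a more
targeted query would be something like the following (mpsutil, if present on
this release, should also be able to query the controller directly):
# sysctl dev.mps.0
# mpsutil show adapter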

Thank you again,

Ben



