From: Ben RUBSON <ben.rubson@gmail.com>
Subject: I/O to pool appears to be hung, panic !
Date: Thu, 29 Jun 2017 13:48:59 +0200
To: freebsd-fs@freebsd.org, freebsd-scsi@freebsd.org

Hello,

One of my servers panicked last night, giving the following message :

  panic: I/O to pool 'home' appears to be hung on vdev guid 122... at '/dev/label/G23iscsi'.

The pool is laid out like this :

  home
    mirror
      label/G13local
      label/G14local
      label/G23iscsi  <-- busy disk
      label/G24iscsi
    mirror
      label/G15local
      label/G16local
      label/G25iscsi
      label/G26iscsi
    cache
      label/G10local
      label/G11local

The kernel is thus complaining about one of the 4 iSCSI disks in the pool.
All 4 of these disks come from another identical FreeBSD system (40G "no latency" link).
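For reference, this panic comes from the ZFS "deadman" logic, which fires when an I/O stays outstanding for too long. If I read things correctly it is tunable (sysctl names as seen on my FreeBSD 11 box; I have not tried changing them, so take this as a sketch only) :

  # threshold before an outstanding I/O is considered hung
  # (1000000 ms, i.e. ~16.5 minutes, here)
  sysctl vfs.zfs.deadman_synctime_ms
  # how often the deadman check runs
  sysctl vfs.zfs.deadman_checktime_ms
  # setting this to 0 (e.g. in /boot/loader.conf) should disable
  # the deadman panic entirely
  sysctl vfs.zfs.deadman_enabled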
Here are some numbers regarding this disk, taken from the server hosting the pool (unfortunately not from the iSCSI target server) :

https://s23.postimg.org/zd8jy9xaj/busydisk.png

We can clearly see that the disk suddenly became 100% busy, while the CPU was almost idle.
No error message at all on either server.

SMART data from the target disk looks fine :

  SMART Health Status: OK
  Current Drive Temperature: 32 C
  Drive Trip Temperature: 85 C
  Manufactured in week 22 of year 2016
  Specified cycle count over device lifetime: 50000
  Accumulated start-stop cycles: 18
  Specified load-unload count over device lifetime: 600000
  Accumulated load-unload cycles: 2362
  Elements in grown defect list: 0
  Vendor (Seagate) cache information
    Blocks sent to initiator = 5938879802638336

  Error counter log:
             Errors Corrected by           Total   Correction     Gigabytes    Total
                 ECC          rereads/     errors  algorithm      processed    uncorrected
             fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
  read:          0       14         0         14       488481      74496.712        0
  write:         0        0         0          0       126701      18438.443        0
  verify:        0        0         0          0        20107          0.370        0

  Non-medium error count: 0

  SMART Self-test log
  Num  Test             Status     segment  LifeTime  LBA_first_err  [SK ASC ASQ]
       Description                 number   (hours)
  # 1  Background long  Completed     -       7943         -         [-   -   -]
  # 2  Background long  Completed     -       7607         -         [-   -   -]
  # 3  Background long  Completed     -       7271         -         [-   -   -]

The only log I have is the following stack trace, taken from the server console :

  panic: I/O to pool 'home' appears to be hung on vdev guid 122... at '/dev/label/G23iscsi'.
  cpuid = 0
  KDB: stack backtrace:
  #0 0xffffffff80b240f7 at kdb_backtrace+0x67
  #1 0xffffffff80ad9462 at vpanic+0x182
  #2 0xffffffff80ad92d3 at panic+0x43
  #3 0xffffffff82238fa7 at vdev_deadman+0x127
  #4 0xffffffff82238ec0 at vdev_deadman+0x40
  #5 0xffffffff82238ec0 at vdev_deadman+0x40
  #6 0xffffffff8222d0a6 at spa_deadman+0x86
  #7 0xffffffff80af32da at softclock_call_cc+0x18a
  #8 0xffffffff80af3854 at softclock+0x94
  #9 0xffffffff80a9348f at intr_event_execute_handlers+0x20f
  #10 0xffffffff80a936f6 at ithread_loop+0xc6
  #11 0xffffffff80a900d5 at fork_exit+0x85
  #12 0xffffffff80f846fe at fork_trampoline+0xe
  Uptime: 92d2h47m6s

I would have been pleased to make a dump available. However, despite my (correct ?) configuration, the server did not dump.
(Nevertheless, "sysctl debug.kdb.panic=1" does make it dump.)

  # grep ^dump /boot/loader.conf /etc/rc.conf
  /boot/loader.conf:dumpdev="/dev/mirror/swap"
  /etc/rc.conf:dumpdev="AUTO"

  # gmirror list swap
  Components: 2
  Balance: prefer
  Providers:
  1. Name: mirror/swap
     Mediasize: 8589934080 (8.0G)
  Consumers:
  1. Name: label/swap1
     State: ACTIVE
     Priority: 0
  2. Name: label/swap2
     State: ACTIVE
     Priority: 1

I use the default kernel, with a rebuilt zfs module :

  # uname -v
  FreeBSD 11.0-RELEASE-p8 #0: Wed Feb 22 06:12:04 UTC 2017 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC

I use the following iSCSI configuration, which disconnects the disks "as soon as" they become unavailable :

  kern.iscsi.ping_timeout=5
  kern.iscsi.fail_on_disconnection=1
  kern.iscsi.iscsid_timeout=5

I therefore think the disk was still correctly reachable during these 20 busy minutes, otherwise iSCSI would have disconnected it.

So, any idea why I could have faced this issue ?

I would have expected ZFS to take the busy device offline instead of raising a panic.
Perhaps it is already possible to make ZFS behave like this ?

Thank you very much for your help & support !

Best regards,

Ben
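P.S. For what it's worth, the behaviour I would have expected from ZFS is roughly what can already be done by hand (a sketch only; device names are those of my pool above) :

  # fault the slow member; the 4-way mirror keeps running degraded
  zpool offline home label/G23iscsi
  # confirm the pool is degraded but still up
  zpool status home
  # once the iSCSI target is healthy again, resilver it back in
  zpool online home label/G23iscsi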