From: Ben RUBSON <ben.rubson@gmail.com>
Subject: I/O to pool appears to be hung, panic !
Date: Thu, 29 Jun 2017 13:48:59 +0200
To: freebsd-fs@freebsd.org, freebsd-scsi@freebsd.org

Hello,

One of my servers panicked last night, giving the following message :

  panic: I/O to pool 'home' appears to be hung on vdev guid 122... at '/dev/label/G23iscsi'.

The pool is laid out like this :

  home
    mirror
      label/G13local
      label/G14local
      label/G23iscsi  <-- busy disk
      label/G24iscsi
    mirror
      label/G15local
      label/G16local
      label/G25iscsi
      label/G26iscsi
    cache
      label/G10local
      label/G11local

The kernel is thus complaining about one of the 4 iSCSI disks in the pool.
All 4 of these disks come from another identical FreeBSD system (40G "no latency" link).
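For reference, this panic comes from the ZFS "deadman" logic, which fires when an I/O stays outstanding for too long. If I read things correctly it is tunable (sysctl names as seen on my FreeBSD 11 box; I have not tried changing them, so take this as a sketch only) :

  # threshold before an outstanding I/O is considered hung
  # (1000000 ms, i.e. ~16.5 minutes, here)
  sysctl vfs.zfs.deadman_synctime_ms
  # how often the deadman check runs
  sysctl vfs.zfs.deadman_checktime_ms
  # setting this to 0 (e.g. in /boot/loader.conf) should disable
  # the deadman panic entirely
  sysctl vfs.zfs.deadman_enabled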
Here are some numbers regarding this disk, taken from the server hosting the pool (unfortunately not from the iSCSI target server) :

https://s23.postimg.org/zd8jy9xaj/busydisk.png

We can clearly see that the disk suddenly became 100% busy, while the CPU was almost idle.
No error message at all on either server.

SMART data from the target disk looks fine :

  SMART Health Status: OK
  Current Drive Temperature: 32 C
  Drive Trip Temperature: 85 C
  Manufactured in week 22 of year 2016
  Specified cycle count over device lifetime: 50000
  Accumulated start-stop cycles: 18
  Specified load-unload count over device lifetime: 600000
  Accumulated load-unload cycles: 2362
  Elements in grown defect list: 0
  Vendor (Seagate) cache information
    Blocks sent to initiator = 5938879802638336

  Error counter log:
             Errors Corrected by           Total   Correction     Gigabytes    Total
                 ECC          rereads/     errors  algorithm      processed    uncorrected
             fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
  read:          0       14         0         14       488481      74496.712        0
  write:         0        0         0          0       126701      18438.443        0
  verify:        0        0         0          0        20107          0.370        0

  Non-medium error count: 0

  SMART Self-test log
  Num  Test             Status     segment  LifeTime  LBA_first_err  [SK ASC ASQ]
       Description                 number   (hours)
  # 1  Background long  Completed     -       7943         -         [-   -   -]
  # 2  Background long  Completed     -       7607         -         [-   -   -]
  # 3  Background long  Completed     -       7271         -         [-   -   -]

The only log I have is the following stack trace, taken from the server console :

  panic: I/O to pool 'home' appears to be hung on vdev guid 122... at '/dev/label/G23iscsi'.
  cpuid = 0
  KDB: stack backtrace:
  #0 0xffffffff80b240f7 at kdb_backtrace+0x67
  #1 0xffffffff80ad9462 at vpanic+0x182
  #2 0xffffffff80ad92d3 at panic+0x43
  #3 0xffffffff82238fa7 at vdev_deadman+0x127
  #4 0xffffffff82238ec0 at vdev_deadman+0x40
  #5 0xffffffff82238ec0 at vdev_deadman+0x40
  #6 0xffffffff8222d0a6 at spa_deadman+0x86
  #7 0xffffffff80af32da at softclock_call_cc+0x18a
  #8 0xffffffff80af3854 at softclock+0x94
  #9 0xffffffff80a9348f at intr_event_execute_handlers+0x20f
  #10 0xffffffff80a936f6 at ithread_loop+0xc6
  #11 0xffffffff80a900d5 at fork_exit+0x85
  #12 0xffffffff80f846fe at fork_trampoline+0xe
  Uptime: 92d2h47m6s

I would have been pleased to make a dump available. However, despite my (correct ?) configuration, the server did not dump.
(Nevertheless, "sysctl debug.kdb.panic=1" does make it dump.)

  # grep ^dump /boot/loader.conf /etc/rc.conf
  /boot/loader.conf:dumpdev="/dev/mirror/swap"
  /etc/rc.conf:dumpdev="AUTO"

  # gmirror list swap
  Components: 2
  Balance: prefer
  Providers:
  1. Name: mirror/swap
     Mediasize: 8589934080 (8.0G)
  Consumers:
  1. Name: label/swap1
     State: ACTIVE
     Priority: 0
  2. Name: label/swap2
     State: ACTIVE
     Priority: 1

I use the default kernel, with a rebuilt zfs module :

  # uname -v
  FreeBSD 11.0-RELEASE-p8 #0: Wed Feb 22 06:12:04 UTC 2017 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC

I use the following iSCSI configuration, which disconnects the disks "as soon as" they become unavailable :

  kern.iscsi.ping_timeout=5
  kern.iscsi.fail_on_disconnection=1
  kern.iscsi.iscsid_timeout=5

I therefore think the disk was still correctly reachable during these 20 busy minutes, otherwise iSCSI would have disconnected it.

So, any idea why I could have faced this issue ?

I would have expected ZFS to take the busy device offline instead of raising a panic.
Perhaps it is already possible to make ZFS behave like this ?

Thank you very much for your help & support !

Best regards,

Ben
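P.S. For what it's worth, the behaviour I would have expected from ZFS is roughly what can already be done by hand (a sketch only; device names are those of my pool above) :

  # fault the slow member; the 4-way mirror keeps running degraded
  zpool offline home label/G23iscsi
  # confirm the pool is degraded but still up
  zpool status home
  # once the iSCSI target is healthy again, resilver it back in
  zpool online home label/G23iscsi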