From owner-freebsd-fs@FreeBSD.ORG Sat Jun 20 22:14:48 2015
Date: Sat, 20 Jun 2015 22:14:32 +0000
From: Steve Wills
To: Willem Jan Withagen
Cc: fs@freebsd.org
Subject: Re: This diskfailure should not panic a system, but just disconnect disk from ZFS
Message-ID: <20150620221431.GB26416@mouf.net>
In-Reply-To: <5585767B.4000206@digiware.nl>
References: <5585767B.4000206@digiware.nl>

On Sat, Jun 20, 2015 at 04:19:39PM +0200, Willem Jan Withagen wrote:
> Hi,
>
> Found my system rebooted this morning:
>
> Jun 20 05:28:33 zfs kernel: sonewconn: pcb 0xfffff8011b6da498: Listen
> queue overflow: 8 already in queue awaiting acceptance (48 occurrences)
> Jun 20 05:28:33 zfs kernel: panic: I/O to pool 'zfsraid' appears to be
> hung on vdev guid 18180224580327100979 at '/dev/da0'.
> Jun 20 05:28:33 zfs kernel: cpuid = 0
> Jun 20 05:28:33 zfs kernel: Uptime: 8d9h7m9s
> Jun 20 05:28:33 zfs kernel: Dumping 6445 out of 8174
> MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%
>
> Which leads me to believe that /dev/da0 went out on vacation, leaving
> ZFS in trouble....
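(Interjecting an aside before your pool layout below: if you want to
double-check which device that vdev guid really maps to, reading the
ZFS labels directly should show it. A minimal check, assuming da0 is
still attached and readable:

  # zdb -l /dev/da0 | grep -w guid

Each label carries a "guid:" line you can match against the number in
the panic message.)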
> But the array is:
> ----
> NAME                SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
> zfsraid            32.5T  13.3T  19.2T         -     7%    41%  1.00x  ONLINE  -
>   raidz2           16.2T  6.67T  9.58T         -     8%    41%
>     da0                -      -      -         -      -      -
>     da1                -      -      -         -      -      -
>     da2                -      -      -         -      -      -
>     da3                -      -      -         -      -      -
>     da4                -      -      -         -      -      -
>     da5                -      -      -         -      -      -
>   raidz2           16.2T  6.67T  9.58T         -     7%    41%
>     da6                -      -      -         -      -      -
>     da7                -      -      -         -      -      -
>     ada4               -      -      -         -      -      -
>     ada5               -      -      -         -      -      -
>     ada6               -      -      -         -      -      -
>     ada7               -      -      -         -      -      -
>   mirror            504M  1.73M   502M         -    39%     0%
>     gpt/log0           -      -      -         -      -      -
>     gpt/log1           -      -      -         -      -      -
> cache                  -      -      -         -      -      -
>   gpt/raidcache0    109G  1.34G   107G         -     0%     1%
>   gpt/raidcache1    109G   787M   108G         -     0%     0%
> ----
>
> And thus I'd have expected that ZFS would disconnect /dev/da0, switch
> to DEGRADED state, and continue, letting the operator fix the broken
> disk.
> Instead it chooses to panic, which is not a nice thing to do. :)
>
> Or do I have too high hopes of ZFS?
>
> Next question to answer is why this WD RED on:
>
> arcmsr0@pci0:7:14:0: class=0x010400 card=0x112017d3 chip=0x112017d3
> rev=0x00 hdr=0x00
>     vendor    = 'Areca Technology Corp.'
>     device    = 'ARC-1120 8-Port PCI-X to SATA RAID Controller'
>     class     = mass storage
>     subclass  = RAID
>
> got hung, and nothing about it shows up in SMART....
>

You may be hitting the ZFS deadman panic, which is triggered when
outstanding I/O to a vdev appears hung, e.g. because the controller
hangs. This can in some cases be caused by disks that die in unusual
ways.

> (If needed, vmcore available)
>

The backtrace might confirm or dispute my theory.

Steve
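P.S. A few things that might help while you dig. First, pulling the
backtrace yourself: assuming savecore put the dump in the default
/var/crash, something like

  # kgdb /boot/kernel/kernel /var/crash/vmcore.0
  (kgdb) bt

should print where the panic was raised (substitute whatever number
savecore assigned; kgdb should find kernel.symbols on its own if they
are installed alongside the kernel).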
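Second, the deadman itself is tunable, in case you would rather risk a
wedged pool than another panic while you investigate. From memory the
10.x knobs are (verify the names on your box):

  # sysctl vfs.zfs.deadman_enabled=0        # don't panic on hung I/O
  # sysctl -d vfs.zfs.deadman_synctime_ms   # how long an I/O may be
                                            # outstanding before it
                                            # counts as hung

The synctime one may be a boot-time tunable (/boot/loader.conf) rather
than settable at runtime.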
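Third, on SMART: if you were querying the daN pass-through devices, it
may be worth asking the Areca firmware directly as well; smartmontools
can address the controller's ports, e.g. for port 1:

  # smartctl -a -d areca,1 /dev/arcmsr0

(ports are numbered from 1; point it at whichever slot holds the WD
RED). That path sometimes reports more than the pass-through device
does, and a disk can hang a controller without logging anything SMART
counts as an error anyway.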