Message-ID: <5587FFCC.3080100@digiware.nl>
Date: Mon, 22 Jun 2015 14:30:04 +0200
From: Willem Jan Withagen <wjw@digiware.nl>
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64;
 rv:31.0) Gecko/20100101 Thunderbird/31.7.0
MIME-Version: 1.0
To: Quartz <quartz@sneakertech.com>, 
 Michelle Sullivan <michelle@sorbs.net>
CC: fs@freebsd.org
Subject: Re: This diskfailure should not panic a system, but just disconnect
 disk from ZFS
References: <5585767B.4000206@digiware.nl> <5587236A.6020404@sneakertech.com>
 <558769B5.601@sorbs.net> <55877393.3040704@sneakertech.com>
In-Reply-To: <55877393.3040704@sneakertech.com>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit

On 22/06/2015 04:31, Quartz wrote:
>>> You have a raidz2, which means THREE disks need to go down before the
>>> pool is unwritable. The problem is most likely your controller or
>>> power supply, not your disks.
>>>
>> Never make such assumptions...
>>
>> I have worked in a professional environment where 9 of 12 disks failed
>> within 24 hours of each other....
> 
> Right... but if that was his problem there should be some logs of the
> other drives going down first, and typically ZFS would correctly mark
> the pool as degraded (at least, it would in my testing). The fact that
> ZFS didn't get a chance to log anything and the pool came back up
> healthy leads me to believe the controller went south, taking several
> disks with it all at once and totally borking all IO. (Either that or
> what Tom Curry mentioned about the Arc issue, which I wasn't previously
> aware of).
> 
> Of course, if the issue isn't repeatable then who knows....

I do not think it was a full-out failure, but just one transaction that
got hit by an alpha particle...

Well, remember that the hung-diagnostics timeout is 1000 sec.
In the time span before the panic, nothing else was logged about
disks, controllers, etc. not functioning.

Only in the few secs before the panic did ctl/iSCSI and the network
interface start complaining that there was a memory shortage, and the
network interface started dumping packets....

But all that was logged really nicely in syslog. So I think that in the
1000 sec it took for the deadman switch to trigger, the zpool just
functioned as expected.... And the hardware somewhere lost one
transaction.
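
To make that reasoning concrete, here is a rough sketch of how I picture
the deadman check. It is a simplified model and not the actual ZFS kernel
code; the function name hung_ios and the I/O ids are made up for
illustration, and only the 1000 sec timeout comes from the discussion
above.

```python
# Simplified model of a deadman check; NOT the actual ZFS kernel code.
# All names here are hypothetical; only the 1000 s timeout is from the thread.
DEADMAN_TIMEOUT_SEC = 1000


def hung_ios(outstanding, now, timeout=DEADMAN_TIMEOUT_SEC):
    """Return the ids of I/Os that have been outstanding longer than timeout.

    outstanding maps a hypothetical I/O id to the time (in seconds) at
    which it was issued. Completed I/Os are simply not in the dict.
    """
    return sorted(
        io_id for io_id, issued in outstanding.items()
        if now - issued > timeout
    )


# One transaction issued at t=0 never completes; another was issued recently.
outstanding = {"txn-lost": 0.0, "txn-recent": 990.0}

print(hung_ios(outstanding, now=999.0))   # nothing is past the timeout yet
print(hung_ios(outstanding, now=1001.0))  # only the lost transaction trips it
```

In this model a single lost transaction is enough to trip the check, while
everything else on the pool keeps completing normally in the meantime.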

So I'll be crossing my fingers, and we'll see when/what/where the next
crash is going to occur. And work from there....

--WjW