Message-ID: <5587FFCC.3080100@digiware.nl>
Date: Mon, 22 Jun 2015 14:30:04 +0200
From: Willem Jan Withagen
To: Quartz, Michelle Sullivan
CC: fs@freebsd.org
Subject: Re: This diskfailure should not panic a system, but just disconnect disk from ZFS
References: <5585767B.4000206@digiware.nl> <5587236A.6020404@sneakertech.com> <558769B5.601@sorbs.net> <55877393.3040704@sneakertech.com>
In-Reply-To: <55877393.3040704@sneakertech.com>
List-Id: Filesystems

On 22/06/2015 04:31, Quartz wrote:
>>> You have a raidz2, which means THREE disks need to go down before the
>>> pool is unwritable. The problem is most likely your controller or
>>> power supply, not your disks.
>>>
>> Never make such assumptions...
>>
>> I have worked in a professional environment where 9 of 12 disks failed
>> within 24 hours of each other....
>
> Right... but if that was his problem there should be some logs of the
> other drives going down first, and typically ZFS would correctly mark
> the pool as degraded (at least, it would in my testing). The fact that
> ZFS didn't get a chance to log anything and the pool came back up
> healthy leads me to believe the controller went south, taking several
> disks with it all at once and totally borking all IO. (Either that or
> what Tom Curry mentioned about the Arc issue, which I wasn't previously
> aware of).
>
> Of course, if it issue isn't repeatable then who knows....

I do not think it was a full out failure, but just one transaction that
got hit by an alpha-particle...

Well, remember that the hung-diagnostics timeout is 1000 sec. In the
time span before the panic, nothing else was logged about
disks/controllers/etc. not functioning. Only in the few seconds before
the panic did ctl/iSCSI and the network interface start complaining that
there was a memory shortage, and the network interface started dumping
packets.... But all of that was logged really nicely in syslog.

So I think that in the 1000 sec it took for the deadman switch to
trigger, the zpool just functioned as expected, and the hardware
somewhere lost one transaction.

So I'll be crossing my fingers, and we'll see when/what/where the next
crash is going to occur. And work from there....

--WjW