From owner-freebsd-fs@FreeBSD.ORG Wed Apr 10 19:37:57 2013
Date: Wed, 10 Apr 2013 15:17:50 -0400 (EDT)
From: "Lawrence K. Chen, P.Eng." <lkchen@k-state.edu>
To: Quartz
Cc: FreeBSD FS <freebsd-fs@freebsd.org>
Message-ID: <499967956.5577199.1365621470123.JavaMail.root@k-state.edu>
In-Reply-To: <5163F03B.9060700@sneakertech.com>
Subject: Re: ZFS: Failed pool causes system to hang

----- Original Message -----
> > So, you're not really waiting a long time....
>
> I still don't think you're 100% clear on what's happening in my case.
> I'm trying to explain that my problem is *prior* to the motherboard
> resetting, NOT after. If I hard-reset the machine with the front panel
> switch, it boots just fine every time.
> When my pool *FAILS* (ie; is unrecoverable because I lost too many
> drives) it hangs effectively all io on the entire machine. I can't cd
> or ls directories, I can't run any zfs commands, and I can't issue a
> reboot or halt. This is a hang. The machine is completely useless in
> this state. There is no disk or cpu activity churning. There's no pool
> (anymore) to be trying to resilver or whatever anyway.
>
> I'm not going to wait 3+ hours for "shutdown -r now" to bring the
> machine down. Especially not when I already know that zfs won't let it.

Well, that's a different kind of hang... it's the same kind of hang you get when an NFS file server goes away: anything that accesses the non-responding mounts blocks until the server responds again. And apparently by design, since failmode=wait is the default for a zpool, which means all I/O on the system keeps retrying the devices.

The other options are failmode=continue, which is described as letting the system carry on as if nothing had changed (not sure exactly what that means in practice... I suppose it just immediately errors out new I/O operations to the affected mounts), and failmode=panic, which causes the system to panic and dump core. What happens after that depends on what you have the system set to do after a panic.

failmode=panic might be what you want in this case, since you can't shut down gracefully with all I/O hanging. Though the shutdown timer seems to still kick in when I've had this happen, and the watchdog seems to get turned off early in the shutdown process. Panic doesn't always seem to reboot, either... though maybe I should see about having it not reboot.

I suppose I could try failmode=panic or failmode=continue... I have a problem where, if there's a power dropout, the transition to and from UPS battery will sometimes lock up one enclosure or another... and there's no way to redistribute disks such that I won't lose one zpool or another.
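For reference, the failmode property discussed above can be inspected and changed per pool with zpool(8). A minimal sketch; "tank" is just a placeholder pool name:

```shell
# Show the current failmode setting (defaults to "wait"):
zpool get failmode tank

# Panic on catastrophic pool failure, so the box panics/reboots
# instead of wedging all I/O indefinitely:
zpool set failmode=panic tank

# Or error out new I/O immediately instead of blocking:
zpool set failmode=continue tank
```

The property takes effect immediately and persists across reboots, so it can be set ahead of time on a pool that's at risk.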
I can't seem to get FreeBSD to redetect the enclosure, so rebooting is what gets it seeing the drives again. I've changed out enclosures... so it might be something about this particular UPS, or the fact that there's a power conditioner on this circuit, since another system at the other end of the building has never had this kind of problem, and it had more hanging off it. I'm sure there'll be a dropout or worse in the near future... especially with spring thunderstorms lurking already. I've been thinking of getting a double-conversion UPS for this server...

Otherwise, the filesystems on these zpools aren't really critical to the operation of the server (though the contents are important to me): one is my backup pool (backuppc) and another holds archival/replication data (plus data I extracted from a corrupt drive that I need to go through to see what's usable...).