Date:      Wed, 10 Apr 2013 15:17:50 -0400 (EDT)
From:      "Lawrence K. Chen, P.Eng." <lkchen@ksu.edu>
To:        Quartz <quartz@sneakertech.com>
Cc:        FreeBSD FS <freebsd-fs@freebsd.org>
Subject:   Re: ZFS: Failed pool causes system to hang
Message-ID:  <499967956.5577199.1365621470123.JavaMail.root@k-state.edu>
In-Reply-To: <5163F03B.9060700@sneakertech.com>



----- Original Message -----
> 
> > So, you're not really waiting a long time....
> 
> I still don't think you're 100% clear on what's happening in my case.
> I'm trying to explain that my problem is *prior* to the motherboard
> resetting, NOT after. If I hard-reset the machine with the front panel
> switch, it boots just fine every time.
> 
> When my pool *FAILS* (ie; is unrecoverable because I lost too many
> drives) it hangs effectively all io on the entire machine. I can't cd or
> ls directories, I can't run any zfs commands, and I can't issue a reboot
> or halt. This is a hang. The machine is completely useless in this
> state. There is no disk or cpu activity churning. There's no pool
> (anymore) to be trying to resilver or whatever anyway.
> 
> I'm not going to wait 3+ hours for "shutdown -r now" to bring the
> machine down. Especially not when I already know that zfs won't let it.
> 

Well, that's a different kind of hang. It's the same kind of hang you get when an NFS fileserver goes away: anything that accesses the non-responding mounts will block until the server responds again.

And it's apparently by design, since failmode=wait is the zpool default, which means all I/O on the system keeps retrying the devices.

The other options are failmode=continue and failmode=panic. failmode=continue is described as letting the system continue on as if nothing has changed. I'm not sure what it means for the system to continue; I suppose it just immediately errors out the I/O operations to the affected mounts. failmode=panic causes the system to panic and dump core, and what happens after that depends on what you have set the system to do after a panic.

failmode=panic might be what you want in this case, since you can't shut down gracefully with all I/O hanging. That said, the shutdown timer still seems to kick in when I've had this happen, although the watchdog seems to get turned off early on in the shutdown process. Panic doesn't always seem to reboot, either...though maybe I should see about having it not reboot.
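
For reference, here is a rough Python sketch of checking and changing the property. It only builds the `zpool get` / `zpool set` command lines and validates the value. The pool name "tank" and the helper names are made up for illustration, and actually running the commands needs root and an imported pool. Presumably the property has to be set while the pool is still healthy, since zpool commands reportedly hang once it has failed under failmode=wait.

```python
import subprocess

# The three values documented for the zpool "failmode" property.
FAILMODES = ("wait", "continue", "panic")


def failmode_get_cmd(pool):
    """Build the command that shows a pool's current failmode."""
    return ["zpool", "get", "failmode", pool]


def failmode_set_cmd(pool, mode):
    """Build the command that changes a pool's failmode."""
    if mode not in FAILMODES:
        raise ValueError("failmode must be one of %s, got %r" % (FAILMODES, mode))
    return ["zpool", "set", "failmode=%s" % mode, pool]


def run(cmd):
    """Execute a zpool command and return its output (needs root)."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # "tank" is a hypothetical pool name; substitute your own.
    # Only print the commands here; pass them to run() to execute them.
    print(" ".join(failmode_get_cmd("tank")))
    print(" ".join(failmode_set_cmd("tank", "panic")))
```

Run as-is, this just prints the two command lines so they can be checked before anything touches the pool.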

I suppose I could try failmode=panic or failmode=continue myself. I have a problem where, if there's a power dropout, the transition to and from UPS battery will sometimes lock up one enclosure or another, and there's no way to redistribute the disks such that I won't lose one zpool or another. I can't seem to get FreeBSD to redetect the enclosure, so rebooting is what gets it seeing the drives again.

I've changed out enclosures, so it might be something about this particular UPS, or the fact that there's a power conditioner on this circuit, since another system at the other end of the building has never had this kind of problem, and it had more hanging off it. I'm sure there'll be a dropout or worse in the near future, especially with spring thunderstorms lurking already.

I've been thinking of getting a double-conversion UPS for this server...

Otherwise, the filesystems on these zpools aren't really critical to the operation of the server (though the contents are important to me): one is my backup pool (backuppc), and another has archival/replication data (plus data that I extracted from a corrupt drive that I still need to go through to see what's usable).


