Date: Tue, 9 Apr 2013 12:52:01 +0100
From: Tom Evans <tevans.uk@googlemail.com>
To: Quartz <quartz@sneakertech.com>
Cc: FreeBSD FS <freebsd-fs@freebsd.org>
Subject: Re: ZFS: Failed pool causes system to hang
Message-ID: <CAFHbX1LO9OvbqyYYaob-7nQSA_dwQkMK7%2Bvn9c4QrXQuKvTCFA@mail.gmail.com>
In-Reply-To: <5163F03B.9060700@sneakertech.com>
References: <2092374421.4491514.1365459764269.JavaMail.root@k-state.edu> <5163F03B.9060700@sneakertech.com>

On Tue, Apr 9, 2013 at 11:40 AM, Quartz <quartz@sneakertech.com> wrote:
>
>> So, you're not really waiting a long time....
>
>
> I still don't think you're 100% clear on what's happening in my case. I'm
> trying to explain that my problem is *prior* to the motherboard resetting,
> NOT after. If I hard-reset the machine with the front panel switch, it boots
> just fine every time.
>
> When my pool *FAILS* (ie; is unrecoverable because I lost too many drives)
> it hangs effectively all io on the entire machine. I can't cd or ls
> directories, I can't run any zfs commands, and I can't issue a reboot or
> halt. This is a hang. The machine is completely useless in this state. There
> is no disk or cpu activity churning. There's no pool (anymore) to be trying
> to resilver or whatever anyway.
>
> I'm not going to wait 3+ hours for "shutdown -r now" to bring the machine
> down. Especially not when I already know that zfs won't let it.
>

I think what Lawrence is trying to explain is that a "hang" is not
necessarily a deadlock. Leaving the system for an extended period may
bring it back. What you are saying is also valid, that a hang that
long is equivalent to a deadlock in your usage.

Computers, even essential dedicated servers, sometimes hang, which is
why it is common to have some way of remotely power cycling. If your
server is important, you need some sort of RAC for these scenarios.

So, how to find out where the hang is. Your ZFS pools and your root
disk probably - I've not seen a dmesg - share one thing in common,
ATA/AHCI. If root does not also use this, does losing the pool still
cause problems with root? Perhaps breaking into ddb at this point
could tell us something.

Cheers

Tom
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAFHbX1LO9OvbqyYYaob-7nQSA_dwQkMK7%2Bvn9c4QrXQuKvTCFA>