Date:      Thu, 11 Apr 2013 14:24:08 -0700
From:      Jeremy Chadwick <jdc@koitsu.org>
To:        Radio młodych bandytów <radiomlodychbandytow@o2.pl>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: A failed drive causes system to hang
Message-ID:  <20130411212408.GA60159@icarus.home.lan>
In-Reply-To: <51672164.1090908@o2.pl>
References:  <mailman.11.1365681601.78138.freebsd-fs@freebsd.org> <51672164.1090908@o2.pl>

On Thu, Apr 11, 2013 at 10:47:32PM +0200, Radio młodych bandytów wrote:
> Seeing a ZFS thread, I decided to write about a similar problem that
> I experience.
> I have a failing drive in my array. I need to RMA it, but don't have
> time, and it fails rarely enough to be yet another annoyance.
> The failure is simple: it fails to respond.
> When it happens, the only thing I found I can do is switch consoles.
> Any command fails, login fails, apps hang.
> 
> On the 1st console I see a series of messages like:
> 
> (ada0:ahcich0:0:0:0): CAM status: Command timeout
> (ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
> (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED
> 
> I use RAIDZ1 and I'd expect that no single failure would cause the
> system to fail...

You need to provide full output from "dmesg", and you need to define
what the word "fails" means (re: "any command fails", "login fails").

I've already demonstrated that loss of a disk in raidz1 (or even 2 disks
in raidz2) does not cause "the system to fail" on stable/9.  However,
if you lose enough members or vdevs to cause catastrophic failure, there
may be anomalies depending on how your system is set up:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016814.html

If the pool has failmode=wait, any I/O to that pool will block (wait)
indefinitely.  This is the default.

If the pool has failmode=continue, existing write I/O operations will
fail with EIO (I/O error) (and hopefully applications/daemons will
handle that gracefully -- if not, that's their fault) but any subsequent
I/O (read or write) to that pool will block (wait) indefinitely.

If the pool has failmode=panic, the kernel will immediately panic.
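
For reference, the current setting can be checked and changed with
zpool(8) -- "tank" below is just a placeholder pool name, substitute
your own:

  # zpool get failmode tank
  NAME  PROPERTY  VALUE     SOURCE
  tank  failmode  wait      default
  # zpool set failmode=continue tank

Keep in mind failmode only governs what happens once the pool itself
has faulted; it does not change how long individual commands sit at
the CAM/AHCI layer before timing out.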

If the CAM layer is what's wedged, that may be a different issue (and
not related to ZFS).  I would suggest running stable/9, as many
improvements in this regard have been committed recently (some related
to CAM, others related to ZFS and its new "deadman" watcher).

Bottom line: terse output of the problem does not help.  Be verbose:
provide all output (commands you type, everything!), along with a
description of any physical actions you take.
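
For example, something along these lines would be a reasonable
starting point (smartctl requires sysutils/smartmontools; adjust the
device name to match your failing disk):

  # uname -a
  # dmesg
  # zpool status -v
  # camcontrol devlist
  # smartctl -a /dev/ada0

plus the approximate time the hang happened and whatever you did to
recover (reset button, power cycle, pulling the disk, etc.).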

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |


