Date: Thu, 11 Apr 2013 14:24:08 -0700
From: Jeremy Chadwick <jdc@koitsu.org>
To: Radio młodych bandytów <radiomlodychbandytow@o2.pl>
Cc: freebsd-fs@freebsd.org
Subject: Re: A failed drive causes system to hang
Message-ID: <20130411212408.GA60159@icarus.home.lan>
In-Reply-To: <51672164.1090908@o2.pl>
References: <mailman.11.1365681601.78138.freebsd-fs@freebsd.org> <51672164.1090908@o2.pl>
On Thu, Apr 11, 2013 at 10:47:32PM +0200, Radio młodych bandytów wrote:
> Seeing a ZFS thread, I decided to write about a similar problem that
> I experience.
> I have a failing drive in my array. I need to RMA it, but I don't have
> time, and it fails rarely enough to be yet another annoyance.
> The failure is simple: it fails to respond.
> When it happens, the only thing I found I can do is switch consoles.
> Any command fails, login fails, apps hang.
>
> On the 1st console I see a series of messages like:
>
> (ada0:ahcich0:0:0:0): CAM status: Command timeout
> (ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
> (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED
>
> I use RAIDZ1 and I'd expect that no single failure would cause the
> system to fail...

You need to provide full output from "dmesg", and you need to define
what the word "fails" means (re: "any command fails", "login fails").

I've already demonstrated that loss of a disk in raidz1 (or even 2
disks in raidz2) does not cause "the system to fail" on stable/9.
However, if you lose enough members or vdevs to cause catastrophic
failure, there may be anomalies depending on how your system is set up:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016814.html

If the pool has failmode=wait, any I/O to that pool will block (wait)
indefinitely. This is the default.

If the pool has failmode=continue, existing write I/O operations will
fail with EIO (I/O error) (and hopefully applications/daemons will
handle that gracefully -- if not, that's their fault), but any
subsequent I/O (read or write) to that pool will block (wait)
indefinitely.

If the pool has failmode=panic, the kernel will immediately panic.

If the CAM layer is what's wedged, that may be a different issue (and
not related to ZFS). I would suggest running stable/9, as many
improvements in this regard have been committed recently (some related
to CAM, others related to ZFS and its new "deadman" watcher).

Bottom line: terse output of the problem does not help. Be verbose,
provide all output (commands you type, everything!), as well as any
physical actions you take.

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |
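[For reference, the failmode behavior described above is a per-pool ZFS
property that can be inspected and changed with zpool(8). A minimal
sketch follows; the pool name "tank" is a placeholder for the poster's
actual pool:

    # show the current failmode setting (the default is "wait")
    zpool get failmode tank

    # switch to "continue" so in-flight writes return EIO instead of
    # blocking indefinitely when the pool becomes unavailable
    zpool set failmode=continue tank

    # confirm pool health and per-device state after the change
    zpool status tank

Whether "continue" is preferable depends on how well the applications
on the pool handle EIO; the default "wait" simply blocks until the
device returns or the pool is cleared.]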
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20130411212408.GA60159>