From: Quartz <quartz@sneakertech.com>
To: Jeremy Chadwick
Cc: freebsd-fs@freebsd.org
Date: Thu, 21 Mar 2013 13:22:39 -0400
Subject: ZFS: Failed pool causes system to hang
Message-ID: <514B41DF.4010802@sneakertech.com>
In-Reply-To: <20130321085304.GB16997@icarus.home.lan>

So looking through the past couple of emails, it doesn't appear that
anyone included a complete copy of my description when CCing -fs, so I'm
going to do that now just so there's no confusion. I'll respond to
Jeremy's questions separately.

------

I have a raidz2 pool composed of six SATA drives connected via my
motherboard's Intel southbridge SATA ports. All of the BIOS RAID options
are disabled and the drives are in straight AHCI mode (hotswap enabled).
The system (accounts, home dir, etc.) is installed on a separate seventh
drive formatted as plain UFS, connected to a separate non-Intel
motherboard port.

As part of my initial stress testing, I'm simulating failures by pulling
the SATA cable on various drives in the six-drive pool. If I pull two
drives, the pool goes into the DEGRADED state and everything works as
expected: I can zero and replace the drives, etc., no problem.

However, when I pull a third drive, the machine becomes VERY unstable. I
can nose around the boot drive just fine, but anything involving I/O
that so much as sneezes in the general direction of the pool hangs the
machine. Once this happens I can log in via SSH, but that's pretty much
it. I've reinstalled and tested this over a dozen times, and it's
perfectly repeatable:

- `ls` the directory where the pool is mounted? Hang.
- Already in that directory and try to `cd` back to my home dir? Hang.
- zpool destroy? Hang. zpool replace? Hang. zpool history? Hang.
- shutdown -r now? Gets halfway through, then hangs.
- reboot -q? Same as shutdown.

The machine never recovers (at least, not within 35 minutes, which is
the longest I'm willing to wait). Reconnecting the drives has no effect.
My only option is to hard-reset the machine with the front-panel button.

Googling suggested I try changing the pool's "failmode" property from
"wait" to "continue", but that doesn't appear to make any difference.

For reference, this is a virgin 9.1-RELEASE installed from the DVD image
with no ports, packages, or anything extra. I don't think I'm doing
anything wrong procedure-wise.
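In case it helps anyone reproduce this, the sequence I'm running looks
roughly like the following. This is a sketch from memory; the pool name
"tank" and the adaX device names are placeholders for my actual setup:

    # Pull the SATA cables on two of the pool drives, then confirm the
    # pool is merely DEGRADED and still usable.
    zpool status tank

    # Recovery works fine at this point: zero the reconnected drive and
    # swap it back in.
    dd if=/dev/zero of=/dev/ada3 bs=1m count=100
    zpool replace tank ada3

    # The failmode change I tried, per the suggestions I found. It made
    # no observable difference once a third drive was pulled.
    zpool get failmode tank
    zpool set failmode=continue tank

Pulling the third drive is where everything falls over: after that,
anything that touches the pool, including the commands above, hangs.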
I fully understand and accept that a raidz2 with three dead drives is
toast, but I will NOT accept having it take down the rest of the machine
with it. As it stands, I can't even reliably look at what state the pool
is in. I can't even nuke the pool and start over without taking the
whole machine offline.

______________________________________
it has a certain smooth-brained appeal