From owner-freebsd-fs@FreeBSD.ORG  Sat Apr 13 15:41:32 2013
Date: Sat, 13 Apr 2013 08:41:30 -0700
From: Jeremy Chadwick <jdc@koitsu.org>
To: Quartz <quartz@sneakertech.com>
Subject: Re: A failed drive causes system to hang
Message-ID: <20130413154130.GA877@icarus.home.lan>
References: <mailman.11.1365681601.78138.freebsd-fs@freebsd.org>
 <51672164.1090908@o2.pl> <20130411212408.GA60159@icarus.home.lan>
 <5168821F.5020502@o2.pl> <20130412220350.GA82467@icarus.home.lan>
 <516917CA.5040607@sneakertech.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <516917CA.5040607@sneakertech.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: freebsd-fs@freebsd.org

On Sat, Apr 13, 2013 at 04:31:06AM -0400, Quartz wrote:
> >If the ZFS layer
> >is waiting on CAM, and CAM is waiting on your hardware, then those I/O
> >requests are going to block indefinitely.
> 
> >2. I agree that the problem is not likely in ZFS, but rather either with
> >CAM, the AHCI implementation used, or hardware (either disk or storage
> >controller).
> 
> Question:
> 
> How (or does) this relate to the hang that I'm seeing with my
> system?

It doesn't relate in any way, shape, or form.

This is what happens when end-users try to "correlate" their issues
with one another's without actually taking the time to fully read the
thread and follow along actively.  This has now happened *twice* with
this thread (once from user Lawrence K. Chen, and now from you).

This sort of behaviour has happened with FreeBSD, particularly with
regard to storage/filesystems/etc., for as long as I can remember.

I am not going to get into a discussion on how to solve such social
dilemmas.  The procedure for reporting a problem is to use send-pr and
wait for someone in-the-know to respond asking for relevant
information.  The FreeBSD Handbook goes over how to file a PR and what
to put in it.

http://www.freebsd.org/send-pr.html
http://www.freebsd.org/doc/en_US.ISO8859-1/articles/problem-reports/article.html

> You mentioned cam issues when talking to me earlier, but
> less decisively than your comment here. What's the difference?

Your issue: "on my raidz2 pool, when I lose more than 2 disks, I/O to
the pool stalls indefinitely, but I can still use the system barring
ZFS-related things; I don't know how to get the system back into a
usable state from this situation".  That's based on these two
statements:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016822.html
http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016847.html

radiomlodychbandytow@o2.pl's issue: "I'm seeing ATA-level errors from
one or more of my disks, can someone help?"

Lawrence K. Chen's issue: "I had a crash/issue and then the system hung
for a very long time at the mountroot phase".

Given the information known at this time, ALL THREE of these issues are
unrelated to one another.

As I've said elsewhere: it is very important that every single issue
reported is handled individually/separately.  I was given this advice by
a FreeBSD kernel developer some years ago and it's excellent.  It might
seem logical to try and correlate such things, but a lot of the time
the correlation turns out to be wrong and is a great waste of everyone's
time.  So Just Don't Do It(tm).

> >We're also
> >going to need to see "zpool status" output, as well as "zpool get all"
> >and "zfs get all".  "pciconf -lvbc" would also be useful.
> 
> You never asked for these when talking to me, but I can provide any
> of it if you want to look at it.

At this point in the conversation, WRT your issue, there's no indication
that it would help, but you've already given dmesg output:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016840.html

Else, all you've provided so far is a general explanation.  You have
still not provided concise step-by-step information like I've asked.
I've gone so far as to give you an example of what to provide:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016814.html

I will again point to the 2nd-to-last paragraph of my above referenced
mail.

Another example of troubleshooting and how to do it: here's the effort
I went through over the course of some months to track down a bug in CAM:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-January/016324.html

READ: I'm not saying your issue is with CAM (it may be, but it may not
be -- there isn't enough information right now to determine that).  I'm
giving you an example of the troubleshooting/debugging effort that has
to go into things for issues of this nature.  You can even see from my
quoted material in that link that I spent many hours doing step-by-step
QA only to find I messed up in the process and had to start over the
following day.  It happens.

Once concise details are given and (highly preferable!) a step-by-step
way to reproduce the issue 100% of the time (including all commands, all
output seen, all physical actions taken, etc.), then the kernel folks
tend to get involved.
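
For what it's worth, one way to capture "all output seen" for a report
is a small sh script that saves the output of each command to its own
file.  This is only a sketch, not an official procedure: the collect
function, the OUTDIR variable and the zfs-report directory name are
made up for illustration.  The commands themselves are the ones already
named in this thread.

```shell
#!/bin/sh
# Sketch: save the output of each diagnostic command to its own file.
# OUTDIR, "zfs-report" and collect() are illustrative names only.
outdir="${OUTDIR:-./zfs-report}"
mkdir -p "$outdir"

# collect NAME COMMAND [ARGS...]
# Runs COMMAND and stores stdout+stderr in $outdir/NAME.txt.  If the
# command is missing or fails, that fact is recorded in the file.
collect() {
    name="$1"; shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" > "$outdir/$name.txt" 2>&1 ||
            echo "exit status: $?" >> "$outdir/$name.txt"
    else
        echo "$1: command not found on this system" > "$outdir/$name.txt"
    fi
}

collect zpool-status  zpool status
collect zpool-get-all zpool get all
collect zfs-get-all   zfs get all
collect pciconf       pciconf -lvbc
collect dmesg         dmesg
```

Run it once before and once after triggering the problem (with a
different OUTDIR each time) so the two sets of files can be compared.
It does not replace writing down the physical actions taken at each
step.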

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |