Date: Sat, 13 Apr 2013 08:41:30 -0700
From: Jeremy Chadwick
To: Quartz
Cc: freebsd-fs@freebsd.org
Subject: Re: A failed drive causes system to hang

On Sat, Apr 13, 2013 at 04:31:06AM -0400, Quartz wrote:
> >If the ZFS layer
> >is waiting on CAM, and CAM is waiting on your hardware, then those I/O
> >requests are going to block indefinitely.
>
> >2. I agree that the problem is not likely in ZFS, but rather either with
> >CAM, the AHCI implementation used, or hardware (either disk or storage
> >controller).
>
> Question:
>
> How (or does) this relate to the hang that I'm seeing with my
> system?

It doesn't relate in any way, shape, or form.

This is what happens when end-users start to try and "correlate" issues
to one another's without actually taking the time to fully read the
thread and follow along actively. This has now happened *twice* with
this thread (once from user Lawrence K. Chen, and now another from
radiomlodychbandytow@o2.pl).

This sort of behavioural thing has happened with FreeBSD, particularly
with regards to storage/filesystems/etc., for as long as I can remember.
I am not going to get into a discussion on how to solve such social
dilemmas because the procedure is to use send-pr and wait for someone
in-the-know to respond asking for relevant information. The FreeBSD
Handbook goes over how to file a PR and what to put in it.

http://www.freebsd.org/send-pr.html
http://www.freebsd.org/doc/en_US.ISO8859-1/articles/problem-reports/article.html

> You mentioned cam issues when talking to me earlier, but
> less decisively than your comment here. What's the difference?

Your issue: "on my raidz2 pool, when I lose more than 2 disks, I/O to
the pool stalls indefinitely, but I can still use the system barring
ZFS-related things; I don't know how to get the system back into a
usable state from this situation". That's based on these two
statements:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016822.html
http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016847.html

radiomlodychbandytow@o2.pl's issue: "I'm seeing ATA-level errors from
one or more of my disks, can someone help?"

Lawrence K. Chen's issue: "I had a crash/issue and then the system hung
for a very long time at the mountroot phase".

Given the information known at this time, ALL THREE of these issues are
unrelated to one another.

As I've said elsewhere: it is very important every single issue reported
is handled individually/separately. I was given this advice from a
FreeBSD kernel developer some years ago and it's excellent. It might
seem logical to try and correlate such things, but a lot of the time
this turns out to be wrong and is a great waste of everyone's time. So
Just Don't Do It(tm).

> >We're also
> >going to need to see "zpool status" output, as well as "zpool get all"
> >and "zfs get all". "pciconf -lvbc" would also be useful.
>
> You never asked for these when talking to me, but I can provide any
> of it if you want to look at it.

At this point in the conversation, WRT your issue, there's no indication
that it would help, but you've already given dmesg output:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016840.html

Else, all you've provided so far is a general explanation. You have
still not provided concise step-by-step information like I've asked.
I've gone so far as to give you an example of what to provide:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016814.html

I will again point to the 2nd-to-last paragraph of my above referenced
mail.

Another example of troubleshooting and how to do it: here's the effort I
went through over the course of some months to track down a bug in CAM:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-January/016324.html

READ: I'm not saying your issue is with CAM (it may be, but it may not
be -- there isn't enough information right now to determine that). I'm
giving you an example of the troubleshooting/debugging effort that has
to go into things for issues of this nature. You can even see from my
quoted material in that link that I spent many hours doing step-by-step
QA only to find I messed up in the process and had to start over the
following day. It happens.

Once concise details are given and (highly preferable!) a step-by-step
way to reproduce the issue 100% of the time (including all commands, all
output seen, all physical actions taken, etc.), then the kernel folks
tend to get involved.

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |