Date: Sat, 13 Apr 2013 08:41:30 -0700
From: Jeremy Chadwick <jdc@koitsu.org>
To: Quartz <quartz@sneakertech.com>
Cc: freebsd-fs@freebsd.org
Subject: Re: A failed drive causes system to hang
Message-ID: <20130413154130.GA877@icarus.home.lan>
In-Reply-To: <516917CA.5040607@sneakertech.com>
References: <mailman.11.1365681601.78138.freebsd-fs@freebsd.org>
 <51672164.1090908@o2.pl> <20130411212408.GA60159@icarus.home.lan>
 <5168821F.5020502@o2.pl> <20130412220350.GA82467@icarus.home.lan>
 <516917CA.5040607@sneakertech.com>
On Sat, Apr 13, 2013 at 04:31:06AM -0400, Quartz wrote:
> >If the ZFS layer
> >is waiting on CAM, and CAM is waiting on your hardware, then those I/O
> >requests are going to block indefinitely.
>
> >2. I agree that the problem is not likely in ZFS, but rather either with
> >CAM, the AHCI implementation used, or hardware (either disk or storage
> >controller).
>
> Question:
>
> How (or does) this relate to the hang that I'm seeing with my
> system?

It doesn't relate in any way, shape, or form.

This is what happens when end-users start to try and "correlate" issues
to one another's without actually taking the time to fully read the
thread and follow along actively. This has now happened *twice* with
this thread (once from user Lawrence K. Chen, and now another from
radiomlodychbandytow@o2.pl).

This sort of behavioural thing has happened with FreeBSD, particularly
with regards to storage/filesystems/etc., for as long as I can remember.
I am not going to get into a discussion on how to solve such social
dilemmas because the procedure is to use send-pr and wait for someone
in-the-know to respond asking for relevant information. The FreeBSD
Handbook goes over how to file a PR and what to put in it.

http://www.freebsd.org/send-pr.html
http://www.freebsd.org/doc/en_US.ISO8859-1/articles/problem-reports/article.html

> You mentioned cam issues when talking to me earlier, but
> less decisively than your comment here. What's the difference?

Your issue: "on my raidz2 pool, when I lose more than 2 disks, I/O to
the pool stalls indefinitely, but I can still use the system barring
ZFS-related things; I don't know how to get the system back into a
usable state from this situation". That's based on these two
statements:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016822.html
http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016847.html

radiomlodychbandytow@o2.pl's issue: "I'm seeing ATA-level errors from
one or more of my disks, can someone help?"

Lawrence K. Chen's issue: "I had a crash/issue and then the system hung
for a very long time at the mountroot phase".

Given the information known at this time, ALL THREE of these issues are
unrelated to one another.

As I've said elsewhere: it is very important every single issue reported
is handled individually/separately. I was given this advice from a
FreeBSD kernel developer some years ago and it's excellent. It might
seem logical to try and correlate such things, but a lot of the time
this turns out to be wrong and is a great waste of everyone's time. So
Just Don't Do It(tm).

> >We're also
> >going to need to see "zpool status" output, as well as "zpool get all"
> >and "zfs get all". "pciconf -lvbc" would also be useful.
>
> You never asked for these when talking to me, but I can provide any
> of it if you want to look at it.

At this point in the conversation, WRT your issue, there's no indication
that it would help, but you've already given dmesg output:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016840.html

Else, all you've provided so far is a general explanation. You have
still not provided concise step-by-step information like I've asked.
I've gone so far as to give you an example of what to provide:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016814.html

I will again point to the 2nd-to-last paragraph of my above referenced
mail.

Another example of troubleshooting and how to do it: here's effort I
went through over the course of some months to track down a bug in CAM:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-January/016324.html

READ: I'm not saying your issue is with CAM (it may be, but it may not
be -- there isn't enough information right now to determine that). I'm
giving you an example of the troubleshooting/debugging effort that has
to go into things for issues of this nature. You can even see from my
quoted material in that link that I spent many hours doing step-by-step
QA only to find I messed up in the process and had to start over the
following day. It happens.

Once concise details are given and (highly preferable!) a step-by-step
way to reproduce the issue 100% of the time (including all commands, all
output seen, all physical actions taken, etc.), then the kernel folks
tend to get involved.

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |
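
The diagnostic commands named in this message ("zpool status", "zpool get all", "zfs get all", "pciconf -lvbc", plus dmesg) could be gathered into one report for a problem report. The following is only a rough sketch under assumptions: the helper name collect and the report layout are hypothetical and not from the thread, and some ZFS versions may require a pool name after "zpool get all".

```python
import shutil
import subprocess

# Commands quoted in the message above. Whether "zpool get all" needs a
# pool name appended depends on the ZFS version (an assumption to check).
COMMANDS = [
    ["dmesg"],
    ["zpool", "status"],
    ["zpool", "get", "all"],
    ["zfs", "get", "all"],
    ["pciconf", "-lvbc"],
]


def collect(commands=COMMANDS):
    """Run each command and return one text report (hypothetical helper)."""
    sections = []
    for argv in commands:
        header = "$ " + " ".join(argv)
        # Skip tools that are not installed instead of raising an error.
        if shutil.which(argv[0]) is None:
            sections.append(header + "\n(not found on this system)\n")
            continue
        result = subprocess.run(argv, capture_output=True, text=True)
        # Keep stderr too, since error text is part of "all output seen".
        body = result.stdout + result.stderr
        sections.append(f"{header}\n{body}\n(exit status {result.returncode})\n")
    return "\n".join(sections)
```

Calling print(collect()) as root on the affected machine would produce text that can be pasted into a PR alongside the step-by-step reproduction; it does not replace that step-by-step description.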
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20130413154130.GA877>