From: Jeremy Chadwick <jdc@koitsu.org>
To: Zaphod Beeblebrox
Cc: freebsd-fs, Radio =?unknown-8bit?B?bcS5P29keWNoIGJhbmR5dMQ/xT93?=, support@lists.pcbsd.org
Subject: Re: A failed drive causes system to hang
Date: Sun, 14 Apr 2013 12:44:40 -0700
Message-ID: <20130414194440.GB38338@icarus.home.lan>

On Sun, Apr 14, 2013 at 02:58:15PM -0400, Zaphod Beeblebrox wrote:
> I'd like to throw in my two cents here. I've seen this (drives in RAID-1
> configuration) hanging whole systems. Back in the IDE days, two drives
> were connected with one cable --- I largely wrote it off as a deficiency of
> IDE hardware and resolved to buy SCSI hardware for more important systems.
> Of late, the physical hardware for SCSI (SAS) and SATA drives have
> converged. I'm willing to accept that SAS hardware may be built to a
> different standard, but I'm suspicious of the fact that a bad SATA drive on
> an ACH* controller can hang the whole system.

Note to readers: this is borderline off-topic and is going to confuse
the thread even more. I will respond to this ONLY ONCE, and WILL NOT be
responding to this part of the thread past this point.

I have only seen this happen on very specific controllers (JMicron for
example), where either the AHCI driver was broken/badly written, or the
underlying AHCI option ROM/firmware code was broken/badly written.

> ... it's not complete, however. Often pulling the drive's cable will
> unfreeze things. It's also not entirely consistent. Drives I have
> behind 4:1 port multipliers haven't (so far) hung the system that
> they're on (which uses ACH10). Right now, I have a remote ACH10
> system that's hung hard a couple of times --- and it passes both its
> short and long SMART tests on both drives.

PMPs (port multipliers) are a *completely* separate beast, where some
AHCI controllers (at a silicon level) screw up/break.
In fact, the IXP600/700 is one such controller, and workarounds had to
be put into FreeBSD and Linux for them. I can dig up the commits if
need be.

Rule of thumb (which you know -- this is for other readers): when using
a PMP, it's VERY IMPORTANT that this be disclosed up front. These add a
serious complication to analysis of the SATA subsystem as a whole, and
in a lot of cases visibility into details is lost as a result. PMPs in
general are "bleh".

> Is there no global timeout we can depend on here?

Please see kern.cam.ada.default_timeout (for adaX devices) and
kern.cam.pmp.default_timeout (for I/O requests going across a PMP).

Otherwise Alexander Motin (mav@) would be the guy to ask about PMP
issues, and/or get him hardware + provide a reliable reproduction
methodology for the issue.

All the above said:

Respectfully, please do not conflate your issue with this one. Please
start a new thread about this issue if you wish (do not reply to this
thread and change the Subject line; please actually start a brand new
Email to ensure no References headers are retained).

There is already too much crap going on in this thread, with 4 different
people with what are 4 different issues, and nobody at this point is
able to keep track of it all (including the participants).

This situation happens way, WAY too often with storage-related matters
on the list. ANYTHING ZFS-related and ANYTHING storage-related results
in bandwagon-jumping and threads that spiral out of control/become
almost useless and certainly impossible to follow. It needs to stop.

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |
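
For readers who want to check the two timeout sysctls named in the message (kern.cam.ada.default_timeout and kern.cam.pmp.default_timeout), here is a minimal sketch in Python. Only the two sysctl names come from the message. The helper names `parse_sysctl_output` and `read_cam_timeouts` are hypothetical, and the script assumes that sysctl(8) prints one "name: value" pair per line, as it does on FreeBSD. On systems without these sysctls it simply returns an empty result.

```python
import shutil
import subprocess

# Sysctl names taken from the message; everything else here is illustrative.
CAM_TIMEOUT_SYSCTLS = (
    "kern.cam.ada.default_timeout",  # adaX devices
    "kern.cam.pmp.default_timeout",  # I/O requests going across a PMP
)


def parse_sysctl_output(text):
    """Parse sysctl output lines of the form 'name: value' into a dict."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip():
            values[name.strip()] = value.strip()
    return values


def read_cam_timeouts():
    """Return current timeout values, or {} where they cannot be read."""
    if shutil.which("sysctl") is None:
        return {}
    result = subprocess.run(
        ["sysctl", *CAM_TIMEOUT_SYSCTLS],
        capture_output=True,
        text=True,
        check=False,  # unknown names on non-FreeBSD hosts are not fatal
    )
    return parse_sysctl_output(result.stdout)


if __name__ == "__main__":
    for name, value in read_cam_timeouts().items():
        print(f"{name} = {value}")
```

Run as root or as a normal user; reading these values should not need privileges. Changing them (with `sysctl name=value`) does, and whether a different value helps with a given hang is a question for the relevant thread, not something this sketch answers.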