Date: Sun, 14 Apr 2013 12:44:40 -0700
From: Jeremy Chadwick <jdc@koitsu.org>
To: Zaphod Beeblebrox <zbeeble@gmail.com>
Subject: Re: A failed drive causes system to hang
Message-ID: <20130414194440.GB38338@icarus.home.lan>
References: <516A8092.2080002@o2.pl>
 <9C59759CB64B4BE282C1D1345DD0C78E@multiplay.co.uk>
 <516AF61B.7060204@o2.pl> <20130414185117.GA38259@icarus.home.lan>
 <CACpH0Mebufi5=bEsu6MF03NCn6gDmKkx-OP3sP14t3Xe3CXdpw@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CACpH0Mebufi5=bEsu6MF03NCn6gDmKkx-OP3sP14t3Xe3CXdpw@mail.gmail.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: freebsd-fs <freebsd-fs@freebsd.org>,
 Radio młodych bandytów
 <radiomlodychbandytow@o2.pl>, support@lists.pcbsd.org

On Sun, Apr 14, 2013 at 02:58:15PM -0400, Zaphod Beeblebrox wrote:
> I'd like to throw in my two cents here.  I've seen this (drives in RAID-1
> configuration) hanging whole systems.  Back in the IDE days, two drives
> were connected with one cable --- I largely wrote it off as a deficiency of
> IDE hardware and resolved to buy SCSI hardware for more important systems.
> Of late, the physical hardware for SCSI (SAS) and SATA drives has
> converged.  I'm willing to accept that SAS hardware may be built to a
> different standard, but I'm suspicious of the fact that a bad SATA drive on
> an ACH* controller can hang the whole system.

Note to readers: this is borderline off-topic and is going to confuse
the thread even more.  I will respond to this ONLY ONCE, and WILL NOT be
responding to this part of the thread past this point.

I have only seen this happen on very specific controllers (JMicron for
example), where either the AHCI driver was broken/badly written, or the
underlying AHCI option ROM/firmware code was broken/badly written.

> ... it's not complete, however.  Often pulling the drive's cable will
> unfreeze things.  It's also not entirely consistent.  Drives I have
> behind 4:1 port multipliers haven't (so far) hung the system that
> they're on (which uses ACH10).  Right now, I have a remote ACH10
> system that's hung hard a couple of times --- and it passes both its
> short and long SMART tests on both drives.

PMPs (port multipliers) are a *completely* separate beast, where some
AHCI controllers (at a silicon level) screw up/break.  In fact, the
IXP600/700 is one such controller, and workarounds had to be put into
FreeBSD and Linux for them.  I can dig up the commits if need be.

Rule of thumb (which you know -- this is for other readers): when using
a PMP, it's VERY IMPORTANT that this be disclosed up front.  PMPs add a
serious complication to analysis of the SATA subsystem as a whole, and
in a lot of cases visibility into details is lost as a result.  PMPs in
general are "bleh".

> Is there no global timeout we can depend on here?

Please see kern.cam.ada.default_timeout (for adaX devices) and
kern.cam.pmp.default_timeout (for I/O requests going across a PMP).
Beyond that, Alexander Motin (mav@) would be the guy to ask about PMP
issues; ideally, get him the hardware and provide a reliable
reproduction methodology for the issue.
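
For readers who want to check what those two timeouts are currently set
to, here is a minimal sketch in Python.  It assumes a FreeBSD host with
sysctl(8) in PATH.  The helper names parse_sysctl and read_cam_timeouts
are made up for illustration; this is not an official tool, just one way
to read the values.

```python
import subprocess

# The two timeout sysctls named above.
TIMEOUT_OIDS = (
    "kern.cam.ada.default_timeout",
    "kern.cam.pmp.default_timeout",
)


def parse_sysctl(text):
    """Parse sysctl output lines of the form 'name: value' into a dict.

    Lines without a colon or with a non-integer value are skipped.
    (Hypothetical helper, not part of any FreeBSD tooling.)
    """
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            values[name.strip()] = int(value)
    return values


def read_cam_timeouts():
    """Run sysctl(8) for both OIDs and return {name: value}.

    Assumes a FreeBSD system; on other systems the OIDs do not exist
    and subprocess.run() raises CalledProcessError.
    """
    out = subprocess.run(
        ["sysctl", *TIMEOUT_OIDS],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_sysctl(out)

# Usage on a FreeBSD box:
#   print(read_cam_timeouts())
```

As far as I know both values are expressed in seconds.  The parsing is
kept separate from the sysctl(8) call so it can be exercised on any
machine against captured output.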

All the above said:

Respectfully, please do not conflate your issue with this one.

Please start a new thread about this issue if you wish.  Do not reply to
this thread and change the Subject line; actually start a brand new
Email, to ensure no References headers are retained.

There is already too much crap going on in this thread with 4 different
people with what are 4 different issues, and nobody at this point is
able to keep track of it all (including the participants).

This situation happens way, WAY too often with storage-related matters
on the list.  ANYTHING ZFS-related and ANYTHING storage-related results
in bandwagon-jumping and threads that spiral out of control/become
almost useless and certainly impossible to follow.  It needs to stop.

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |