Date: Fri, 25 Jan 2013 00:36:19 -0800
From: Jeremy Chadwick <jdc@koitsu.org>
To: Alexander Motin <mav@FreeBSD.org>
Subject: Re: disk "flipped" - a known problem?
Message-ID: <20130125083619.GA51096@icarus.home.lan>
References: <20130121221617.GA23909@icarus.home.lan>
 <50FED818.7070704@FreeBSD.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <50FED818.7070704@FreeBSD.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: freebsd-fs@freebsd.org, avg@freebsd.org

On Tue, Jan 22, 2013 at 08:19:04PM +0200, Alexander Motin wrote:
> On 22.01.2013 00:16, Jeremy Chadwick wrote:
> > (Please keep me CC'd as I am not subscribed)
> > 
> > WRT this:
> > 
> > http://lists.freebsd.org/pipermail/freebsd-fs/2013-January/016197.html
> > 
> > I can reproduce the first problem 100% of the time on my home system
> > here.  I can provide hardware specs if needed, but the important part is
> > that I'm using RELENG_9 / r245697, the controller is an ICH9R in AHCI
> > mode (and does not share an IRQ), hot-swap bays are in use, and I'm
> > using ahci.ko.
> > 
> > I also want to make this clear to Andriy: I'm not saying "there's a
> > problem with your disk".  In my case, I KNOW there's a problem with the
> > disk (that's the entire point to my tests! :-) ).
> > 
> > In my case the disk is a WD Raptor (150GB, circa 2006) that has a very
> > badly-designed firmware that goes completely catatonic when encountering
> > certain sector-level conditions.  That's not the problem though -- the
> > problem is with FreeBSD apparently getting confused as to the internal
> > state of its devices after a device falls off the bus and comes back.
> > Explanation:
> > 
> > 1. System powered off; disk is attached; system powered on, shows up as
> > ada5.  Can communicate with device in every way (the way I tend to test
> > simple I/O is to use "smartctl -a /dev/ada5").  This disk has no
> > filesystems or other "stuff" on it -- it's just a raw disk, so I believe
> > the g_wither_washer oddity does not apply in this situation.
> > 
> > 2. "dd if=/dev/zero of=/dev/ada5 bs=64k"
> > 
> > 3. Drive hits a bad sector which it cannot remap/deal with.  Drive
> > firmware design flaw results in drive becoming 100% stuck trying to
> > re-read the sector and work out internal decisions to do remapping or
> > not.  Drive audibly clicking during this time (not the noise of the
> > actuator arm being reset to track 0; some other mechanical issue).  Due
> > to firmware issue, drive remains in this state indefinitely.
> > 
> > 4. FreeBSD CAM reports repeated WRITE_FPDMA_QUEUED (i.e. writes using NCQ)
> > errors every 30 seconds (kern.cam.ada.default_timeout), for a total of 5
> > times (kern.cam.ada.retry_count+1).
> > 
> > 5. FreeBSD spits out messages similar to the ones you see; retries
> > exhausted, cam_periph_alloc error, and devfs claims device removal.
> > 
> > 6. Drive is still catatonic of course.  Only way to reset the drive is
> > to power-cycle it.  Drive removed from hot-swap bay, let sit for 20
> > seconds, then is reinserted.
> > 
> > 7. FreeBSD sees the disk reappear, shows up much like it did during #1,
> > except...
> > 
> > 8. "smartctl -a /dev/ada5" claims no such device or unknown device type
> > (I forget which).  "ls -l /dev/ada5" shows an entry.  "camcontrol
> > devlist" shows the disk on the bus, yet I/O does not work.  If I
> > remember right, re-attempting the dd command returns some error (I
> > forget which).
> > 
> > 9. "camcontrol rescan all" stalls for quite some time when trying to
> > communicate with entry 5, but eventually does return (I think with some
> > error).  "camcontrol reset all" works without a hitch.  "camcontrol
> > devlist" during this time shows the same disk on ada5 (which to me means
> > ATA IDENTIFY, i.e. vendor strings, etc. are reobtained somehow, meaning
> > I/O works at some level).
> > 
> > 10. System otherwise works fine, but the only way to bring back
> > usability of ada5 is to reboot ("shutdown -r now").
> > 
> > To me, this looks like FreeBSD at some layer within the kernel (or some
> > driver (I don't know which)) is internally confused about the true state
> > of things.
> > 
> > Alexander, do you have any ideas?
> > 
> > I can enable CAM debugging (I do use options CAMDEBUG so I can toggle
> > this with camcontrol) as well as take notes and do a full step-by-step
> > diagnosis (along with relevant kernel output seen during each phase) if
> > that would help you.  And I can test patches but not against -CURRENT
> > (will be a cold day in hell before I run that, sorry).
> 
> Command timeout itself is not a reason for the AHCI driver to drop the
> disk, nor is it for CAM in the case of payload requests. A disk can be
> dropped if the controller reports device absence detected by the SATA
> PHY, or on errors during device reinitialization after a reset by the
> CAM SATA XPT.

I have some theories as to why this is happening, and they relate to the
underlying design of the drive firmware and the drive controller used.
I could write some pseudo-code showing how I believe the drive behaves,
but it's really beside the point, as you point out below.

> What is interesting is what exactly goes on after the disk got stuck and
> you have removed it. In the normal case the controller should immediately
> report a PHY status change, and the driver should run a PHY reset and see
> that the link is lost. That should trigger a bus rescan for CAM, which
> should invalidate the device. That should make dd abort with an error.
> After dd is gone, the device should be destroyed and ready for
> reattachment.

Yup, that sounds exactly like what should happen.  I know that in
userland (dd) the command eventually does abort/fail with an error (I
believe I/O error or some other message), and that's good.  The device
disappearing can also be confirmed.  It's after the drive is
power-cycled (to bring it back online) that things go wrong: it's
re-tasted and I/O (at the kernel level) works, but userland utilities
interfacing with /dev/ada5 now insist "unknown device" or "no such
device".  It's easier to show than it is to explain.  My theory is that
there is some kind of internal (kernel-level) "state" that is not being
reset correctly when a device is lost and then brought back.
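
To make that theory a bit more concrete, here is a toy model of the
sequence you describe, plus the failure I suspect.  This is purely
illustrative: ToyDisk, State, replay() and the teardown_completes switch
are names I made up, and none of this is the real CAM or ahci(4) code.
The assumption baked in is that a device can only come back usable if
the old instance was fully destroyed first.

```python
from enum import Enum, auto


class State(Enum):
    ATTACHED = auto()     # usable: opens succeed
    LINK_LOST = auto()    # PHY reported that the link went away
    INVALIDATED = auto()  # bus rescan marked the device invalid
    DESTROYED = auto()    # torn down, ready for reattachment


class ToyDisk:
    """Hypothetical model of one disk; NOT the real CAM/ahci(4) code."""

    def __init__(self, name):
        self.name = name
        self.state = State.ATTACHED
        self.on_bus = True   # what a bus listing would show
        self.openers = 0     # userland holders such as dd

    def open(self):
        if self.state is not State.ATTACHED:
            raise OSError(f"{self.name}: no such device")
        self.openers += 1

    def close(self):
        self.openers -= 1
        self._destroy_if_unused()

    def phy_link_lost(self):
        self.on_bus = False
        self.state = State.LINK_LOST

    def bus_rescan(self):
        if self.state is State.LINK_LOST:
            self.state = State.INVALIDATED
            self._destroy_if_unused()

    def reattach(self):
        self.on_bus = True
        # Assumption: only a fully destroyed device comes back usable.
        if self.state is State.DESTROYED:
            self.state = State.ATTACHED

    def _destroy_if_unused(self):
        if self.state is State.INVALIDATED and self.openers == 0:
            self.state = State.DESTROYED


def replay(teardown_completes):
    """Run the pull/reinsert sequence; return (on_bus, can_open)."""
    disk = ToyDisk("ada5")
    disk.open()            # dd starts writing
    disk.phy_link_lost()   # drive pulled from the bay
    disk.bus_rescan()
    if teardown_completes:
        disk.close()       # dd aborts, last reference goes away
    disk.reattach()        # drive reinserted
    try:
        disk.open()
        return disk.on_bus, True
    except OSError:
        return disk.on_bus, False
```

replay(True) follows the normal path and ends with a usable device.
replay(False) leaves the old instance half torn down, so the disk is
visible on the bus again while opens from userland keep failing, which
is roughly the symptom I'm seeing.  Whether the real cause looks
anything like this is exactly what the debug output should tell us.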

> So it would be great if you started with the full verbose dmesg from
> boot up to the moment when the system becomes stable after disk removal.
> If that is not enough, we can enable some more debugging with `camcontrol
> debug -IPXp BUS`, where BUS is the bus number from `camcontrol devlist`.

This is exactly what I needed; thank you!

I'll spend some time tomorrow collecting the data + documenting and will
provide the results once I've compiled them.  This will be more useful
than speculation on my part.
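
For what it's worth, this is roughly how I'd script the collection so
each phase is captured the same way.  It's a sketch under assumptions:
capture_plan(), run_plan() and probe_node() are helper names I invented,
the log file naming is arbitrary, and the bus number has to be read off
"camcontrol devlist" by hand.  The only commands it issues are dmesg,
"camcontrol devlist -v" and the "camcontrol debug -IPXp BUS" invocation
you gave.

```python
import errno
import os
import shutil
import subprocess


def capture_plan(bus):
    """Return the commands to run, in order, as argument lists."""
    return [
        ["dmesg"],
        ["camcontrol", "devlist", "-v"],
        ["camcontrol", "debug", "-IPXp", str(bus)],
    ]


def run_plan(plan, logdir):
    """Run each command and save its output; return the files written."""
    os.makedirs(logdir, exist_ok=True)
    written = []
    for index, argv in enumerate(plan):
        if shutil.which(argv[0]) is None:
            continue  # tool not installed on this host; skip it
        result = subprocess.run(argv, capture_output=True, text=True)
        path = os.path.join(logdir, f"{index:02d}-{'_'.join(argv)}.log")
        with open(path, "w") as log:
            log.write(result.stdout)
            log.write(result.stderr)
        written.append(path)
    return written


def probe_node(path):
    """Try to open a device node read-only and report what happened."""
    if not os.path.exists(path):
        return "missing"
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        # Report the symbolic errno name, e.g. ENXIO or ENOENT.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return "ok"
```

The idea would be to call run_plan(capture_plan(BUS), "phase-N") as root
at each step, and probe_node("/dev/ada5") after the drive is reinserted
to record which errno userland actually gets.  The debug step only works
on a kernel built with options CAMDEBUG, and a verbose dmesg needs the
system to have been booted verbosely in the first place.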

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |