Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 22 Feb 2013 19:57:08 -0800
From:      Jeremy Chadwick <jdc@koitsu.org>
To:        Alexander Motin <mav@FreeBSD.org>
Cc:        freebsd-fs@freebsd.org, avg@freebsd.org
Subject:   Re: disk "flipped" - a known problem?
Message-ID:  <20130223035708.GA23614@icarus.home.lan>
In-Reply-To: <20130221233609.GA92249@icarus.home.lan>
References:  <20130121221617.GA23909@icarus.home.lan> <50FED818.7070704@FreeBSD.org> <20130125083619.GA51096@icarus.home.lan> <20130125211232.GA3037@icarus.home.lan> <20130125212559.GA1772@icarus.home.lan> <20130125213209.GA1858@icarus.home.lan> <20130126011754.GA1806@icarus.home.lan> <51267055.3040500@FreeBSD.org> <20130221233609.GA92249@icarus.home.lan>

next in thread | previous in thread | raw e-mail | index | archive | help
On Thu, Feb 21, 2013 at 03:36:09PM -0800, Jeremy Chadwick wrote:
> On Thu, Feb 21, 2013 at 09:07:01PM +0200, Alexander Motin wrote:
> > On 26.01.2013 03:17, Jeremy Chadwick wrote:
> > > Okay, I've figured out the exact, 100% reproducible condition that
> > > causes the situation.  It took me a lot of tries and a digital pocket
> > > recorder to take verbal notes (there are just too many things to look at
> > > simultaneously), but I've figured it out.
> > > 
> > > I'm sorry for the verbosity, but it's necessary.
> > > 
> > > Assume the disk we're talking about is /dev/ada5.
> > > 
> > > 1. Prior to any issues, we have this:
> > > 
> > > root@icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5*
> > > crw-r-----  1 root  operator  0x8c Jan 25 16:41 /dev/ada5
> > > crw-------  1 root  operator  0x75 Jan 25 16:35 /dev/pass5
> > > crw-------  1 root  operator  0x51 Jan 25 16:35 /dev/xpt0
> > > 
> > > 2. ada5 begins experiencing issues -- ATA commands (CDBs) submit do not
> > > get a response (not going to discuss how/why that can happen).
> > > 
> > > 3. These types of messages are seen on console (naturally the CDB and
> > > request type will vary -- in this case it was because I was doing the dd
> > > zero'ing, thus tickling the bad sector/naughty firmware on the drive):
> > > 
> > > Jan 25 16:29:28 icarus kernel: ahcich5: Timeout on slot 0 port 0
> > > Jan 25 16:29:28 icarus kernel: ahcich5: is 00000000 cs 00000000 ss 00000001 rs 00000001 tfd 40 serr 00000000 cmd 0004c017
> > > Jan 25 16:29:28 icarus kernel: ahcich5: AHCI reset...
> > > Jan 25 16:29:28 icarus kernel: ahcich5: SATA connect time=1000us status=00000113
> > > Jan 25 16:29:28 icarus kernel: ahcich5: AHCI reset: device found
> > > Jan 25 16:29:28 icarus kernel: (ada5:ahcich5:0:0:0): WRITE_FPDMA_QUEUED.  ACB: 61 80 80 77 01 40 00 00 00 00 00 00
> > > Jan 25 16:29:28 icarus kernel: (ada5:ahcich5:0:0:0): CAM status: Command timeout
> > > Jan 25 16:29:28 icarus kernel: (ada5:ahcich5:0:0:0): Retrying command
> > > 
> > > 4. Any I/O submit to ada5 during this time blocks (this is normal).
> > > 
> > > 5. **While this situation is happening**, something using xpt(4)
> > > attempts to submit a CDB to the disk (ex. smartctl -a /dev/ada5).
> > > This request also blocks (again, normal).
> > > 
> > > 6. Physical device falls off bus, or CAM kicks the disk off the bus.
> > > Doesn't matter which.  We see messages resembling this (boy am I tired
> > > of this interspersed output problem):
> > > 
> > > Jan 25 16:29:32 icarus kernel: (ada5:ahcich5:0:0:0): lost device
> > > Jan 25 16:29:32 icarus kernel: (pass5:ahcich5:0:0:0): lost device
> > > Jan 25 16:29:32 icarus kernel: (ada5:ahcich5:0:0:0): removing device entry
> > > Jan 25 16:29:32 icarus kernel: (pass5:ahcich5:0:0:0): passdevgonecb: devfs entry is gone
> > > 
> > > 7. Standard I/O requests fail with errno=6 "Device not configured".
> > > xpt(4) requests also fail with the same errno.
> > > 
> > > 8. Device-wise, at this stage all we have is:
> > > 
> > > root@icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5*
> > > crw-------  1 root  operator  0x51 Jan 25 16:35 /dev/xpt0
> > > 
> > > 9. Device comes back online for whatever reason.  FreeBSD sees the disk,
> > > blah blah blah:
> > > 
> > > Jan 25 16:30:16 icarus kernel: GEOM: new disk ada5
> > > Jan 25 16:30:16 icarus kernel: ada5: <WDC WD1500ADFD-00NLR4 21.07QR4> ATA-7 SATA 1.x device
> > > Jan 25 16:30:16 icarus kernel: ada5: Serial Number WD-WMAP41573589
> > > Jan 25 16:30:16 icarus kernel: ada5: 150.000MB/s transfers (SATA 1.x, UDMA6, PIO 8192bytes)
> > > Jan 25 16:30:16 icarus kernel: ada5: Command Queueing enabled
> > > Jan 25 16:30:16 icarus kernel: ada5: 143089MB (293046768 512 byte sectors: 16H 63S/T 16383C)
> > > Jan 25 16:30:16 icarus kernel: ada5: Previously was known as ad14
> > > 
> > > ...um, where's pass5?
> > > 
> > > 10. /dev/pass5 is now completely (permanently) missing:
> > > 
> > > root@icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5*
> > > crw-r-----  1 root  operator  0x99 Jan 25 16:42 /dev/ada5
> > > crw-------  1 root  operator  0x51 Jan 25 16:35 /dev/xpt0
> > > 
> > > 11. Any further attempts to communicate via xpt(4) with ada5 fail.
> > > Detaching and reattaching the disk does not fix the issue; the only fix
> > > is to reboot the system.
> > > 
> > > 12. "camcontrol debug -IPXp scbus5" results in tons and tons of output
> > > all pertaining to xpt(4).  It looks like xpt(4) is in some kind of
> > > loop.
> > > 
> > > Below is my verbose boot (with non-kernel things removed), which
> > > also includes "camcontrol debug" output once things are in a bad state:
> > > 
> > > http://jdc.koitsu.org/freebsd/xpt_oddity.log
> > > 
> > > In this log you'll see that after 1 CAM timeout I yanked the drive, then
> > > roughly 30 seconds later reinserted it.
> > > 
> > > If you need me to turn on CAM debugging *prior* to the above, I can do
> > > that, just let me know.
> > > 
> > > The important step is #5.  Without that, the problem shown in #9/10/11
> > > does not happen.
> > > 
> > > It's a good thing I don't run smartd(8) -- most users I see using that
> > > software set the interval to something like 180s or 60s.  Imagine this
> > > frustration: "okay so the disk fell off the bus, but what, now I can't
> > > talk to it with SMART?  Uhhh... <reboots>  Err, works now?  Whatever".
> > 
> > I think, the problem may already be fixed in HEAD by r244014 by ken@.
> > I've just merged it to 9-STABLE at r247115. So if it is still possible
> > to reproduce the situation, it would be good to try.
> 
> Yep, I saw the commit per svn-src-stable-9@freebsd.org, along with
> a bunch of others; I wasn't sure if r247114 or r247115 fixed it, so ws
> waiting for a follow-up from you.  :-)
> 
> I'll rebuild world/kernel and try it out + report back.  Thank you (and
> ken@ too!) for the work on this.

Got around to this today -- I can confirm as of r247132 on stable/9 the
above problem is gone.  Verification details below, for those who care:

Initial attachment:

Feb 22 19:40:32 icarus kernel: ada5 at ahcich5 bus 0 scbus5 target 0 lun 0
Feb 22 19:40:32 icarus kernel: ada5: <ST3750630AS HP24> ATA-8 SATA 2.x device
Feb 22 19:40:32 icarus kernel: ada5: 300.000MB/s transfers (SATA 2.x, UDMA5, PIO 8192bytes)
Feb 22 19:40:32 icarus kernel: ada5: Command Queueing enabled
Feb 22 19:40:32 icarus kernel: ada5: 715404MB (1465149168 512 byte sectors: 16H 63S/T 16383C)
Feb 22 19:40:32 icarus kernel: ada5: Previously was known as ad14

Ran dd if=/dev/zero of=/dev/ada5 bs=64k.  Timeouts occurring due to
physical issues with the disk itself:

Feb 22 19:44:01 icarus kernel: ahcich5: Timeout on slot 0 port 0
Feb 22 19:44:01 icarus kernel: ahcich5: is 00000000 cs 00000000 ss 00000001 rs 00000001 tfd 40 serr 00000000 cmd 0004c017
Feb 22 19:44:01 icarus kernel: (ada5:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 80 00 0c 00 40 00 00 00 00 00 00
Feb 22 19:44:01 icarus kernel: (ada5:ahcich5:0:0:0): CAM status: Command timeout
Feb 22 19:44:01 icarus kernel: (ada5:ahcich5:0:0:0): Retrying command
Feb 22 19:44:33 icarus kernel: ahcich5: Timeout on slot 0 port 0
Feb 22 19:44:33 icarus kernel: ahcich5: is 00000000 cs 00000000 ss 00000001 rs 00000001 tfd 40 serr 00000000 cmd 0004c017
Feb 22 19:44:33 icarus kernel: (ada5:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 80 00 18 00 40 00 00 00 00 00 00
Feb 22 19:44:33 icarus kernel: (ada5:ahcich5:0:0:0): CAM status: Command timeout
Feb 22 19:44:33 icarus kernel: (ada5:ahcich5:0:0:0): Retrying command

Initiated smartctl -a /dev/ada5, which blocked as expected.  Timeouts
still happening, and to speed up the process I yanked the disk:

Feb 22 19:45:32 icarus kernel: ahcich5: Timeout on slot 0 port 0
Feb 22 19:45:32 icarus kernel: ahcich5: is 00000000 cs 00000000 ss 00000001 rs 00000001 tfd 40 serr 00000000 cmd 0004c017
Feb 22 19:45:32 icarus kernel: (ada5:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 80 80 25 00 40 00 00 00 00 00 00
Feb 22 19:45:32 icarus kernel: (ada5:ahcich5:0:0:0): CAM status: Command timeout
Feb 22 19:45:32 icarus kernel: (ada5:ahcich5:0:0:0): Retrying command
Feb 22 19:45:55 icarus kernel: (ada5:ahcich5:0:0:0): lost device
Feb 22 19:45:55 icarus kernel: (ada5:ahcich5:0:0:0): removing device entry

After yanking:

root@icarus:~ # smartctl -a /dev/ada5
smartctl 6.0 2012-10-10 r3643 [FreeBSD 9.1-STABLE amd64] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

(pass5:ahcich5:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 01 00
(pass5:ahcich5:0:0:0): CAM status: Unconditionally Re-queue Request
smartctl: cam_send_ccb: Device not configured

Checking devices:

root@icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5*
crw-------  1 root  operator  0x52 Feb 21 21:37 /dev/xpt0

Reinserted disk:

Feb 22 19:47:53 icarus kernel: ada5 at ahcich5 bus 0 scbus5 target 0 lun 0
Feb 22 19:47:53 icarus kernel: ada5: <ST3750630AS HP24> ATA-8 SATA 2.x device
Feb 22 19:47:53 icarus kernel: ada5: 300.000MB/s transfers (SATA 2.x, UDMA5, PIO 8192bytes)
Feb 22 19:47:53 icarus kernel: ada5: Command Queueing enabled
Feb 22 19:47:53 icarus kernel: ada5: 715404MB (1465149168 512 byte sectors: 16H 63S/T 16383C)
Feb 22 19:47:53 icarus kernel: ada5: Previously was known as ad14

Devices look good:

root@icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5*
crw-r-----  1 root  operator  0x98 Feb 22 19:47 /dev/ada5
crw-------  1 root  operator  0x96 Feb 22 19:47 /dev/pass5
crw-------  1 root  operator  0x52 Feb 21 21:37 /dev/xpt0

And smartctl works fine.  :-)

(Footnote for readers: The previous WD disk I was testing with went
belly up in an even worse way so I couldn't use it, but thankfully I've
lots of bad disks that exhibit repeated timeouts during I/O.  *pats his
angry ST3750630AS*)

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20130223035708.GA23614>