Date: Thu, 21 Feb 2013 21:07:01 +0200 From: Alexander Motin <mav@FreeBSD.org> To: Jeremy Chadwick <jdc@koitsu.org> Cc: freebsd-fs@freebsd.org, avg@freebsd.org Subject: Re: disk "flipped" - a known problem? Message-ID: <51267055.3040500@FreeBSD.org> In-Reply-To: <20130126011754.GA1806@icarus.home.lan> References: <20130121221617.GA23909@icarus.home.lan> <50FED818.7070704@FreeBSD.org> <20130125083619.GA51096@icarus.home.lan> <20130125211232.GA3037@icarus.home.lan> <20130125212559.GA1772@icarus.home.lan> <20130125213209.GA1858@icarus.home.lan> <20130126011754.GA1806@icarus.home.lan>
next in thread | previous in thread | raw e-mail | index | archive | help
On 26.01.2013 03:17, Jeremy Chadwick wrote: > Okay, I've figured out the exact, 100% reproducible condition that > causes the situation. It took me a lot of tries and a digital pocket > recorder to take verbal notes (there are just too many things to look at > simultaneously), but I've figured it out. > > I'm sorry for the verbosity, but it's necessary. > > Assume the disk we're talking about is /dev/ada5. > > 1. Prior to any issues, we have this: > > root@icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5* > crw-r----- 1 root operator 0x8c Jan 25 16:41 /dev/ada5 > crw------- 1 root operator 0x75 Jan 25 16:35 /dev/pass5 > crw------- 1 root operator 0x51 Jan 25 16:35 /dev/xpt0 > > 2. ada5 begins experiencing issues -- ATA commands (CDBs) submit do not > get a response (not going to discuss how/why that can happen). > > 3. These types of messages are seen on console (naturally the CDB and > request type will vary -- in this case it was because I was doing the dd > zero'ing, thus tickling the bad sector/naughty firmware on the drive): > > Jan 25 16:29:28 icarus kernel: ahcich5: Timeout on slot 0 port 0 > Jan 25 16:29:28 icarus kernel: ahcich5: is 00000000 cs 00000000 ss 00000001 rs 00000001 tfd 40 serr 00000000 cmd 0004c017 > Jan 25 16:29:28 icarus kernel: ahcich5: AHCI reset... > Jan 25 16:29:28 icarus kernel: ahcich5: SATA connect time=1000us status=00000113 > Jan 25 16:29:28 icarus kernel: ahcich5: AHCI reset: device found > Jan 25 16:29:28 icarus kernel: (ada5:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 80 80 77 01 40 00 00 00 00 00 00 > Jan 25 16:29:28 icarus kernel: (ada5:ahcich5:0:0:0): CAM status: Command timeout > Jan 25 16:29:28 icarus kernel: (ada5:ahcich5:0:0:0): Retrying command > > 4. Any I/O submit to ada5 during this time blocks (this is normal). > > 5. **While this situation is happening**, something using xpt(4) > attempts to submit a CDB to the disk (ex. smartctl -a /dev/ada5). > This request also blocks (again, normal). > > 6. Physical device falls off bus, or CAM kicks the disk off the bus. > Doesn't matter which. We see messages resembling this (boy am I tired > of this interspersed output problem): > > Jan 25 16:29:32 icarus kernel: (ada5:ahcich5:0:0:0): lost device > Jan 25 16:29:32 icarus kernel: (pass5:ahcich5:0:0:0): lost device > Jan 25 16:29:32 icarus kernel: (ada5:ahcich5:0:0:0): removing device entry > Jan 25 16:29:32 icarus kernel: (pass5:ahcich5:0:0:0): passdevgonecb: devfs entry is gone > > 7. Standard I/O requests fail with errno=6 "Device not configured". > xpt(4) requests also fail with the same errno. > > 8. Device-wise, at this stage all we have is: > > root@icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5* > crw------- 1 root operator 0x51 Jan 25 16:35 /dev/xpt0 > > 9. Device comes back online for whatever reason. FreeBSD sees the disk, > blah blah blah: > > Jan 25 16:30:16 icarus kernel: GEOM: new disk ada5 > Jan 25 16:30:16 icarus kernel: ada5: <WDC WD1500ADFD-00NLR4 21.07QR4> ATA-7 SATA 1.x device > Jan 25 16:30:16 icarus kernel: ada5: Serial Number WD-WMAP41573589 > Jan 25 16:30:16 icarus kernel: ada5: 150.000MB/s transfers (SATA 1.x, UDMA6, PIO 8192bytes) > Jan 25 16:30:16 icarus kernel: ada5: Command Queueing enabled > Jan 25 16:30:16 icarus kernel: ada5: 143089MB (293046768 512 byte sectors: 16H 63S/T 16383C) > Jan 25 16:30:16 icarus kernel: ada5: Previously was known as ad14 > > ...um, where's pass5? > > 10. /dev/pass5 is now completely (permanently) missing: > > root@icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5* > crw-r----- 1 root operator 0x99 Jan 25 16:42 /dev/ada5 > crw------- 1 root operator 0x51 Jan 25 16:35 /dev/xpt0 > > 11. Any further attempts to communicate via xpt(4) with ada5 fail. > Detaching and reattaching the disk does not fix the issue; the only fix > is to reboot the system. > > 12. "camcontrol debug -IPXp scbus5" results in tons and tons of output > all pertaining to xpt(4). It looks like xpt(4) is in some kind of > loop. > > Below is my verbose boot (with non-kernel things removed), which > also includes "camcontrol debug" output once things are in a bad state: > > http://jdc.koitsu.org/freebsd/xpt_oddity.log > > In this log you'll see that after 1 CAM timeout I yanked the drive, then > roughly 30 seconds later reinserted it. > > If you need me to turn on CAM debugging *prior* to the above, I can do > that, just let me know. > > The important step is #5. Without that, the problem shown in #9/10/11 > does not happen. > > It's a good thing I don't run smartd(8) -- most users I see using that > software set the interval to something like 180s or 60s. Imagine this > frustration: "okay so the disk fell off the bus, but what, now I can't > talk to it with SMART? Uhhh... <reboots> Err, works now? Whatever". I think, the problem may already be fixed in HEAD by r244014 by ken@. I've just merged it to 9-STABLE at r247115. So if it is still possible to reproduce the situation, it would be good to try. -- Alexander Motin
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?51267055.3040500>