Date: Fri, 25 Jan 2013 13:25:59 -0800 From: Jeremy Chadwick <jdc@koitsu.org> To: Alexander Motin <mav@FreeBSD.org> Cc: freebsd-fs@freebsd.org, avg@freebsd.org Subject: Re: disk "flipped" - a known problem? Message-ID: <20130125212559.GA1772@icarus.home.lan> In-Reply-To: <20130125211232.GA3037@icarus.home.lan> References: <20130121221617.GA23909@icarus.home.lan> <50FED818.7070704@FreeBSD.org> <20130125083619.GA51096@icarus.home.lan> <20130125211232.GA3037@icarus.home.lan>
next in thread | previous in thread | raw e-mail | index | archive | help
On Fri, Jan 25, 2013 at 01:12:32PM -0800, Jeremy Chadwick wrote: > On Fri, Jan 25, 2013 at 12:36:19AM -0800, Jeremy Chadwick wrote: > > On Tue, Jan 22, 2013 at 08:19:04PM +0200, Alexander Motin wrote: > > > On 22.01.2013 00:16, Jeremy Chadwick wrote: > > > > (Please keep me CC'd as I am not subscribed) > > > > > > > > WRT this: > > > > > > > > http://lists.freebsd.org/pipermail/freebsd-fs/2013-January/016197.html > > > > > > > > I can reproduce the first problem 100% of the time on my home system > > > > here. I can provide hardware specs if needed, but the important part is > > > > that I'm using RELENG_9 / r245697, the controller is an ICH9R in AHCI > > > > mode (and does not share an IRQ), hot-swap bays are in use, and I'm > > > > using ahci.ko. > > > > > > > > I also want to make this clear to Andriy: I'm not saying "there's a > > > > problem with your disk". In my case, I KNOW there's a problem with the > > > > disk (that's the entire point to my tests! :-) ). > > > > > > > > In my case the disk is a WD Raptor (150GB, circa 2006) that has a very > > > > badly-designed firmware that goes completely catatonic when encountering > > > > certain sector-level conditions. That's not the problem though -- the > > > > problem is with FreeBSD apparently getting confused as to the internal > > > > state of its devices after a device falls off the bus and comes back. > > > > Explanation: > > > > > > > > 1. System powered off; disk is attached; system powered on, shows up as > > > > ada5. Can communicate with device in every way (the way I tend to test > > > > simple I/O is to use "smartctl -a /dev/ada5"). This disk has no > > > > filesystems or other "stuff" on it -- it's just a raw disk, so I believe > > > > the g_wither_washer oddity does not apply in this situation. > > > > > > > > 2. "dd if=/dev/zero of=/dev/ada5 bs=64k" > > > > > > > > 3. Drive hits a bad sector which it cannot remap/deal with. Drive > > > > firmware design flaw results in drive becoming 100% stuck trying to > > > > re-read the sector and work out internal decisions to do remapping or > > > > not. Drive audibly clicking during this time (not actuator arm being > > > > reset to track 0 noise; some other mechanical issue). Due to firmware > > > > issue, drive remains in this state indefinitely. > > > > > > > > 4. FreeBSD CAM reports repeated WRITE_FPDMA_QUEUED (i.e. writes using NCQ) > > > > errors every 30 seconds (kern.cam.ada.default_timeout), for a total of 5 > > > > times (kern.cam.da.retry_count+1). > > > > > > > > 5. FreeBSD spits out similar messages you see; retries exhausted, > > > > cam_periph_alloc error, and devfs claims device removal. > > > > > > > > 6. Drive is still catatonic of course. Only way to reset the drive is > > > > to power-cycle it. Drive removed from hot-swap bay, let sit for 20 > > > > seconds, then is reinserted. > > > > > > > > 7. FreeBSD sees the disk reappear, shows up much like it did during #1, > > > > except... > > > > > > > > 8. "smartctl -a /dev/ada5" claims no such device or unknown device type > > > > (I forget which). "ls -l /dev/ada5" shows an entry. "camcontrol > > > > devlist" shows the disk on the bus, yet I/O does not work. If I > > > > remember right, re-attempting the dd command returns some error (I > > > > forget which). > > > > > > > > 9. "camcontrol rescan all" stalls for quite some time when trying to > > > > communicate with entry 5, but eventually does return (I think with some > > > > error). camcontrol reset all" works without a hitch. "camcontrol > > > > devlist" during this time shows the same disk on ada5 (which to me means > > > > ATA IDENTIFY, i.e. vendor strings, etc. are reobtained somehow, meaning > > > > I/O works at some level). > > > > > > > > 10. System otherwise works fine, but the only way to bring back > > > > usability of ada5 is to reboot ("shutdown -r now"). > > > > > > > > To me, this looks like FreeBSD at some layer within the kernel (or some > > > > driver (I don't know which)) is internally confused about the true state > > > > of things. > > > > > > > > Alexander, do you have any ideas? > > > > > > > > I can enable CAM debugging (I do use options CAMDEBUG so I can toggle > > > > this with camcontrol) as well as take notes and do a full step-by-step > > > > diagnosis (along with relevant kernel output seen during each phase) if > > > > that would help you. And I can test patches but not against -CURRENT > > > > (will be a cold day in hell before I run that, sorry). > > > > > > Command timeout itself is not a reason for AHCI driver to drop the disk, > > > neither it is for CAM in case of payload requests. Disk can be dropped > > > if controller report device absence detected by SATA PHY, or by errors > > > during device reinitialization after reset by CAM SATA XPT. > > > > I have some theories as to why this is happening and it relates to the > > underlying design of the drive firmware and the drive controller used. > > I could write some pseudo-code showing how I believe the drive behaves, > > but it's really besides the point, as you point out below. > > > > > What is interesting, is what exactly goes on after disk got stuck and > > > you have removed it. In normal case controller should immediately report > > > PHY status change, driver should run PHY reset and see that link is > > > lost. It should trigger bus rescan for CAM, that should invalidate > > > device. That should make dd abort with error. After dd gone, device > > > should be destroyed and ready for reattachment. > > > > Yup, that sounds exactly like what should happen. I know that in > > userland (dd) the command eventually does abort/fail with an error (I > > believe I/O error or some other message), and that's good. The device > > disappearing can also be confirmed. It's after the drive is > > power-cycled (to bring it back online) where its re-tasted and I/O (at > > the kernel level) works, but now userland utilities interfacing with > > /dev/ada5 insist "unknown device" or "no such device". It's easier to > > show than it is to explain. My theory is that there is some kind of > > internal (kernel-level) "state" that is not being reset correctly when a > > device is lost and then brought back. > > > > > So it should be great if you start with the full verbose dmesg from the > > > boot up to the moment when system becomes stable after disk removal. If > > > it won't be enough, we can enable some more debugging with `camcontrol > > > debug -IPXp BUS`, where BUS is the bus number from `camcontrol devlist`. > > > > This is exactly what I needed; thank you! > > > > I'll spend some time tomorrow collecting the data + documenting and will > > provide the results once I've compiled them. This will be more useful > > than speculation on my part. > > Finished. > > http://jdc.koitsu.org/freebsd/ahci_cam_testing/ > > I was not able to get the cam_periph_alloc error message; I'll talk > about that at the end of my mail in attempt to stay focused on what I > did find / what I was able to reproduce. > > Things start to get interesting around phase 23. > > Phase 31 is where things are confirmed broken in some way ("no such file > or directory" even though /dev/ada5 is there). > > Direct I/O to /dev/ada5 still works (shown in phase 33), but smartctl > ceases to work ***to that device only*** from then onward (e.g. smartctl > ada0 works fine). A reboot is needed to recover from this. > > I'm aware that smartmontools uses xpt(4), and I think therein is where > the issue is. The only difference between the tests/phases is that I > issued "camcontrol reset" and "camcontrol rescan" prior to the breakage. > > Based on CAMDEBUG output in phase 36 it looks like xpt(4) is spinning on > something internally and causing what I'm reporting. I can reproduce it > reliably at least. > > Let me know what else I can do to help. If I need to turn on CAMDEBUG > and re-do some of the tests + provide full kernel/CAMDEBUG output during > each phase, no problem, just let me know what you need. I just hate > risking interspersed kernel printf output... I just realised that the numbering scheme I was using for the phases is *completely* buggered. I obviously made a manual typo at some point and it just proliferated from there. I'll see if I can figure out where my mistake was and clean it up. -- | Jeremy Chadwick jdc@koitsu.org | | UNIX Systems Administrator http://jdc.koitsu.org/ | | Mountain View, CA, US | | Making life hard for others since 1977. PGP 4BD6C0CB |
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20130125212559.GA1772>