Date: Thu, 16 Aug 2012 19:10:07 GMT
From: Dieter <freebsd@sopwith.solgatos.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: kern/170675: ata(4) hangs system, causing data loss
Message-ID: <201208161910.q7GJA7iB050158@red.freebsd.org>
Resent-Message-ID: <201208161920.q7GJK8wC090310@freefall.freebsd.org>

>Number:         170675
>Category:       kern
>Synopsis:       ata(4) hangs system, causing data loss
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Aug 16 19:20:08 UTC 2012
>Closed-Date:
>Last-Modified:
>Originator:     Dieter
>Release:        FreeBSD 8.2 amd64
>Organization:
>Environment:
FreeBSD 8.2 amd64
>Description:
FreeBSD 8.2 amd64
<nVidia nForce CK804 SATA300 controller>
ad6 is a vanilla SATA drive

/var/log/messages contains:

ad6: FAILURE - device detached

No other clues are provided. It would be useful if ata(4) told
us *why* it decided to detach the drive.

Over 24 hours later, the system suddenly hung, for no obvious reason.
Thinking that perhaps ata(4) was having some new problem with ad6,
I unplugged ad6's data cable. The system then recovered. However,
the system was completely hung for 19 minutes, and perhaps would have
remained hung forever without manual intervention.

THIS RESULTED IN THE UNNECESSARY LOSS OF INCOMING DATA!
COMPLETELY UNACCEPTABLE!

Other than the device detached message, ata(4) did not output any
info at all about this problem.

There is no reason that ata(4) should have to hang the entire
system for even a millisecond, much less 19 minutes, just because
it is having some problem with one disk drive. (ad6 contained only
user data, no system partitions or swap.)

News Flash: hardware isn't perfect and never will be. Hardware
sometimes hiccups or fails altogether. FreeBSD needs to deal with
failures gracefully and continue servicing the remaining hardware.
The phrase "can't walk and chew gum at the same time" comes to mind.

I suspect that ata(4) turned off ALL interrupts (why all of them?
why not just turn off interrupts for the device being serviced?)
and then went into an infinite loop.
>How-To-Repeat:
>Fix:
(1) Find the offending infinite loop (or whatever) in ata(4) and fix it.

(2) Don't turn off all interrupts, just turn off interrupts for
the device being serviced.
>Release-Note:
>Audit-Trail:
>Unformatted:
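As an illustration of Fix item (1), here is a minimal user-space sketch in C of a bounded status poll, the general alternative to an unbounded wait loop. It is not taken from ata(4), and the report does not confirm that such a loop is the cause. `struct fake_dev`, `DEV_BUSY`, `read_status`, and `wait_not_busy` are hypothetical names that simulate a device whose busy bit may never clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status bit and device model; NOT taken from ata(4). */
#define DEV_BUSY 0x80u

struct fake_dev {
    uint8_t status;        /* simulated status register */
    int polls_until_ready; /* <= 0 means the device never becomes ready */
};

/* Simulated register read: clears BUSY once the countdown reaches zero. */
static uint8_t read_status(struct fake_dev *d)
{
    if (d->polls_until_ready > 0 && --d->polls_until_ready == 0)
        d->status &= (uint8_t)~DEV_BUSY;
    return d->status;
}

/*
 * Poll until BUSY clears or max_polls reads have been made.
 * Returns true if the device became ready, false on timeout, so the
 * caller can fail this one device instead of spinning forever.
 */
static bool wait_not_busy(struct fake_dev *d, int max_polls)
{
    assert(max_polls > 0);
    for (int i = 0; i < max_polls; i++) {
        if ((read_status(d) & DEV_BUSY) == 0)
            return true;
    }
    return false;
}
```

A caller would treat a false return as a failure of that one device, log the reason, and keep servicing the remaining hardware.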