Date: Tue, 5 Aug 2008 20:30:16 -0700
From: Jeremy Chadwick <koitsu@FreeBSD.org>
To: "Sean C. Farley" <scf@FreeBSD.org>
Cc: freebsd-stable@FreeBSD.org
Subject: Re: Stuck in geli
Message-ID: <20080806033016.GA35921@eos.sc1.parodius.com>
In-Reply-To: <alpine.BSF.1.10.0808051023220.1056@thor.farley.org>
References: <alpine.BSF.1.10.0808051023220.1056@thor.farley.org>

On Tue, Aug 05, 2008 at 10:45:16AM -0500, Sean C. Farley wrote:
> Rarely, a geli partition I have freezes a process in bufwait state.  It
> occurs after an ATA timeout message:
> Aug 5 03:47:13 thor kernel: ad10: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=219028637

This looks like the issue I've been tracking for months now.  I'm sorry
the document isn't complete; it's an issue of time...

http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting

> The geli partition resides on an Intel MatrixRAID RAID1 mirror using the
> ICH9R chipset (Asus P5K-E/WIFI).  Killing (even -9) the process does not
> work.  Rebooting is the only solution, yet the system is unable to flush
> the buffers and complete a clean unmounting.

After reading the Wiki page above, I hope you consider disabling
MatrixRAID and avoiding it entirely on FreeBSD.  Patches that address
major issues have been sitting untouched for 2+ years.  Draw your own
conclusions.

Also, you won't be able to kill -9 a process in that state.  The kernel
(or some piece of it) is hung, not the process.  The fact that a reboot
is required also does not surprise me.  You *might* have been able to
detach the ATA/SATA channel using atacontrol to get access to the
system, but then again it might result in a system panic (see Wiki).

> Both drives in the mirror have both survived a smartctl -t offline scan.

This doesn't really mean anything; the SMART statistics, self-test log,
and error log are what's more important.  Chances are it's not a disk
issue though...

> Also, a previous time it was with ad12, so I strongly doubt it is the
> drive.  It seems like a geli partition is unable to handle a timeout
> from a drive.

The problem is not with geli(4), as I see it.
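
To check whether the timeouts really are spread across both drives
rather than concentrated on one, the kernel log can be tallied per
device.  What follows is only a minimal sketch: the pattern is modelled
on the single log line quoted above, the plural form "retries left" is
an assumption, and count_timeouts is a hypothetical helper name, not
part of any FreeBSD tool.

```python
import re
from collections import Counter

# Modelled on the one message quoted in this thread:
#   ad10: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=219028637
# The "retries" plural is an assumption about how other counts are printed.
TIMEOUT_RE = re.compile(
    r"\b(?P<dev>ad\d+): TIMEOUT - (?P<cmd>\S+) retrying "
    r"\((?P<left>\d+) retr(?:y|ies) left\) LBA=(?P<lba>\d+)"
)


def count_timeouts(lines):
    """Return a Counter mapping device name (e.g. 'ad10') to timeout count."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts
```

It could be fed an open log file, for example
count_timeouts(open("/var/log/messages", errors="replace")), where the
path is again an assumption about where syslog writes kernel messages.
Timeouts showing up on both ad10 and ad12 would point away from a single
failing disk.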

> ad10:
> Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus family
> Device Model:     ST3160827AS
>
> ad12:
> Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus family
> Device Model:     ST3160827AS

My experience with disk timeouts on FreeBSD is that the OS does not
handle them well at all, whether or not geli(4) is in use.  The entire
system can deadlock, and in some cases panic (which for me is the more
common result).

I can't help myself here -- Linux's libata handles this much more
elegantly.  In the case of a failure similar to the above, there is a
brief system deadlock and then full system recovery, with EIO (I/O
error) being returned to any process stuck in that state.  There *is*
data loss, but I don't think there's anything one can do about that (on
Linux or FreeBSD); journalling filesystems should solve that aspect.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |