FreeBSD Mail Archives

Date:      Tue, 5 Aug 2008 20:30:16 -0700
From:      Jeremy Chadwick <koitsu@FreeBSD.org>
To:        "Sean C. Farley" <scf@FreeBSD.org>
Cc:        freebsd-stable@FreeBSD.org
Subject:   Re: Stuck in geli
Message-ID:  <20080806033016.GA35921@eos.sc1.parodius.com>
In-Reply-To: <alpine.BSF.1.10.0808051023220.1056@thor.farley.org>
References:  <alpine.BSF.1.10.0808051023220.1056@thor.farley.org>

On Tue, Aug 05, 2008 at 10:45:16AM -0500, Sean C. Farley wrote:
> Rarely, a geli partition I have freezes a process in bufwait state.  It
> occurs after an ATA timeout message:
> Aug  5 03:47:13 thor kernel: ad10: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=219028637

This looks like the issue I've been tracking for months now.  I'm sorry
the document isn't complete; it's an issue of time...

http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting

> The geli partition resides on an Intel MatrixRAID RAID1 mirror using the
> ICH9R chipset (Asus P5K-E/WIFI).  Killing (even -9) the process does not
> work.  Rebooting is the only solution, yet the system is unable to flush
> the buffers and complete a clean unmounting.

After reading the Wiki page above, I hope you'll consider disabling
MatrixRAID and avoiding it entirely on FreeBSD.  There are major issues
that have sat untouched for 2+ years, even though patches addressing
them were included.  Draw your own conclusions.

Also, you won't be able to kill -9 a process in that state.  The kernel
(or some piece of it) is hung, not the process.  The fact that a reboot
is required also does not surprise me.

You *might* have been able to detach the ATA/SATA channel using
atacontrol to get access to the system, but then again it might result
in a system panic (see Wiki).

> Both drives in the mirror have both survived a smartctl -t offline scan.

This doesn't really mean anything; the SMART statistics, self-test
log, and error log are what's more important.  Chances are it's not a
disk issue though...

> Also, a previous time it was with ad12, so I strongly doubt it is the
> drive.  It seems like a geli partition is unable to handle a timeout
> from a drive.

The problem is not with geli(4), as I see it.
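
One rough way to check whether the timeouts cluster on a single drive or
spread across several (ad10 one time, ad12 another) is to tally them from
the kernel log.  The sketch below is only an illustration: the regular
expression is modelled on the single log line quoted earlier, and the
function name count_timeouts is made up for this example.

```python
import re
from collections import Counter

# Modelled on the one log line quoted in this thread, e.g.
# "... kernel: ad10: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=219028637"
# Assumption: other ATA timeout messages follow the same shape.
TIMEOUT_RE = re.compile(r"kernel: (ad\d+): TIMEOUT - (\S+)")


def count_timeouts(lines):
    """Return a Counter mapping (device, command) to the number of timeouts seen."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Example run against the line from the original report.
sample = [
    "Aug  5 03:47:13 thor kernel: ad10: TIMEOUT - WRITE_DMA retrying "
    "(1 retry left) LBA=219028637",
]
for (device, command), total in sorted(count_timeouts(sample).items()):
    print(f"{device} {command}: {total}")
```

To run it against a real log, pass an open file object (for example the
system log, wherever syslog writes kernel messages on the machine) instead
of the sample list.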

> ad10:
> Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus family
> Device Model:     ST3160827AS
>
> ad12:
> Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus family
> Device Model:     ST3160827AS

My experience with disk timeouts on FreeBSD is that the OS does not
handle them well at all, regardless of whether geli(4) is in use.  The
entire system can deadlock, and in some cases panic (which for me is
the more common result).

I can't help myself here -- Linux's libata handles this much more
elegantly.  In the case of a failure similar to the above, there is a
brief system deadlock and then full system recovery with EIO (I/O error)
being returned to any process stuck in that state.  There *is* data
loss, but I don't think there's anything one can do about that (on Linux
or FreeBSD); journalling filesystems should solve that aspect.
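
For illustration, this is roughly what "EIO being returned to the
process" looks like from userland.  It is a minimal sketch, assuming the
kernel surfaces the failure as errno EIO on write or fsync as described
above; write_durably is a hypothetical helper name, not part of any
library.

```python
import errno
import os


def write_durably(fd, data):
    """Write all of data to fd and fsync it.

    Returns True on success and False if the kernel reports EIO.
    Any other OSError is re-raised unchanged.
    """
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may be a short write
            view = view[written:]
        os.fsync(fd)  # I/O errors on the device often surface here
        return True
    except OSError as exc:
        if exc.errno == errno.EIO:
            # The data may not have reached the disk; the caller decides
            # whether to retry, log, or give up.
            return False
        raise
```

A process stuck in bufwait never gets this far, since the call simply
does not return; the point of the sketch is only that once an error is
returned, the application can react to it.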

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20080806033016.GA35921>
