Date:      Tue, 13 Sep 2005 18:30:56 -0600
From:      Anthony Chavez <acc@anthonychavez.org>
To:        freebsd-stable@freebsd.org
Subject:   Re: Stress testing and TIMEOUT - WRITE_DMA
Message-ID:  <m24q8owkvz.fsf@pegasos.local>
References:  <m2br3lt5nk.fsf@pegasos.local> <m2slwbqrxf.fsf@pegasos.local> <1275346059.20050911223347@rulez.sk> <20050912061917.GP69713@pleiades.aeternal.net>


On Mon, 12 Sep 2005 08:19:18 +0200 martin hudec <corwin@aeternal.net> wrote:

> On Sun, Sep 11, 2005 at 10:33:47PM +0200 or thereabouts, Daniel Gerzo wrote:
>> On Fri, 26 Aug 2005 03:21:35 -0600 Anthony Chavez <acc@anthonychavez.org> wrote:
>> > Sep  6 11:35:27 mybox kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=8348191
>> > ...
>> > Sep  6 18:59:09 mybox kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=8348383
>> > Sep  6 19:04:58 mybox kernel: ad0: TIMEOUT - READ_DMA retrying (2 retries left) LBA=61749183
>> 
>> > The READ_DMA timeouts are happening very infrequently, but it's worth
>> > mentioning that I'm seeing them now in addition.
>> 
>> > This is quite disturbing, particularly when the machine in question is
>> > *in*production.*
>> 
>> I think you should back up your data quickly. The last time I saw
>> this kind of message, my disk died 3 days after they started
>> showing up in my log files. I wasn't able to write any data to the
>> disk (the system just suddenly panicked when I tried to mount it rw,
>> but I was able to mount it ro and copy most of the data). Note that
>> I wasn't able to copy about 10GB out of 30GB. So don't ignore them,
>> and good luck.
>
>   Hmmm, before trashing that disk, you could surely consider running
>   smartmontools to see what they have to say about the health of
>   your disk :).. go for sysutils/smartmontools.

Okay, I've actually got 3 identical drives (SAMSUNG SP0802N) in 3
identical systems, running identical hardware using Intel ICH4
controllers.

Only one of these machines has spat out 81 errors, over a period of
about 6.5 hours (so far).  This particular machine started producing the
warnings approximately 8 days after FreeBSD was installed.
Ironically, another one of these machines produced only 1 warning after
nearly 21 days and then another solitary warning 14 days after that
(which occurred as I was drafting this response).
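
For anyone who wants to compare counts like these across machines, here
is a minimal Python sketch that tallies the timeout warnings per host,
device and command.  It assumes the log lines look exactly like the ones
quoted earlier in this thread; the names TIMEOUT_RE, SAMPLE and
tally_timeouts are made up for illustration, and the pattern may need
adjusting for other message variants.

```python
import re
from collections import Counter

# Matches lines in the format quoted in this thread, e.g.
# "Sep  6 11:35:27 mybox kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=8348191"
# Other message variants are not covered (assumption: only this format matters).
TIMEOUT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+kernel:\s+"
    r"(?P<dev>\w+):\s+TIMEOUT - (?P<cmd>\w+) retrying "
    r"\((?P<left>\d+) retr(?:y|ies) left\)\s+LBA=(?P<lba>\d+)"
)

# The three complete log lines quoted in this thread.
SAMPLE = [
    "Sep  6 11:35:27 mybox kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=8348191",
    "Sep  6 18:59:09 mybox kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=8348383",
    "Sep  6 19:04:58 mybox kernel: ad0: TIMEOUT - READ_DMA retrying (2 retries left) LBA=61749183",
]


def tally_timeouts(lines):
    """Count timeout warnings keyed by (host, device, command)."""
    counts = Counter()
    for line in lines:
        m = TIMEOUT_RE.match(line)
        if m:  # silently skip anything that is not a timeout warning
            counts[(m["host"], m["dev"], m["cmd"])] += 1
    return counts


for (host, dev, cmd), n in sorted(tally_timeouts(SAMPLE).items()):
    print(f"{host} {dev} {cmd}: {n}")
```

To run it against a real log, pass an open file object to
tally_timeouts() instead of SAMPLE; kernel messages usually land in
/var/log/messages on a stock FreeBSD install, but that depends on the
local syslog.conf.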

smartctl reports that each of these drives passes the "SMART
overall-health self-assessment test", but goes on to report exactly 6
"SET MAX ADDRESS [OBS-6]" errors for each drive within 1 hour of uptime.
I do not think that any of these errors occurred at the same time as the
DMA warnings.

>   After that can one make assumptions whether it is faulty hardware or
>   ata patches :).

Well, the drives are pretty much brand new.  I think that it's safe to
assume that the health of these drives is not a concern, and smartctl
seems to confirm this.

On Mon, 12 Sep 2005 15:53:27 +0200 MaXX <bs139412@skynet.be> wrote:

> On Fri, 26 Aug 2005 03:21:35 -0600 Anthony Chavez <acc@anthonychavez.org> wrote:
>> My question is simply this: is the fact that I received 4 TIMEOUT
>> warnings in the space of roughly 2 weeks significant cause for concern?
> Hi,
> You may want to have a look at PR 85603 (FS corruption and 'uncorrectable' DMA
> errors on ATA disks after unclean shutdown) and see if that applies to you.

Thanks.  My hardware doesn't match, but I'll keep it in mind.

> Are you running a kernel built around mid June this year?

The machine that gave me 81 warnings after applying ata-mk3n:

FreeBSD 5.4-RELEASE-p6 #0: Sun Sep 11 21:57:16 MDT 2005     root@mybox1:/usr/obj/usr/src/sys/MYBOX1

The machine that's been in commission the longest:

FreeBSD 5.4-RELEASE #0: Sun Sep 11 21:46:18 MDT 2005     root@mybox2:/usr/obj/usr/src/sys/MYBOX2

New kid on the block:

FreeBSD 5.4-RELEASE-p6 #0: Sun Sep 11 21:58:08 MDT 2005     root@mybox3:/usr/obj/usr/src/sys/MYBOX3

FWIW, although they have different names, the kernel configs are exactly
the same.

> Did your machine panic before the DMA problems appeared (I think a power
> failure can do the trick too)?

No panic.  However, I recall reading that these warnings are a good
indication that a panic may be imminent, hence my call for help.

> In our case this problem was fixed by newfs, even though smartctl
> (sysutils/smartmontools) did report errors at the drive level. After newfs'ing
> the disk, no more messages (but they are still in the drive's log).

That seems very strange, particularly since I newfs'ed the disks
when installing FreeBSD.

Furthermore, this solution is not sufficient.  The machines that are
giving me this error are in crucial locations, and I need to know what
causes these errors, whether a fix is available, and whether I really
should worry about a few popping up now and then.

-- 
Anthony Chavez                                 http://anthonychavez.org/
mailto:acc@anthonychavez.org         jabber:acc@jabber.anthonychavez.org
