Date: Tue, 13 Sep 2005 18:30:56 -0600
From: Anthony Chavez <acc@anthonychavez.org>
To: freebsd-stable@freebsd.org
Subject: Re: Stress testing and TIMEOUT - WRITE_DMA
Message-ID: <m24q8owkvz.fsf@pegasos.local>
References: <m2br3lt5nk.fsf@pegasos.local> <m2slwbqrxf.fsf@pegasos.local> <1275346059.20050911223347@rulez.sk> <20050912061917.GP69713@pleiades.aeternal.net>

On Mon, 12 Sep 2005 08:19:18 +0200
martin hudec <corwin@aeternal.net> wrote:

> On Sun, Sep 11, 2005 at 10:33:47PM +0200 or thereabouts, Daniel Gerzo wrote:
>> On Fri, 26 Aug 2005 03:21:35 -0600 Anthony Chavez <acc@anthonychavez.org>
>> wrote:
>> > Sep 6 11:35:27 mybox kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=8348191
>> > ...
>> > Sep 6 18:59:09 mybox kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=8348383
>> > Sep 6 19:04:58 mybox kernel: ad0: TIMEOUT - READ_DMA retrying (2 retries left) LBA=61749183
>>
>> > The READ_DMA timeouts are happening very infrequently, but it's worth
>> > mentioning that I'm seeing them now in addition.
>>
>> > This is quite disturbing, particularly when the machine in question is
>> > *in*production.*
>>
>> I thing you should really quickly look for backuping your data. When
>> I was seeing this kind of messages last time, my disk died after 3
>> days from time they started showing up in my log files. I wasn't able
>> to write any data to the disk (system just sudennly paniced, when
>> I tried to mount it rw, but I was able to mount it ro and copy most of
>> the data) Note, that I wasn't able to copy about 10GB out of 30GB. So
>> don't ignore them and have a good luck.
>
> Hmmm, before trashing that disk, you could surely consider running
> smartmontools to see what they have to say about health condition of
> your disk :).. go for sysutils/smartmontools.

Okay, I've actually got 3 identical drives (SAMSUNG SP0802N) in 3
identical systems, running identical hardware using Intel ICH4
controllers.  Only one of these machines managed to spit 81 errors at
me over a period of about 6.5 hours (so far).  This particular machine
produced the warnings approximately 8 days after installing FreeBSD.
Ironically, another one of these machines only produced 1 warning
after nearly 21 days and then another solitary warning 14 days after
that (which occurred as I was drafting this response).
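
Per-machine counts like the ones above could be gathered with something along these lines. This is only a minimal sketch, assuming syslog lines in the format quoted earlier in the thread; the pattern, the function name tally_timeouts, and the sample list are illustrative and not part of the original message or of any FreeBSD tool.

```python
import re
from collections import Counter

# Matches ata timeout warnings in the form quoted in this thread, e.g.
# "... kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=8348191"
# (assumed format; other FreeBSD releases may word the message differently).
TIMEOUT_RE = re.compile(
    r"kernel: (?P<dev>ad\d+): TIMEOUT - (?P<op>\w+) retrying .*LBA=(?P<lba>\d+)"
)


def tally_timeouts(lines):
    """Count TIMEOUT warnings per (device, operation) pair."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("dev"), match.group("op"))] += 1
    return counts


if __name__ == "__main__":
    # Sample lines copied from the quoted log excerpt above.
    sample = [
        "Sep 6 11:35:27 mybox kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=8348191",
        "Sep 6 18:59:09 mybox kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=8348383",
        "Sep 6 19:04:58 mybox kernel: ad0: TIMEOUT - READ_DMA retrying (2 retries left) LBA=61749183",
    ]
    for (dev, op), n in sorted(tally_timeouts(sample).items()):
        print(dev, op, n)
```

In practice the input would be the lines of a log file such as /var/log/messages rather than a hard-coded list; keeping the operation name in the key makes it easy to see whether READ_DMA timeouts are starting to appear alongside WRITE_DMA ones.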

smartctl reports that each of these drives passes the "SMART
overall-health self-assessment test" but goes on to report that exactly
6 "SET MAX ADDRESS [OBS-6]" errors occurred for each drive within 1 hour
of uptime.  I do not think that any of these errors occurred at the same
time the DMA warnings did.

> After that can one make assumptions whether it is faulty hardware or
> ata patches :).

Well, the drives are pretty much brand new.  I think that it's safe to
assume that the health of these drives is not a concern, and smartctl
seems to confirm this.

On Mon, 12 Sep 2005 15:53:27 +0200
MaXX <bs139412@skynet.be> wrote:

> On Fri, 26 Aug 2005 03:21:35 -0600 Anthony Chavez <acc@anthonychavez.org>
> wrote:
>> My question is simply this: is the fact that I received 4 TIMEOUT
>> warnings in the space of roughly 2 weeks significant cause for concern?
> Hi,
> You may have a look at this pr :85603 (FS corruption and 'uncorrectable' DMA
> errors on ATA disks after unclean shutdown) and see if that applies for you.

Thanks.  My hardware doesn't match, but I'll keep it in mind.

> Are you running a kernel built around mid June this year?

The machine that gave me 81 warnings after applying ata-mk3n:

FreeBSD 5.4-RELEASE-p6 #0: Sun Sep 11 21:57:16 MDT 2005
    root@mybox1:/usr/obj/usr/src/sys/MYBOX1

The machine that's been in commission the longest:

FreeBSD 5.4-RELEASE #0: Sun Sep 11 21:46:18 MDT 2005
    root@mybox2:/usr/obj/usr/src/sys/MYBOX2

New kid on the block:

FreeBSD 5.4-RELEASE-p6 #0: Sun Sep 11 21:58:08 MDT 2005
    root@mybox3:/usr/obj/usr/src/sys/MYBOX3

FWIW, although they have different names, the kernel configs are
exactly the same.

> Did your machine paniced before the DMA problems appears (I think a power
> faillure can do the trick too)?

No panic.  However, I recall reading that these warnings are a good
indication that a panic may be imminent, hence my call for help.

> In our case this problem was fixed by newfs, even smartctl
> (sysutils/smartmontool) did report errors at the drive level.
> After newfs'ing the disk no more message (but they still in the
> drive's log).

That seems very strange, particularly when I have newfs'ed the disks
when installing FreeBSD.

Furthermore, this solution is not sufficient.  The machines that are
giving me this error are in crucial locations and I need to know what
causes these errors and if a fix is available or if I really should
worry about a few popping up now and then.

-- 
Anthony Chavez                                 http://anthonychavez.org/
mailto:acc@anthonychavez.org         jabber:acc@jabber.anthonychavez.org