Date:      Sat, 8 Jun 2002 11:52:44 +0200 (SAST)
From:      Willie Viljoen <will@laserfence.net>
To:        freebsd-bugs@freebsd.org
Subject:   Problem with ATA driver, older hard disks and SMP
Message-ID:  <20020608114421.I464-100000@phoenix.vh.laserfence.net>

I previously submitted a less informed report about this problem, but I am
now fairly certain it lies in the ATA code.

I left the machine alone for about a month. Yesterday a power failure
forced a reboot, which also forced an upgrade from 4.5-RELEASE (GENERIC),
which had been running as an interim measure, to a 4.6-RC kernel compiled
some days earlier.

4.6-RC (compiled with SMP support) again failed very shortly after bootup,
while init was still underway.

I have no log messages, because logging no longer works after the
failure, but the last thing printed is a message from the ata driver
reporting an IRQ/DMA timeout error on ad0.

At this point the machine simply freezes, to such an extent that even
keystrokes have no effect and the screensaver does not kick in. The
machine is a server, so it is not usually connected to a screen. However,
when I entered the server room yesterday, about an hour after the power
failure, I found it jammed solid. Upon plugging in the screen, I saw that
it had stopped in the middle of the boot process, with the ad0 timeout
message as the last thing on screen, right after some drivel about
daemons from init. The screen saver (blank screen after 5 minutes) had
not taken effect, and the machine was not responding to keystrokes, not
even to the special reboot and diagnostic sequences that these old Dell
machines (Dell PowerEdge SP590-2) have.

Out of desperation (and not wanting to revert yet again to the non-SMP
4.5-RELEASE kernel), I decided to remove the new ata driver and replace it
with the obsolete wdc and wd code.

The machine now appears to be stable. I have received similar messages
from the wdc driver about wd0, but processing continues after that point,
without causing the odd kernel lockup described above.

The timeout errors now take the form:
wd0: interrupt timeout (status 58<rdy,seekdone,drq> error 0)
wd0: interrupt timeout (status 58<rdy,seekdone,drq> error 1<no_dam>)
wd0: interrupt timeout (status 58<rdy,seekdone,drq> error 1<no_dam>)
wd0: interrupt timeout (status 58<rdy,seekdone,drq> error 1<no_dam>)
wd0: interrupt timeout (status 50<rdy,seekdone> error 1<no_dam>)
wd0: Last time I say: interrupt timeout.  Probably a portable PC. (status 58<rdy,seekdone,drq> error 1<no_dam>)
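
For reference, the numbers in these messages are the ATA status and error
register values, printed in hex and followed by the names of the bits
that are set. The short Python sketch below reproduces that decoding; the
helper names decode and fmt are hypothetical, and the bit labels other
than rdy, seekdone, drq and no_dam (the ones actually seen in the log) are
assumptions based on the standard ATA register layout. It shows that 0x58
is rdy (0x40) + seekdone (0x10) + drq (0x08), 0x50 is the same without
drq, and error 0x01 is no_dam (address mark not found).

```python
# ATA status register bits, most significant first. Only "rdy",
# "seekdone" and "drq" appear in the wd0 log; the other labels are
# assumed names for the remaining standard bits.
STATUS_BITS = [
    (0x80, "busy"),
    (0x40, "rdy"),       # drive ready
    (0x20, "fault"),     # drive write fault
    (0x10, "seekdone"),  # seek complete
    (0x08, "drq"),       # data request: drive is ready to transfer data
    (0x04, "corr"),      # corrected data
    (0x02, "index"),
    (0x01, "err"),       # error register holds the details
]

# ATA error register bits. Only "no_dam" (address mark not found) appears
# in the log; the other labels are assumptions.
ERROR_BITS = [
    (0x80, "badblk"),
    (0x40, "uncorr"),
    (0x10, "id_crc"),
    (0x04, "abort"),
    (0x02, "tr000"),
    (0x01, "no_dam"),
]


def decode(value, bits):
    """Return the names of all bits set in a register value."""
    return [name for mask, name in bits if value & mask]


def fmt(value, bits):
    """Format a register value the way the wd0 messages print it."""
    names = decode(value, bits)
    if not names:
        return f"{value:x}"
    return f"{value:x}<{','.join(names)}>"


if __name__ == "__main__":
    # The three distinct status/error pairs from the log above.
    for status, error in [(0x58, 0x00), (0x58, 0x01), (0x50, 0x01)]:
        print(f"status {fmt(status, STATUS_BITS)} error {fmt(error, ERROR_BITS)}")
```

Run as a script, it prints the same status/error strings that appear in
the wd0 messages above.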

These appear several hours apart, up to the last one. After each one,
however, the machine simply continues normally.

The timeouts don't bother me that much, as the drive is a very old Seagate
ST31720A, a model known for problems. I only use it for the root and /usr
file systems; my main storage is on a SCSI RAID array.

When running in uniprocessor mode, the ATA driver outputs a message
similar to the one that appears right before the SMP lockups, but the
system continues normally. My understanding, therefore, is that the SMP
code and the ata driver together somehow put the system into the hard
lock I described, right after the drive times out.

Regards
Will

-- 
Willie Viljoen
Private IT Consultant

214 Paul Kruger Avenue
Universitas
Bloemfontein
9321

South Africa

+27 51 522 15 60, a/h +27 51 522 44 36
+27 82 404 03 27

will@laserfence.net





