From owner-freebsd-bugs Sat Jun 8 2:53: 5 2002
Delivered-To: freebsd-bugs@freebsd.org
Received: from prometheus.vh.laserfence.net (prometheus.laserfence.net [196.44.73.116]) by hub.freebsd.org (Postfix) with ESMTP id 4F21E37B404 for ; Sat, 8 Jun 2002 02:52:56 -0700 (PDT)
Received: from phoenix.vh.laserfence.net ([192.168.0.10]) by prometheus.vh.laserfence.net with esmtp (Exim 3.34 #1) id 17Gcto-0001su-00 for freebsd-bugs@freebsd.org; Sat, 08 Jun 2002 11:52:44 +0200
Date: Sat, 8 Jun 2002 11:52:44 +0200 (SAST)
From: Willie Viljoen
X-X-Sender: will@phoenix.vh.laserfence.net
To: freebsd-bugs@freebsd.org
Subject: Problem with ATA driver, older hard disks and SMP
Message-ID: <20020608114421.I464-100000@phoenix.vh.laserfence.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-bugs@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.org

I previously submitted a less informed report about this problem, but I am now fairly certain that it is in the ATA code.

I left the machine alone for about a month, and yesterday a power failure forced a reboot. That also forced an upgrade from 4.5-RELEASE (GENERIC), which had been running as an interim measure, to 4.6-RC, which had been compiled some days prior. 4.6-RC (compiled with SMP support) again failed very shortly after bootup, while init was still underway.

I have no log messages, because logging no longer functions after the failure, but the last thing reported is a message from the ata driver saying that ad0 has had an IRQ/DMA timeout error. At this point, the machine simply freezes, to such an extent that even keystrokes have no effect and the screensaver does not kick in.

The machine is a server, so it is not usually connected to a screen. However, when I entered the server room yesterday, about an hour after the power failure, I found it was jammed solid. Upon plugging in the screen, I saw that it had stopped in the middle of the boot process, with the ad0 timeout message as the last thing to appear on screen, right after some dribble about daemons from init. The screen saver (blank screen after 5 minutes) had not taken effect, and the machine was not responding to keystrokes, not even to the special reboot and diag sequences that these old Dell machines (Dell PowerEdge SP590-2) have.

Out of desperation (and not wanting to revert again to the non-SMP 4.5-RELEASE kernel), I decided to remove the new ata driver and replace it with the obsolete wdc and wd code. The machine now appears to be stable. I have received similar messages from the wdc driver about wd0, but processing continues after that point, without causing the odd kernel lockup I mentioned above.

The timeout errors now take the form:

wd0: interrupt timeout (status 58 error 0)
wd0: interrupt timeout (status 58 error 1)
wd0: interrupt timeout (status 58 error 1)
wd0: interrupt timeout (status 58 error 1)
wd0: interrupt timeout (status 50 error 1)
wd0: Last time I say: interrupt timeout. Probably a portable PC. (status 58 error 1)

These pop up with a number of hours between them, until the last. However, after their appearance, the machine simply seems to continue normally.

The timeouts don't bother me that much, as the drive is a very old Seagate ST31720A, a model renowned for problems. I simply use it for the root and usr file systems; my main storage is on a SCSI RAID array.

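For anyone trying to read the wd0 timeout messages quoted in this report, here is a minimal sketch that pulls the status and error values out of each line and names the status bits. It assumes the two numbers are hexadecimal and that the status value follows the standard ATA status register layout; the function names and the regular expression are illustrative and do not come from the driver.

```python
import re

# Standard ATA status register bits (assumed to apply to the values
# the wd driver prints; names follow the ATA specification).
STATUS_BITS = [
    (0x80, "BSY"),   # device busy
    (0x40, "DRDY"),  # device ready
    (0x20, "DF"),    # device (write) fault
    (0x10, "DSC"),   # seek complete
    (0x08, "DRQ"),   # data request
    (0x04, "CORR"),  # corrected data
    (0x02, "IDX"),   # index
    (0x01, "ERR"),   # error register holds details
]

# Hypothetical pattern for lines such as
# "wd0: interrupt timeout (status 58 error 1)".
LINE_RE = re.compile(
    r"(wd\d+): .*\(status ([0-9a-fA-F]+) error ([0-9a-fA-F]+)\)"
)


def decode_status(value):
    """Return the names of the status bits that are set in value."""
    return [name for mask, name in STATUS_BITS if value & mask]


def parse_timeout(line):
    """Return (device, status, error, status bit names), or None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    status = int(match.group(2), 16)  # assumption: printed in hex
    error = int(match.group(3), 16)   # assumption: printed in hex
    return match.group(1), status, error, decode_status(status)


if __name__ == "__main__":
    for line in (
        "wd0: interrupt timeout (status 58 error 0)",
        "wd0: interrupt timeout (status 50 error 1)",
    ):
        print(parse_timeout(line))
```

Under those assumptions, status 58 decodes to DRDY, DSC and DRQ, and status 50 to DRDY and DSC.
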
When running in uniprocessor mode, the ATA driver outputs a message similar to the one that appears right before the SMP crashes, but the system continues normally. My understanding is therefore that the SMP code and the ata driver together somehow cause the system to go into the hard lock I described, right after the drive causes a timeout.

Regards
Will

--
Willie Viljoen
Private IT Consultant
214 Paul Kruger Avenue
Universitas
Bloemfontein 9321
South Africa
+27 51 522 15 60, a/h +27 51 522 44 36
+27 82 404 03 27
will@laserfence.net

To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-bugs" in the body of the message