From owner-freebsd-questions@FreeBSD.ORG  Sun Jul  6 10:14:41 2003
Return-Path: <owner-freebsd-questions@FreeBSD.ORG>
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 851DE37B401
	for <freebsd-questions@freebsd.org>;
	Sun,  6 Jul 2003 10:14:41 -0700 (PDT)
Received: from mail.gmx.net (imap.gmx.net [213.165.64.20])
	by mx1.FreeBSD.org (Postfix) with SMTP id 6F3BD43F93
	for <freebsd-questions@freebsd.org>;
	Sun,  6 Jul 2003 10:14:40 -0700 (PDT)
	(envelope-from blueeskimo@gmx.net)
Received: (qmail 13596 invoked by uid 65534); 6 Jul 2003 17:14:38 -0000
Received: from dsl-cust-145.openweb.ca (EHLO [64.39.186.145]) (64.39.186.145)
  by mail.gmx.net (mp025) with SMTP; 06 Jul 2003 19:14:38 +0200
From: Adam <blueeskimo@gmx.net>
To: FreeBSD-Questions <freebsd-questions@freebsd.org>
Content-Type: text/plain
Message-Id: <1057511651.581.27.camel@elwood>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.4.0 
Date: 06 Jul 2003 13:14:12 -0400
Content-Transfer-Encoding: 7bit
Subject: More hardware problems (advice needed)
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: User questions <freebsd-questions.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>,
	<mailto:freebsd-questions-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-questions>
List-Post: <mailto:freebsd-questions@freebsd.org>
List-Help: <mailto:freebsd-questions-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>,
	<mailto:freebsd-questions-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 06 Jul 2003 17:14:41 -0000

My main FreeBSD (4.8) box has died on me again, and I'm 99% certain it's
due to hardware failure. However, I'm having a very hard time
determining what hardware is going bad, due to the nature of the crash.

Let me describe the scenario.

I was working on the machine, not doing anything out of the ordinary.
All of a sudden, my mouse stopped responding. I thought maybe moused had
crashed, so I did 'ps -aux |fgrep moused'. This caused ps to segfault,
which caused me to nearly soil myself. So, I decided to quickly kill all
my apps and exit X so I could reboot. When I closed X, I noticed a lot
of errors on my console about dc0 (my Linksys NIC interface, external)
having underruns, and that ad2 had timed out. I also noticed that my LAN
connection to my other box was dead. I tried to reboot, and all went
well until it got to 'Rebooting...', at which point it hung. I
waited for 10+ minutes, thinking it might eventually reboot, but it was
stuck, so I turned it off. 

When I powered back up, I got tons of errors that the kernel couldn't be
loaded, and I couldn't even get into single-user mode. So, I made a
fixit floppy, fired up the fixit shell, and started poking around to
see what happened. I was able to mount ad3 and ad2 just fine, but
mounting ad0 caused fixit to panic and the machine to reboot.

So, this is where I am now. For those of you who remember, I had
another crash & burn experience on that machine a couple of months ago,
where the machine just suddenly froze completely and my ad0 was trashed
when I booted back up. That time, I didn't have backups. This time, I do.
But, before I work on that computer again, I think I need to replace
some hardware.

I've heard pretty good arguments for both the ad0 drive (Western Digital
120 GB, 2 MB cache) and for the motherboard/CPU (Asus A7V266-E, Athlon
1600+). I used memtest86 to test the RAM, which came up clean.

I doubt it's a power problem, since I've got a very nice case (Antec
1080, 400+ watts). Also, I've got another machine in my apartment that
hasn't experienced any weird problems like this. 

The CPU might be overheating, but it's hard to tell. Roughly 5 minutes
after the crash, I checked the CPU temperature from the BIOS, which
registered 63C for the CPU. I have no idea how hot the CPU was at the
time of the crash, but it must have cooled off at least a bit in
those 5 minutes.

I don't have enough $$ to replace all the hardware, so I'd like some
expert advice as to what is the most likely culprit. I don't know if
I'll be able to convince any of Asus, AMD, or Western Digital to give me
an RMA number, but I can try (also would like some advice on this to
maximize my chances).

Thanks,
Adam