From owner-freebsd-questions@FreeBSD.ORG  Wed Jun 22 15:59:14 2005
Return-Path: <owner-freebsd-questions@FreeBSD.ORG>
X-Original-To: freebsd-questions@freebsd.org
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id A790C16A41C
	for <freebsd-questions@freebsd.org>;
	Wed, 22 Jun 2005 15:59:14 +0000 (GMT) (envelope-from matt@atopia.net)
Received: from neptune.atopia.net (neptune.atopia.net [209.128.231.90])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 7CADE43D5C
	for <freebsd-questions@freebsd.org>;
	Wed, 22 Jun 2005 15:59:14 +0000 (GMT) (envelope-from matt@atopia.net)
Received: from [192.168.0.102] (pcp173257pcs.plsntv01.nj.comcast.net
	[68.46.70.16])
	by neptune.atopia.net (Postfix) with ESMTP id 416E840B4;
	Wed, 22 Jun 2005 11:59:13 -0400 (EDT)
Message-ID: <42B98AD0.7080508@atopia.net>
Date: Wed, 22 Jun 2005 11:59:12 -0400
From: Matt Juszczak <matt@atopia.net>
User-Agent: Mozilla Thunderbird 0.9 (X11/20041129)
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Ted Mittelstaedt <tedm@toybox.placo.com>
References: <LOBBIFDAGNMAMLGJJCKNGEMKFBAA.tedm@toybox.placo.com>
In-Reply-To: <LOBBIFDAGNMAMLGJJCKNGEMKFBAA.tedm@toybox.placo.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: freebsd-questions@freebsd.org
Subject: Re: FreeBSD Machines dieing, we've tried so much....
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: User questions <freebsd-questions.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
	<mailto:freebsd-questions-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-questions>
List-Post: <mailto:freebsd-questions@freebsd.org>
List-Help: <mailto:freebsd-questions-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
	<mailto:freebsd-questions-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 22 Jun 2005 15:59:14 -0000


>The vast majority of panics are hardware-related.  It is rare nowadays
>for a usermode program to make the system panic.  In particular you said
>the problem happens more under load.  That really points even more to a
>hardware problem - bad CPU cache RAM, bad RAM, SCSI termination, that
>sort of thing.
>
>Ted
>  
>

This is going to be something of a blanket reply to all the recent
suggestions.  I appreciate them :)   Ted, sorry, my other posts had the
dmesg output and hardware specs, but I couldn't remember the subject
line of that thread.  I'll be more descriptive here.

We have two different servers crashing.  Both are SMP, but on different
hardware.  We have five FreeBSD servers in total, and only two are
affected.  That is why I do not believe this is a hardware problem.

In any case, the machines are in a cold room where the temperature is
constantly maintained.  Twenty other servers in there are perfectly
stable, with no problems.

The machine that crashed last night while running portsdb -uU is orion,
a Super Micro box with dual 3.06 GHz CPUs (hyperthreading disabled in
the BIOS) and 4 GB of memory.  We ran a memory test on orion a week or
so ago, and it found 70,000 ECC errors.  Those were fixed, and the
machine had been stable until last night.  I've now disabled SMP
support; we'll see whether that keeps it stable.  portsdb -uU ran
without problems after I disabled SMP.

As for uranus, the other box (we use a planet naming scheme for a
certain set of servers), we ran memtest86 and found no errors at all.
That box crashed about two days ago but has been stable since.  It has
never lasted more than a week without hitting a kernel trap and
freezing.

Both of these servers have the problem.  Out of the five FreeBSD servers
we have, these two are the ones with the highest load.  Maybe a higher
load on the other three servers would cause the same problem.  I agree
with you that this looks like a hardware problem, but the fact that it
happens on more than one server, on two different hardware platforms,
and on the ones with our highest load makes me reconsider.

If this is truly a bug in FreeBSD 5.4-RELEASE, maybe it is something
that has already been fixed in -STABLE?  I will compile a debug kernel
today and try to provide a trace of the problem.  I'll do it on
whichever server crashes next.
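
For reference, a minimal sketch of what that debug setup might look like
on 5.x.  The kernel config name DEBUGKERN and the dump device
/dev/da0s1b are assumptions made for illustration, not details of these
machines; the dump device has to be a swap partition at least as large
as physical memory.

```
# Additions to a custom kernel config (the name DEBUGKERN is hypothetical)
makeoptions	DEBUG=-g	# also build kernel.debug with full symbols
options 	KDB		# kernel debugger framework
options 	DDB		# interactive in-kernel debugger

# Additions to /etc/rc.conf (the device name is hypothetical)
# Swap partition that receives the crash dump on panic:
dumpdev="/dev/da0s1b"
# Directory where savecore(8) writes vmcore.N on the next boot:
dumpdir="/var/crash"
```

After the next panic and reboot, running kgdb against the kernel.debug
file from the kernel build directory and the saved /var/crash/vmcore.N,
then typing "bt" at the prompt, should give a backtrace that can be
posted to the list.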