From: "Chad Leigh -- Shire.Net LLC"
Date: Wed, 22 Jun 2005 10:12:04 -0600
To: Matt Juszczak
Cc: freebsd-questions questions
Subject: Re: FreeBSD Machines dieing, we've tried so much....
Message-Id: <41AD7E3D-E59C-4AAF-803F-11048A005D44@shire.net>
In-Reply-To: <42B98AD0.7080508@atopia.net>
References: <42B98AD0.7080508@atopia.net>

On Jun 22, 2005, at 9:59 AM, Matt Juszczak wrote:

>> The vast majority of panics are hardware-related.
>> It is rare nowadays for a usermode program to make the system panic.
>> In particular you said the problem happens more under load. That
>> really points even more to a hardware problem - bad CPU cache RAM,
>> bad RAM, SCSI termination, that sort of thing.
>>
>> Ted
>
> This is kind of going to be a blanket post to all the recent
> suggestions to me. I appreciate suggestions :) Ted, sorry, my other
> posts had dmesg and hardware specs, etc. I just couldn't remember the
> subject line of that thread. I'll be more descriptive here.
>
> We have two different servers crashing. Both are SMP, but on different
> hardware. We have five FreeBSD servers in total, and only two are
> affected. That is why I do not believe this is a hardware problem.
>
> In any case, the machines are in a cold room where the temperature is
> constantly maintained. 20 other servers in there are perfectly stable,
> with no problems.
>
> This particular machine that crashed last night while running
> portsdb -uU is a Super Micro machine, with hyperthreading disabled in
> the BIOS, dual 3.06 GHz CPUs, and 4 GB of memory. We ran memtest on
> orion (the machine that crashed last night) a week or so ago, and it
> found 70,000 ECC errors. Those were fixed and that machine has been
> stable until last night. I've now disabled SMP support; we'll see if
> that keeps it stable or not. portsdb -uU ran without problems after I
> disabled SMP.
>
> As far as uranus, the other box (we keep a planet scheme for a certain
> set of servers), we ran memtest86 and found no errors at all. That box
> crashed about two days ago but has been stable since. It has not
> lasted more than a week without doing a kernel trap and freezing.
>
> It seems that both these servers have this problem. Out of the five
> FreeBSD servers we have, these two are the ones with the highest load.
> Maybe a higher load on the other three servers would cause the same
> problem.
> I would agree with you that this is a hardware problem, but seeing it
> on more than one server, with two different architectures, and only
> under our highest load makes me reconsider.
>
> If this is truly a bug in FreeBSD 5.4-RELEASE, maybe this is something
> that has been fixed in -stable? I will compile a debug kernel today
> and try to provide a trace of the problem. I'll do it on whichever
> server crashes next.

What do they have in common? Disk controller? Network controller?

Chad

---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad@shire.net