From: Matt Juszczak
Date: Wed, 22 Jun 2005 11:59:12 -0400
To: Ted Mittelstaedt
Cc: freebsd-questions@freebsd.org
Subject: Re: FreeBSD Machines dieing, we've tried so much....

>The vast majority of panics are hardware-related. It is rare nowadays
>for a usermode program to make the system panic. In particular you said
>the problem happens more under load. That really points even more to a
>hardware problem - bad CPU cache ram, bad ram, scsi termination, that
>sort of thing.
>
>Ted

This is going to be a blanket reply to all the recent suggestions. I appreciate them :) Ted, sorry: my other posts had the dmesg output, hardware specs, etc., but I couldn't remember the subject line of that thread. I'll be more descriptive here.
We have two different servers crashing. Both are SMP, but on different hardware. We have five FreeBSD servers in total, and only two are affected, which is why I do not believe this is a hardware problem. In any case, the machines are in a cold room where the temperature is constantly maintained, and the 20 other servers in there are perfectly stable.

The machine that crashed last night while running portsdb -uU is orion, a Super Micro box with dual 3.06 GHz CPUs, hyperthreading disabled in the BIOS, and 4 GB of memory. We ran a memory test on orion a week or so ago, and it found 70,000 ECC errors. Those were fixed, and the machine had been stable until last night. I've now disabled SMP support; we'll see whether that keeps it stable. portsdb -uU ran without problems after I disabled SMP.

As for uranus, the other box (we use a planet naming scheme for a certain set of servers), we ran memtest86 and found no errors at all. That box crashed about two days ago but has been stable since. It has never lasted more than a week without hitting a kernel trap and freezing.

So both of these servers have the problem. Of the five FreeBSD servers we have, these two carry the highest load; maybe a higher load on the other three would cause the same problem. I agree that this could be a hardware problem, but seeing it on more than one server, on two different architectures, and only on our most heavily loaded machines makes me reconsider. If this is truly a bug in FreeBSD 5.4-RELEASE, maybe it has been fixed in -stable?

I will compile a debug kernel today and try to provide a trace of the problem. I'll do it on whichever server crashes next.