Date: Wed, 22 Jun 2005 10:12:04 -0600
From: "Chad Leigh -- Shire.Net LLC" <chad@shire.net>
To: Matt Juszczak <matt@atopia.net>
Cc: freebsd-questions questions <freebsd-questions@freebsd.org>
Subject: Re: FreeBSD Machines dieing, we've tried so much....
Message-ID: <41AD7E3D-E59C-4AAF-803F-11048A005D44@shire.net>
In-Reply-To: <42B98AD0.7080508@atopia.net>
References: <LOBBIFDAGNMAMLGJJCKNGEMKFBAA.tedm@toybox.placo.com> <42B98AD0.7080508@atopia.net>

On Jun 22, 2005, at 9:59 AM, Matt Juszczak wrote:

>> The vast majority of panics are hardware-related. It is rare nowadays
>> for a usermode program to make the system panic. In particular you
>> said the problem happens more under load. That really points even
>> more to a hardware problem - bad CPU cache ram, bad ram, scsi
>> termination, that sort of thing.
>>
>> Ted
>
> This is kind of going to be a blanket post to all the recent
> suggestions to me. I appreciate suggestions :) Ted, sorry, my other
> posts had dmesg and hardware specs, etc. I just couldn't remember the
> subject line of that thread. I'll be more descriptive here.
>
> We have two different servers crashing. Both are SMP, but on
> different hardware. We have five freeBSD servers in total, and only
> two are affected. That is why I do not believe this is a hardware
> problem.
>
> In any case, the machines are in a cold room where the temperature is
> constantly maintained. 20 other servers in there are perfectly
> stable, with no probs.
>
> This particular machine that crashed last night while running
> portsdb -uU is a Super Micro machine, with hyperthreading disabled in
> the bios, dual CPU 3.06 ghz, with 4 gigs memory. We ran mem test on
> orion (the machine that crashed last night) a week or so ago, and it
> found 70,000 ECC errors. Those were fixed and that machine has been
> stable until last night. I've now disabled SMP support, we'll see if
> that keeps it stable or not. Portsdb -uU ran without problems after I
> disabled SMP.
>
> As far as uranus, the other box (we keep a planet scheme for a
> certain set of servers), we ran memtest86 and found no errors at all.
> That box crashed about two days ago but has been stable since. It has
> not lasted more than a week without doing a kernel trap and freezing.
>
> It seems that both these servers have this problem. Out of the five
> FreeBSD servers we have, these two are the ones with the highest
> load. Maybe a higher load on the other three servers would cause the
> same problem. I agree with you that this is a hardware problem, but
> on more than one server with two different architectures and our
> highest load makes me re-consider.
>
> If this is truly a bug in FreeBSD 5.4-RELEASE, maybe this is
> something that has been fixed in -stable? I will compile a debug
> kernel today and try to provide a trace to the problem. I'll do it on
> which ever server crashes next.

What do they have in common? Disk controller? Network controller?

Chad

---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad@shire.net