Date: Tue, 18 Jan 2005 10:59:52 +0000 (GMT)
From: Robert Watson
To: Vivek Khera
cc: freebsd-stable@freebsd.org
Subject: Re: 5.3-RELEASE crashes during make buildworld (and other problems)
In-Reply-To: <557348B4-6906-11D9-B522-000A95D14982@khera.org>

On Mon, 17 Jan 2005, Vivek Khera wrote:

> On Jan 13, 2005, at 4:46 AM, Peter Jeremy wrote:
>
> > That doesn't totally rule out hardware. Pattern-sensitive memory
> > problems may not show up on different operating systems (or even
> > different kernels). That said, based on the trap information, I'd
> > look at a software cause first.
>
> Indeed. I once had a box that would run Linux 100% stable under any
> load for months on end, but with BSD/OS it would crap out (random
> processes fail) after a max of 3 weeks requiring a reboot.
>
> Never rule out bad hardware, especially with PC crap.

Even minor OS revisions can reveal or hide memory problems. For example,
for quite a while one of my Pentium (1!) server boxes had a single-bit
error (a stuck-on bit) that fell into a section of memory that always held
pinned kernel pages, and in particular, ended up holding a fairly obscure
kernel code branch in a module that was loaded. Then one day the kernel
memory layout got changed a bit, and the page ended up being paged into
user memory, resulting in frequent application segfaults and data
corruption. I was sure it was the OS upgrade, since backing out to the
previous kernel/modules fixed it reliably ... until I ran a memory test
and figured out what was actually happening. It was pretty frustrating to
try to debug, and it reinforces the conclusion that a bit of legwork on a
badly behaving system, to rule out the hardware faults that are easy to
rule out, can go a long way.

Which isn't to say that the problem in this thread is hardware, but you
don't want to spend two weeks tracking a kernel bug only to find out that
swapping out the memory with a seemingly identical DIMM fixes it. Checking
ethernet cabling and link negotiation, running a decent memory test,
checking SCSI termination, checking ATA cable type, etc., are good first
steps in debugging a problem whose symptoms a hardware fault could also
explain.

Oh, and if it's your parents calling on the phone at 6:30am with a printer
problem, the first thing to ask is whether their printer is plugged in.
:-)

Robert N M Watson