From: Jan Grant
Date: Mon, 20 Oct 2003 16:38:19 +0100 (BST)
To: stable@freebsd.org
Subject: Expert input required: P4 odd signals, no apparent memory fault, DISABLE_PSE?

I'm tracking -STABLE on a 1.8GHz P4 with 512MB of memory. Roughly since the PAE changes were MFCed, I've been seeing memory-corruption-related errors under specific circumstances: for example, a run of portsdb -fUu can be guaranteed to generate SIGBUS, SIGILL and SIGSEGVs in a handful of sh, sed, etc. processes. However, reverting to a 4.8 kernel from prior to September either hides/masks these errors, or no longer triggers them.

The memory/mobo _appears_ to check out OK under (ferinstance) extended runs of memtest86.

Now, on -current I've seen reference to the DISABLE_PSE kernel option, and some discussion that this behaviour may be due to a processor/timing bug.
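To put numbers on the symptom described above when comparing kernels, something like the following might help. It is only a rough sketch: the function name tally_signal_deaths is made up for illustration, and it assumes a Python interpreter is available on the box. It runs a command repeatedly and counts how often the command itself was killed by a signal. Note that it only sees the top-level command; child processes (the sh and sed instances mentioned above) dying inside a run would have to be picked up some other way, for instance from the system log.

```python
import signal
import subprocess
import sys
from collections import Counter


def tally_signal_deaths(cmd, runs):
    """Run `cmd` `runs` times and count terminations by signal name.

    Only the top-level process is observed; children that die inside
    the run are not counted here.
    """
    deaths = Counter()
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # On POSIX, a return code of -N means "killed by signal N".
        if result.returncode < 0:
            deaths[signal.Signals(-result.returncode).name] += 1
    return deaths


if __name__ == "__main__":
    # Stand-in workload (an assumption for the demo): a child process
    # that sends itself SIGSEGV. On the affected machine this would be
    # replaced by the real trigger, e.g. ["portsdb", "-fUu"].
    stand_in = [
        sys.executable,
        "-c",
        "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)",
    ]
    print(tally_signal_deaths(stand_in, runs=3))
```

Running the same workload the same number of times under the old and new kernels, with and without the option, would at least give a like-for-like count rather than an impression.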
So I have the following questions, which I'd appreciate an expert giving a definitive opinion on (I'm no x86/hardware hacker, me):

- are these problems potentially caused by this bug?
- what exactly does DISABLE_PSE do? (it's undocumented, and a one-para explanation of the expected behaviour of this option would be appreciated)
- were any commits around the time of the MFC of the PAE code liable to have introduced problems into the kernel which this workaround might address?

I know it's a lot to ask, but both hardware and OS have been rock-solid up until this point. Although I've not conclusively ruled out hardware faults, the continued stability under high load of a pre-September 4.8 kernel makes me suspicious that this is more likely to be a bug getting tickled than I'd normally suspect from these symptoms.

I'm about to experiment with this option, but it currently feels a little like cargo-cult admin. If there are any definitive tests that would indicate whether this hardware problem is present and addressed by this option, that'd be nice to know too.

Cheers,
jan

-- 
jan grant, ILRT, University of Bristol. http://www.ilrt.bris.ac.uk/
Tel +44(0)117 9287088 Fax +44 (0)117 9287112 http://ioctl.org/jan/
"No generalised law is without exception." A self-demonstrating axiom.