From owner-freebsd-hackers Tue Jun 25 02:00:56 1996
Return-Path: owner-hackers
Received: (from root@localhost)
    by freefall.freebsd.org (8.7.5/8.7.3) id CAA04509
    for hackers-outgoing; Tue, 25 Jun 1996 02:00:56 -0700 (PDT)
Received: from seagull.rtd.com (root@seagull.rtd.com [198.102.68.2])
    by freefall.freebsd.org (8.7.5/8.7.3) with ESMTP id CAA04500
    for ; Tue, 25 Jun 1996 02:00:52 -0700 (PDT)
Received: (from dgy@localhost)
    by seagull.rtd.com (8.7.5/1.2) id CAA01330;
    Tue, 25 Jun 1996 02:00:05 -0700 (MST)
From: Don Yuniskis
Message-Id: <199606250900.CAA01330@seagull.rtd.com>
Subject: Re: Memory tests ...
To: msmith@atrad.adelaide.edu.au (Michael Smith)
Date: Tue, 25 Jun 1996 02:00:04 -0700 (MST)
Cc: hua@xenon.chromatic.com, dgy@rtd.com, jsigmon@www.hsc.wvu.edu, hackers@freebsd.org
In-Reply-To: <199606250112.KAA24941@genesis.atrad.adelaide.edu.au> from "Michael Smith" at Jun 25, 96 10:42:39 am
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

> Ernest Hua stands accused of saying:
> >
> > I would prefer a thorough set of tests such as some reasonably optimized
> > 1's and 0's test.  I'm not familiar with algorithms for testing "flaky"
> > versus "stuck".
>
> The problem is that no program can generate sequential accesses _fast_
> enough, and has no way of watching the critical timing parameters that
> will help you decide _how_ marginal a given memory is.

Agreed.

> For this you need a _real_ memory tester, and because measuring nanoseconds
> accurately is difficult, these cost _lots_ of money.
>
> So if you just want a 'does it work, yes/no' answer, put the memory into
> your favorite high-performance OS (I prefer FreeBSD, OS/2 and Novell are
> also popular), and thrash it mercilessly for a few days.

I don't see the value of this -- except for the fact that it's "easy"
to invoke from a shell :>  If the system seizes up, it just tells you
that something died (most probably memory).  You are counting on the
failure to happen in such a way as to corrupt the state of the
processor irrevocably.

Exhaustive tests in *software* are usually ridiculous -- they take
forever to execute and rarely detect anything but the grossest errors
(i.e., stuck-at faults and decoding errors).  These can be found
through other (less painful) techniques.

I find that using an LFSR with a long, "relatively prime" period to
alternately fill and check memory contents works great as a quick
POST-style check.  It can also be used for more thorough testing
(e.g., to catch thermal problems) if set in an endless loop.  And,
unlike just running a system hard for a while, it (usually) survives
a memory failure and can report on the failure.

Of course, this *doesn't* test other hardware that may be marginal
(e.g., DMACs).

My two cents...
--don
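
A minimal sketch of the LFSR fill-and-check idea described above, in C.
The message gives no width, taps, or code, so everything here is an
assumption for illustration: the function names are hypothetical, and
the generator is a 16-bit Galois LFSR with the commonly cited tap mask
0xB400, whose period is 65535.  That period is odd, so it shares no
factor with a power-of-two memory size; this is one reading of the
"relatively prime" remark, not necessarily the one intended.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One step of a 16-bit Galois LFSR (tap mask 0xB400, assumed here).
 * Any nonzero state cycles through all 65535 nonzero values. */
static uint16_t lfsr_next(uint16_t s)
{
    uint16_t lsb = s & 1u;

    s >>= 1;
    if (lsb)
        s ^= 0xB400u;
    return s;
}

/* Fill n words of buf with the LFSR sequence starting at seed. */
static void lfsr_fill(volatile uint16_t *buf, size_t n, uint16_t seed)
{
    uint16_t s = seed;
    size_t i;

    assert(seed != 0);          /* a zero state never changes */
    for (i = 0; i < n; i++) {
        buf[i] = s;
        s = lfsr_next(s);
    }
}

/* Regenerate the same sequence and compare it against buf.
 * Returns the index of the first mismatching word, or n if all match. */
static size_t lfsr_check(const volatile uint16_t *buf, size_t n,
                         uint16_t seed)
{
    uint16_t s = seed;
    size_t i;

    assert(seed != 0);
    for (i = 0; i < n; i++) {
        if (buf[i] != s)
            return i;           /* report where the failure was seen */
        s = lfsr_next(s);
    }
    return n;
}
```

A caller would run lfsr_fill() and then lfsr_check() over the region
under test, changing the seed on each pass so that every cell sees a
different pattern, and loop indefinitely for a soak test.  Because the
checker returns an index instead of crashing, a failure can be
reported.  On real hardware the region would also have to exclude the
memory the test itself runs from, and caches would need to be dealt
with; neither is shown here.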