From owner-freebsd-hackers  Tue Jun 25 02:00:56 1996
Return-Path: owner-hackers
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.5/8.7.3) id CAA04509
          for hackers-outgoing; Tue, 25 Jun 1996 02:00:56 -0700 (PDT)
Received: from seagull.rtd.com (root@seagull.rtd.com [198.102.68.2])
          by freefall.freebsd.org (8.7.5/8.7.3) with ESMTP id CAA04500
          for <hackers@freebsd.org>; Tue, 25 Jun 1996 02:00:52 -0700 (PDT)
Received: (from dgy@localhost) by seagull.rtd.com (8.7.5/1.2) id CAA01330; Tue, 25 Jun 1996 02:00:05 -0700 (MST)
From: Don Yuniskis <dgy@rtd.com>
Message-Id: <199606250900.CAA01330@seagull.rtd.com>
Subject: Re: Memory tests ...
To: msmith@atrad.adelaide.edu.au (Michael Smith)
Date: Tue, 25 Jun 1996 02:00:04 -0700 (MST)
Cc: hua@xenon.chromatic.com, dgy@rtd.com, jsigmon@www.hsc.wvu.edu,
        hackers@freebsd.org
In-Reply-To: <199606250112.KAA24941@genesis.atrad.adelaide.edu.au> from "Michael Smith" at Jun 25, 96 10:42:39 am
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

> Ernest Hua stands accused of saying:
> > 
> > I would prefer a thorough set of tests such as some reasonably optimized
> > 1's and 0's test.  I'm not familiar with algorithms for testing "flaky"
> > versus "stuck".
> 
> The problem is that no program can generate sequential accesses _fast_ 
> enough, and has no way of watching the critical timing parameters that
> will help you decide _how_ marginal a given memory is.

Agreed.
 
> For this you need a _real_ memory tester, and because measuring nanoseconds
> accurately is difficult, these cost _lots_ of money.
> 
> So if you just want a 'does it work, yes/no' answer, put the memory into
> your favorite high-performance OS (I prefer FreeBSD, OS/2 and Novell are 
> also popular), and thrash it mercilessly for a few days.

I don't see the value of this -- except for the fact that it's "easy"
to invoke from a shell  :>   If the system seizes up, it just tells
you something died (most probably memory).  You are counting on the
failure happening in such a way that it irrevocably corrupts the
state of the processor; a failure that only corrupts data quietly
goes unnoticed.

Exhaustive tests in *software* are usually ridiculous -- they take
forever to execute and rarely detect anything but the grossest
errors (e.g. stuck-at faults and address decoding errors).  These
can be found through other (less painful) techniques.

I find that using an LFSR (linear feedback shift register) with a
long, "relatively prime" period to alternately fill and check memory
contents works well as a quick POST-style check.  It can also be
used for more thorough testing (e.g. to catch thermal problems) if
set in an endless loop.  And, unlike just running a system hard for
a while, it (usually) survives a memory failure and can report on
the failure.
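
A minimal sketch of the fill-and-check idea, for illustration only.
The 16-bit Galois LFSR with tap mask 0xB400 is one well-known
maximal-length choice (period 65535); the function names, the byte-wide
access and the seed are assumptions of this sketch, not anything
specified above.  Since 65535 = 3 * 5 * 17 * 257 is odd, the pattern
never lines up with a power-of-two address boundary, which is one
plausible reading of "relatively prime".

```c
#include <stddef.h>
#include <stdint.h>

/* Advance a 16-bit Galois LFSR by one step (taps 0xB400, period 65535).
 * The state must never be 0, or the sequence stays at 0 forever. */
static uint16_t lfsr_next(uint16_t s)
{
    uint16_t lsb = s & 1u;

    s >>= 1;
    if (lsb)
        s ^= 0xB400u;
    return s;
}

/* Fill len bytes at mem with the low byte of successive LFSR states. */
static void lfsr_fill(volatile uint8_t *mem, size_t len, uint16_t seed)
{
    uint16_t s = seed;

    for (size_t i = 0; i < len; i++) {
        mem[i] = (uint8_t)s;
        s = lfsr_next(s);
    }
}

/* Regenerate the same sequence and compare it against memory.
 * Returns the offset of the first mismatch, or len if none was found. */
static size_t lfsr_check(const volatile uint8_t *mem, size_t len,
                         uint16_t seed)
{
    uint16_t s = seed;

    for (size_t i = 0; i < len; i++) {
        if (mem[i] != (uint8_t)s)
            return i;
        s = lfsr_next(s);
    }
    return len;
}
```

For the endless-loop variant one would presumably call lfsr_fill()
and lfsr_check() repeatedly, carrying the seed forward between passes
so each location sees different data every time, and report the
offset whenever lfsr_check() returns something other than len.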

Of course, this *doesn't* test other hardware that may be marginal
(e.g. DMACs).

My two cents...
--don