From owner-freebsd-fs  Tue Oct 13 00:06:13 1998
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id AAA11466
          for freebsd-fs-outgoing; Tue, 13 Oct 1998 00:06:13 -0700 (PDT)
          (envelope-from owner-freebsd-fs@FreeBSD.ORG)
Received: from pluto.plutotech.com (mail.plutotech.com [206.168.67.137])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id AAA11449;
          Tue, 13 Oct 1998 00:06:09 -0700 (PDT)
          (envelope-from gibbs@plutotech.com)
Received: from narnia.plutotech.com (narnia.plutotech.com [206.168.67.130])
	by pluto.plutotech.com (8.8.7/8.8.5) with ESMTP id BAA12205;
	Tue, 13 Oct 1998 01:05:50 -0600 (MDT)
Message-Id: <199810130705.BAA12205@pluto.plutotech.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: Terry Lambert <tlambert@primenet.com>
cc: gibbs@plutotech.com (Justin T. Gibbs), Don.Lewis@tsc.tdk.com,
        julian@whistle.com, freebsd-fs@FreeBSD.ORG, freebsd-scsi@FreeBSD.ORG
Subject: Re: filesystem safety and SCSI disk write caching 
In-reply-to: Your message of "Mon, 12 Oct 1998 22:58:15 -0000."
             <199810122258.PAA11377@usr02.primenet.com> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Tue, 13 Oct 1998 00:59:05 -0600
From: "Justin T. Gibbs" <gibbs@plutotech.com>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

>> >} 2) Use a drive with non-bogus firmware.  Recent Seagate and IBM
>> >} drives should work just fine.  I haven't validated any Quantum
>> >} drives in this regard yet.
>> >
>> >But how can I tell if the firmware is non-bogus?
>> 
>> Ask Terry since he has stated that he 'doesn't have any drives with
>> non-bogus firmware'.
>
>A)	Run soft updates
>B)	Press "reset" occasionally
>C)	Note any anomalies in the resulting fsck when the machine
>	comes back up
>D)	if count < 200, goto B
>E)	if # of anomalies > 0, print "bad firmware".

You're missing a large step here.  You can't prove that the 'anomaly'
is related to the drive firmware without a trace of all transactions
on the SCSI bus.  It could well be a missing dependency in the soft
update code.  I'd be more than happy to reproduce your failure scenario
while recording a SCSI bus trace so that the fault is easy to interpret.
Just send me any *modern* drive that you think fails.
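
As a rough sketch only, the quoted procedure with the missing step made
explicit might look like the following.  The reset_and_fsck hook, the
function names, and the verdict strings are all hypothetical; nothing
here talks to real hardware.  The hook stands in for "press reset, let
the machine come back up, collect whatever fsck complains about".

```python
from typing import Callable, Dict, List


def run_reset_trials(reset_and_fsck: Callable[[int], List[str]],
                     trials: int = 200) -> Dict[int, List[str]]:
    """Repeat the reset/fsck cycle and record anomalies per trial.

    reset_and_fsck is a hypothetical hook: it resets the machine for
    the given trial number and returns the list of fsck complaints.
    """
    anomalies: Dict[int, List[str]] = {}
    for trial in range(trials):
        found = reset_and_fsck(trial)
        if found:
            anomalies[trial] = found
    return anomalies


def verdict(anomalies: Dict[int, List[str]], have_bus_trace: bool) -> str:
    """Report what the trials show, without attributing a cause."""
    if not anomalies:
        return "no anomalies observed"
    if not have_bus_trace:
        # Without a SCSI bus trace the fault could be drive firmware
        # or a missing dependency in the soft update code.
        return "anomalies observed, cause undetermined"
    return "anomalies observed, compare against bus trace"
```

The point of the second function is that a non-zero anomaly count alone
never yields "bad firmware"; it only tells you which trials to examine
against the trace.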

You should also ensure that your reset button does not cause any power
spikes on the drive power lines.  That would be cheating.

>It's very hard to do this in software, without providing a mechanism
>to actually break into the latency link between the drive reporting
>a write cached operation has been written, and the actual writing.

If you can cause this failure by hitting your reset button, I should be
able to cause it with a paper-clip, provided the reset condition (caused
by the SCSI card BIOS in the reset button case) is the event that causes
cache corruption.  Both are non-deterministic methods of error injection.

>Such a latency link only exists on drives which Justin has identified
>as having broken firmware due to the behaviour reported by Don Lewis.

I'm still unclear as to whether Don was turning off power or hitting what I
consider the reset button.  His comment about UPS use makes me think he
was testing power outage scenarios.

>I would be much more interested in knowing what drives and firmware
>revisions of those drives Justin has, since both mine and Don Lewis's
>are demonstrably broken.

Since you were able to test 4 drives so quickly, I'd love to see
well-documented information on exactly how the file system was
inconsistent in the failure cases.

--
Justin


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message