Date:      Tue, 13 Oct 1998 23:58:59 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        gibbs@plutotech.com (Justin T. Gibbs)
Cc:        tlambert@primenet.com, gibbs@plutotech.com, Don.Lewis@tsc.tdk.com, julian@whistle.com, freebsd-fs@FreeBSD.ORG, freebsd-scsi@FreeBSD.ORG
Subject:   Re: filesystem safety and SCSI disk write caching
Message-ID:  <199810132358.QAA18137@usr08.primenet.com>
In-Reply-To: <199810130705.BAA12205@pluto.plutotech.com> from "Justin T. Gibbs" at Oct 13, 98 00:59:05 am

> >> Ask Terry since he has stated that he 'doesn't have any drives with
> >> non-bogus firmware'.
> >
> >A)	Run soft updates
> >B)	Press "reset" occasionally
> >C)	Note any anomalies in the resulting fsck when the machine
> >	comes back up
> >D)	if count < 200, goto B
> >E)	if # of anomalies > 0, print "bad firmware".
> 
> You're missing a large step here.  You can't prove that the 'anomaly'
> is related to the drive firmware without a trace of all transactions
> on the SCSI bus.  It could well be a missing dependency in the soft
> update code.

If I turn off write caching on the drive and repeat the test, and
the evaluation (E) then yields "# of anomalies == 0" where it yielded
"> 0" with write caching enabled, then I can say with high confidence
that it's the write caching.

This is the experiment Don Lewis ran.
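
A minimal sketch of that comparison, in Python. The names run_trials,
verdict and reset_and_fsck are hypothetical; reset_and_fsck stands in
for "press reset, let fsck run, count the anomalies it reports", which
has to be done by hand or by hardware and is not something this code
can do itself.

```python
def run_trials(reset_and_fsck, trials=200):
    """Steps B-D: reset the machine `trials` times and total the fsck anomalies.

    `reset_and_fsck` is a hypothetical callable returning the number of
    anomalies fsck reported after one reset.
    """
    return sum(reset_and_fsck() for _ in range(trials))


def verdict(anomalies_cache_on, anomalies_cache_off):
    """Step E, extended with the control run (write caching disabled)."""
    if anomalies_cache_on > 0 and anomalies_cache_off == 0:
        # Only the write-cache setting changed between the two runs.
        return "write caching implicated"
    if anomalies_cache_off > 0:
        # Anomalies without write caching point elsewhere, for example
        # at a missing dependency in the soft updates code.
        return "inconclusive"
    return "no anomalies observed"
```

The control run is what separates the drive from the soft updates code:
a missing dependency would be expected to show up with write caching
off as well.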


> I'd be more than happy to reproduce your failure scenario
> while recording a SCSI bus trace so that the fault is easy to interpret.
> Just send me any *modern* drive that you think fails.

Sure; just define "modern" for me, since my personal definition is
"not IDE".


> You should also ensure that your reset button does not cause any power
> spikes on the drive power lines.  That would be cheating.

It doesn't, since "# of anomalies == 0" with write caching disabled.


> >It's very hard to do this in software, without providing a mechanism
> >to actually break into the latency link between the drive reporting
> >a write cached operation has been written, and the actual writing.
> 
> If you can cause this failure to occur by hitting your reset button, I
> should be able to cause it to occur by using a paper-clip if the reset
> condition (caused by the SCSI card BIOS in the reset button case) is the
> event that causes cache corruption.  Both are non-deterministic methods of
> error injection.

I'm not very confident that this will break in at the fragile point
in the transaction.

> >Such a latency link only exists on drives which Justin has identified
> >as having broken firmware due to the behaviour reported by Don Lewis.
> 
> I'm still unclear as to whether Don was turning off power or hitting what I
> consider the reset button.  His comment about UPS use makes me think he
> was testing power outage scenarios.

Well, I know that this might sound insane, but we could ask Don, and
I could get out of the middle of this whole thing... ;-).


> >I would be much more interested in knowing what drives and firmware
> >revisions of those drives Justin has, since both mine and Don Lewis's
> >are demonstrably broken.
> 
> Since you were able to test 4 drives so quickly, I'd love to see well
> documented information on exactly how the file system was inconsistent
> in the failure cases.

There were directory dependencies which were committed out of order
(the modified fsck reports these as soft dependency errors...).
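
To illustrate what "committed out of order" means here, a small Python
sketch. It is not how the modified fsck works; the function name and
the string labels are made up for the example. It assumes a dependency
is a pair (first, second) meaning "first must be on disk before
second", such as an inode being initialized before a directory entry
that points at it.

```python
def out_of_order(dependencies, commit_order):
    """Return the (first, second) pairs where `second` reached the disk
    without `first` having reached it earlier.

    dependencies: iterable of (first, second) pairs (hypothetical labels).
    commit_order: writes in the order they actually reached the platter.
    """
    position = {write: i for i, write in enumerate(commit_order)}
    violations = []
    for first, second in dependencies:
        if second not in position:
            continue  # the dependent write never landed; nothing to violate
        if first not in position or position[first] > position[second]:
            violations.append((first, second))
    return violations


deps = [("init inode 12", "dirent foo -> inode 12")]
# The directory entry landed, the inode did not: a dependency violation.
print(out_of_order(deps, ["dirent foo -> inode 12"]))
```

A drive that acknowledges a write while it is still in cache, and then
loses it on reset, can produce exactly this pattern even when the host
issued the writes in the correct order.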


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.



