Date: Tue, 13 Oct 1998 23:58:59 +0000 (GMT)
From: Terry Lambert <tlambert@primenet.com>
To: gibbs@plutotech.com (Justin T. Gibbs)
Cc: tlambert@primenet.com, gibbs@plutotech.com, Don.Lewis@tsc.tdk.com, julian@whistle.com, freebsd-fs@FreeBSD.ORG, freebsd-scsi@FreeBSD.ORG
Subject: Re: filesystem safety and SCSI disk write caching
Message-ID: <199810132358.QAA18137@usr08.primenet.com>
In-Reply-To: <199810130705.BAA12205@pluto.plutotech.com> from "Justin T. Gibbs" at Oct 13, 98 00:59:05 am

> >> Ask Terry since he has stated that he 'doesn't have any drives with
> >> non-bogus firmware'.
> >
> >A) Run soft updates
> >B) Press "reset" occasionally
> >C) Note any anomalies in the resulting fsck when the machine
> >   comes back up
> >D) if count < 200, goto B
> >E) if # of anomalies > 0, print "bad firmware".
>
> You're missing a large step here.  You can't prove that the 'anomaly'
> is related to the drive firmware without a trace of all transactions
> on the SCSI bus.  It could well be a missing dependency in the soft
> update code.

If I turn off write caching on the drive, and repeat the test, and
the evaluation (E) results in a "# of anomalies == 0", where with
write caching enabled, the number was > 0, then I can say with
high confidence that it's the write caching.

This is the experiment Don Lewis ran.

> I'd be more than happy to reproduce your failure scenario
> while recording a SCSI bus trace so that the fault is easy to interpret.
> Just send me any *modern* drive that you think fails.

Sure; just define "modern" for me, since my personal definition is
"not IDE".

> You should also ensure that your reset button does not cause any power
> spikes on the drive power lines.  That would be cheating.

It doesn't, since "# of anomalies == 0" with write caching disabled.

> >It's very hard to do this in software, without providing a mechanism
> >to actually break into the latency link between the drive reporting
> >a write cached operation has been written, and the actual writing.
>
> If you can cause this failure to occur by hitting your reset button, I
> should be able to cause it to occur by using a paper-clip if the reset
> condition (caused by the SCSI card BIOS in the reset button case) is the
> event that causes cache corruption.  Both are non-deterministic methods of
> error injection.

I'm not very confident that this will break in at the fragile point
in the transaction.

> >Such a latency link only exists on drives which Justin has identified
> >as having broken firmware due to the behaviour reported by Don Lewis.
>
> I'm still unclear as to whether Don was turning off power or hitting what I
> consider the reset button.  His comment about UPSes use makes me think he
> was testing power outage scenarios.

Well, I know that this might sound insane, but we could ask Don,
and I could get out of the middle of this whole thing...  ;-).

> >I would be much more interested in knowing what drives and firmware
> >revisions of those drives Justin has, since both mine and Don Lewis's
> >are demonstrably broken.
>
> Since you were able to test 4 drives so quickly, I'd love to see well
> documented information on exactly how the file system was inconsistent
> in the failure cases.

There were directory dependencies which were committed out of order
(the modified fsck reports these as soft dependency errors...).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message
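
The test procedure (steps A through E) and the comparison with write caching
disabled can be sketched as a small harness. This is only an illustration of
the reasoning in the message, not anything from the thread: the names
`count_anomalies`, `verdict`, and `reset_and_fsck` are hypothetical, and
`reset_and_fsck` stands in for the manual step of pressing reset and reading
the fsck output after reboot. No real tool or driver interface is called.

```python
from typing import Callable


def count_anomalies(reset_and_fsck: Callable[[], int], trials: int = 200) -> int:
    """Steps B through D: reset, record fsck anomalies, repeat `trials` times.

    `reset_and_fsck` is a hypothetical callable that performs one reset and
    returns the number of anomalies fsck reported on the next boot.
    """
    total = 0
    for _ in range(trials):
        total += reset_and_fsck()
    return total


def verdict(anomalies_cache_on: int, anomalies_cache_off: int) -> str:
    """Step E, extended with the control run (write caching disabled)."""
    if anomalies_cache_off > 0:
        # Anomalies without write caching point elsewhere, for example at a
        # missing dependency in the soft updates code or a power problem.
        return "inconclusive: anomalies occur with write caching disabled"
    if anomalies_cache_on > 0:
        return "write caching implicated"
    return "no anomalies observed"


if __name__ == "__main__":
    # Stand-ins for real runs; the counts here are made up for illustration.
    with_cache = count_anomalies(lambda: 1, trials=200)
    without_cache = count_anomalies(lambda: 0, trials=200)
    print(verdict(with_cache, without_cache))
```

The control run is what carries the argument: the same reset method is used
in both runs, so an effect that disappears when write caching is turned off
is hard to attribute to the reset itself.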