Date: Tue, 01 Mar 2005 09:58:38 +0100
From: Uwe Doering <gemini@geminix.org>
To: Don Bowman <don@SANDVINE.com>
Cc: freebsd-stable@freebsd.org
Subject: Re: Adaptec 3210S, 4.9-STABLE, corruption when disk fails
Message-ID: <42242EBE.1000906@geminix.org>
In-Reply-To: <42242238.1060108@geminix.org>
References: <2BCEB9A37A4D354AA276774EE13FB8C224D356@mailserver.sandvine.com> <42242238.1060108@geminix.org>

Uwe Doering wrote:
> Don Bowman wrote:
>>
>> I have merged asr.c from RELENG_4 to get this fix:
>>
>> "Fix a mis-merge in the MFC of rev. 1.64 in rev. 1.3.2.3; the following
>> change wasn't included:
>> - Set the CAM status to CAM_SCSI_STATUS_ERROR rather than CAM_REQ_CMP
>> in case of a CHECK CONDITION."
>>
>> since I guess its conceivable this could cause my problem.
>
> I have to admit that I didn't think of this right away, even though I
> was kind of involved.
>
> Did you merge 1.3.2.3 as well? This actually should have been one MFC
> but it was done in two steps due to an oversight. Please let us know
> whether the fix makes any difference in your case. Its author made it
> for CD burners and wasn't sure whether it has any effect on other
> devices, like da(4).

Memory's coming back piecemeal. ;-) There's another thing you could
try. The 'asr' driver's original timeout is 360 seconds, because its
author knew that this type of controller can be busy for quite some
time. FreeBSD's SCSI driver, however, sets it to its default of 60
seconds, which can be way too short.

What happens when the controller is busy trying to deal with a failed
disk is that the 'asr' driver sends a bus reset to the controller as a
whole, due to the short timeout. You should be able to see this clash
in the controller's event log. My feeling is that this collision of
events may have ill effects, like the data corruption you've observed.

On our machines we've set the SCSI timeout and thereby also the 'asr'
driver's timeout back to the original 360 seconds, in order to leave
the controller alone while it is busy. There is a 'sysctl' variable
for this:

   kern.cam.da.default_timeout=360

Maybe that's the actual fix for your problem.

   Uwe
-- 
Uwe Doering         |  EscapeBox - Managed On-Demand UNIX Servers
gemini@geminix.org  |  http://www.escapebox.net
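
For anyone who wants to make the timeout setting above survive a reboot, here is a minimal sketch and not a tested procedure for 4.9-STABLE. It assumes the usual /etc/sysctl.conf mechanism is used for persistence. The helper name set_da_timeout is made up for illustration. It rewrites a sysctl.conf-style file so that it contains exactly one kern.cam.da.default_timeout line with the requested value.

```shell
#!/bin/sh
# Hypothetical helper (not part of FreeBSD): ensure that the file CONF
# holds exactly one "kern.cam.da.default_timeout=SECONDS" line.
set_da_timeout() {
    conf=$1
    secs=$2
    key=kern.cam.da.default_timeout
    tmp="${conf}.tmp.$$"

    if [ -f "$conf" ]; then
        # Keep every line that does not already set this variable.
        awk -v k="$key" 'index($0, k "=") != 1' "$conf" > "$tmp"
    else
        : > "$tmp"
    fi

    # Append the desired setting and swap the new file into place.
    echo "${key}=${secs}" >> "$tmp"
    mv "$tmp" "$conf"
}

# Example (as root), assuming /etc/sysctl.conf is read at boot:
#   set_da_timeout /etc/sysctl.conf 360
# To change the running system as well (older sysctl(8) versions may
# need the -w flag):
#   sysctl kern.cam.da.default_timeout=360
```

Writing to a temporary file and renaming it keeps the original intact if something fails halfway. Try it on a copy of the file first.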