From: Uwe Doering
Date: Tue, 01 Mar 2005 09:58:38 +0100
To: Don Bowman
cc: freebsd-stable@freebsd.org
Subject: Re: Adaptec 3210S, 4.9-STABLE, corruption when disk fails

Uwe Doering wrote:
> Don Bowman wrote:
>>
>> I have merged asr.c from RELENG_4 to get this fix:
>>
>> "Fix a mis-merge in the MFC of rev. 1.64 in rev. 1.3.2.3; the following
>> change wasn't included:
>> - Set the CAM status to CAM_SCSI_STATUS_ERROR rather than CAM_REQ_CMP
>> in case of a CHECK CONDITION."
>>
>> since I guess its conceivable this could cause my problem.
>
> I have to admit that I didn't think of this right away, even though I
> was kind of involved.
>
> Did you merge 1.3.2.3 as well? This actually should have been one MFC
> but it was done in two steps due to an oversight. Please let us know
> whether the fix makes any difference in your case. Its author made it
> for CD burners and wasn't sure whether it has any effect on other
> devices, like da(4).

Memory's coming back piecemeal. ;-) There's another thing you could
try. The 'asr' driver's original timeout is 360 seconds, because its
author knew that this type of controller can be busy for quite some
time. FreeBSD's SCSI driver, however, sets it to its default of 60
seconds, which can be way too short.

What happens when the controller is busy trying to deal with a failed
disk is that the 'asr' driver sends a bus reset to the controller as a
whole, due to the short timeout. You should be able to see this clash
in the controller's event log. My feeling is that this collision of
events may have ill effects, like the data corruption you've observed.

On our machines we've set the SCSI timeout and thereby also the 'asr'
driver's timeout back to the original 360 seconds, in order to leave
the controller alone while it is busy. There is a 'sysctl' variable for
this:

   kern.cam.da.default_timeout=360

Maybe that's the actual fix for your problem.

   Uwe
-- 
Uwe Doering         |  EscapeBox - Managed On-Demand UNIX Servers
gemini@geminix.org  |  http://www.escapebox.net