Date: Fri, 18 Sep 1998 12:58:53 -0600 (MDT)
From: "Justin T. Gibbs"
Message-Id: <199809181858.MAA15386@narnia.plutotech.com>
To: mjacob@feral.com
cc: scsi@FreeBSD.ORG
Subject: Re: losing drives in a ccd array

In article you wrote:
>
> Sounds like a power failure for that drive occurred.
>
> Justin, Ken- aren't you retrying disk operations?

We are.  Actually, a couple of things happened here, and perhaps it
shows that CAM is trying to be just "too smart" in situations like
this:

1) One of the drives in this system experienced a power spike or
   lossage.  It may be that the power supply for this enclosure is
   not quite up to snuff and craps out if you have enough seek
   activity on all the drives in the enclosure, but this is a
   secondary issue.

2) The aic7xxx driver saw the bus go free unexpectedly, since it was
   talking to this device when it lost power.

3) The peripheral driver got an "unexpected bus free" error code,
   which is a retryable offense.

4) The transaction was retried, but we got a selection timeout.

5) The peripheral driver interpreted the selection timeout as,
   "Crap, the device has gone away".

6) We invalidated the pack, returned EIO for all pending buffers, and
   set up for device removal.
   There are some bugs here, caused by missing support for the
   XPT_ABORT_CCB function code and by the fact that we don't
   necessarily wait for all CCBs to be returned to the peripheral
   driver before device removal, but that isn't too important for
   this discussion.

7) One of the last I/Os did finally get through, and we see the
   unexpected UA (Unit Attention).

There are lots of problems (most already on my whiteboard):

- A single selection timeout shouldn't nuke a device.

- Our policy of treating unexpected UAs as fatal may be too extreme,
  considering that we don't yet have even a reasonable scheme in
  place to determine whether the device has changed out from under
  us.

- XPT_ABORT_CCB really needs to be implemented before 3.0.

We may want to completely disable the "selection timeout invalidates
a device" policy before 3.0, until we have had more discussion on how
to deal with these kinds of errors.  Luckily, all of this is really
easy to adjust in the CAM error recovery code.

--
Justin
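
As a rough illustration of the point above that a single selection
timeout shouldn't invalidate a device, here is a minimal sketch of a
recovery policy that only gives up after several consecutive selection
timeouts.  This is not the CAM code: the enum names, the dev_state
structure, recovery_decide(), and the threshold of 3 are all
hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status codes for this sketch; not the real CAM ones. */
enum xfer_status {
	XFER_OK,
	XFER_UNEXPECTED_BUS_FREE,
	XFER_SELECTION_TIMEOUT
};

enum recovery_action {
	ACTION_COMPLETE,	/* hand the finished I/O back */
	ACTION_RETRY,		/* requeue the transaction */
	ACTION_INVALIDATE	/* treat the device as gone */
};

/* Assumed threshold; the post does not propose a specific number. */
#define SEL_TIMEOUT_LIMIT 3

/* Hypothetical per-device recovery state. */
struct dev_state {
	int	consecutive_sel_timeouts;
	bool	invalidated;
};

enum recovery_action
recovery_decide(struct dev_state *dev, enum xfer_status status)
{
	assert(dev != NULL);

	switch (status) {
	case XFER_OK:
		/* The device answered, so the timeout streak is over. */
		dev->consecutive_sel_timeouts = 0;
		return (ACTION_COMPLETE);
	case XFER_UNEXPECTED_BUS_FREE:
		/*
		 * Retryable, as in step 3.  The device was selected
		 * before the bus went free, so reset the streak too.
		 */
		dev->consecutive_sel_timeouts = 0;
		return (ACTION_RETRY);
	case XFER_SELECTION_TIMEOUT:
		/* Only invalidate after repeated, back-to-back timeouts. */
		if (++dev->consecutive_sel_timeouts >= SEL_TIMEOUT_LIMIT) {
			dev->invalidated = true;
			return (ACTION_INVALIDATE);
		}
		return (ACTION_RETRY);
	}
	return (ACTION_RETRY);
}
```

With this policy, the sequence from steps 2 through 4 (an unexpected
bus free followed by one selection timeout) yields two retries and
leaves the device valid.  Whether a counter like this is the right
fix, or what the limit should be, is exactly the kind of thing that
still needs discussion.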