Date:      Fri, 18 Sep 1998 12:58:53 -0600 (MDT)
From:      "Justin T. Gibbs" <gibbs@narnia.plutotech.com>
To:        mjacob@feral.com
Cc:        scsi@FreeBSD.ORG
Subject:   Re: losing drives in a ccd array
Message-ID:  <199809181858.MAA15386@narnia.plutotech.com>
In-Reply-To: <Pine.LNX.4.02.9809181001120.4991-100000@feral-gw>

In article <Pine.LNX.4.02.9809181001120.4991-100000@feral-gw> you wrote:
> 
> Sounds like a power failure for that drive occurred.
> 
> Justin, Ken- aren't you retrying disk operations?

We are.  Actually, a couple of things happened here, and perhaps it shows
that CAM is trying to be just "too smart" in situations like this:

1) One of the drives in this system experienced a power spike or
   lossage.  It may be that your power-supply for this enclosure
   is not quite up to snuff and craps out if you have enough seek
   activity on all the drives in the enclosure, but this is a secondary
   issue.

2) The aic7xxx driver saw the bus go free unexpectedly since it
   was talking to this device when it lost power.

3) The peripheral driver got an "unexpected bus free" error code,
   which is a retryable offense.

4) The transaction was retried, but we got a selection timeout.

5) The peripheral driver interpreted the selection timeout as
   "Crap, the device has gone away".

6) We invalidated the pack, returned EIO for all pending buffers,
   and set up for device removal.  There are some bugs here caused
   by missing support for the XPT_ABORT_CCB function code and by the
   fact that we don't necessarily wait for all CCBs to be returned to
   the peripheral driver before device removal, but that isn't too
   important for this discussion.

7) One of the last I/Os did finally get through and we saw the unexpected
   UA (Unit Attention).
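
Steps 2 through 6 amount to a small decision table, which can be
sketched roughly as below.  This is only an illustration of the policy
as described above, not the actual CAM code: the enum values and the
function name are hypothetical, and returning EIO once the retries for
a bus free run out is an assumption that the steps above do not cover.

```c
/* Hypothetical status codes; these are NOT the real CAM definitions. */
enum xfer_status {
    XFER_OK,
    XFER_UNEXPECTED_BUS_FREE,
    XFER_SELECTION_TIMEOUT
};

/* Hypothetical actions the peripheral driver can take. */
enum recovery_action {
    ACT_COMPLETE,           /* transaction finished normally */
    ACT_RETRY,              /* requeue the transaction */
    ACT_INVALIDATE_DEVICE,  /* invalidate the pack, EIO pending buffers */
    ACT_FAIL_EIO            /* fail just this transaction */
};

/*
 * Policy as described in steps 3 through 6: an unexpected bus free is
 * retryable, but a single selection timeout is taken to mean that the
 * device has gone away.
 */
enum recovery_action
recover_as_described(enum xfer_status status, int retries_left)
{
    switch (status) {
    case XFER_OK:
        return ACT_COMPLETE;
    case XFER_UNEXPECTED_BUS_FREE:
        /* Assumption: give up with EIO once retries are exhausted. */
        return retries_left > 0 ? ACT_RETRY : ACT_FAIL_EIO;
    case XFER_SELECTION_TIMEOUT:
        /* No retry: one timeout is enough to drop the device. */
        return ACT_INVALIDATE_DEVICE;
    }
    return ACT_FAIL_EIO;
}
```

With this table, the bus free in step 2 yields a retry, and the
selection timeout on that retry (step 4) invalidates the device no
matter how many retries remain.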
   
There are lots of problems (most already on my whiteboard): a single
selection timeout shouldn't nuke a device; our policy of treating
unexpected UAs as fatal may be too extreme, considering that we don't
yet have even a reasonable scheme in place to determine whether the
device has changed out from under us; and XPT_ABORT_CCB really needs
to be implemented before 3.0.

We may want to completely disable the "selection timeout invalidates
a device" policy before 3.0, until we have had more discussion on how to
deal with these kinds of errors.  Luckily, all of this is really easy to
adjust in the CAM error recovery code.
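
As one possible middle ground between the current behavior and
disabling invalidation entirely, a device could be dropped only after
several selection timeouts in a row.  The sketch below is purely
illustrative: the structure, the function names, and the limit of 3
are all assumptions, and none of it is existing CAM code.

```c
#include <stdbool.h>

/* Assumed threshold; the right value is an open question. */
#define SEL_TIMEOUT_LIMIT 3

/* Hypothetical per-device recovery state. */
struct dev_recovery_state {
    int  consecutive_sel_timeouts;
    bool invalidated;
};

/*
 * Record a selection timeout.  Returns true if the transaction should
 * be retried, or false if the device has now been invalidated.
 */
bool
note_selection_timeout(struct dev_recovery_state *dev)
{
    dev->consecutive_sel_timeouts++;
    if (dev->consecutive_sel_timeouts >= SEL_TIMEOUT_LIMIT) {
        dev->invalidated = true;
        return false;
    }
    return true;
}

/* Any successful transaction shows the device is still there. */
void
note_success(struct dev_recovery_state *dev)
{
    dev->consecutive_sel_timeouts = 0;
}
```

Under this scheme a drive that briefly loses power and then answers
again would be retried rather than removed, while a drive that stays
silent is still invalidated after the limit is reached.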

--
Justin
