From: Alexander Motin
Date: Fri, 29 Jul 2011 20:30:28 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: svn commit: r224496 - head/sys/cam
Message-Id: <201107292030.p6TKUSaf064895@svn.freebsd.org>

Author: mav
Date: Fri Jul 29 20:30:28 2011
New Revision: 224496
URL: http://svn.freebsd.org/changeset/base/224496

Log:
  In some cases failed SATA disks may report their presence but not
  respond to any commands. I've found that because of multiple command
  retries, each of which causes a 30s timeout, a bus reset and another
  retry or requeue for many commands, it may take ages to eventually drop
  the failed device. The odd thing is that those retries continue even
  after XPT has considered the device dead and invalidated it.

  This patch makes cam_periph_error() block any command retries after the
  periph has been marked as invalid. With this patch all activity completes
  in 1-2 minutes, just after the several timeouts required to consider the
  device dead. This should make ZFS, gmirror, graid, etc. operation more
  robust.

  Reviewed by:	mjacob@ on scsi@
  Approved by:	re (kib)

Modified:
  head/sys/cam/cam_periph.c

Modified: head/sys/cam/cam_periph.c
==============================================================================
--- head/sys/cam/cam_periph.c	Fri Jul 29 20:24:04 2011	(r224495)
+++ head/sys/cam/cam_periph.c	Fri Jul 29 20:30:28 2011	(r224496)
@@ -1550,7 +1550,8 @@ camperiphscsisenseerror(union ccb *ccb, 
 		 * make sure we actually have retries available.
 		 */
 		if ((err_action & SSQ_DECREMENT_COUNT) != 0) {
-			if (ccb->ccb_h.retry_count > 0)
+			if (ccb->ccb_h.retry_count > 0 &&
+			    (periph->flags & CAM_PERIPH_INVALID) == 0)
 				ccb->ccb_h.retry_count--;
 			else {
 				*action_string = "Retries exhausted";
@@ -1718,6 +1719,7 @@ int
 cam_periph_error(union ccb *ccb, cam_flags camflags,
 		 u_int32_t sense_flags, union ccb *save_ccb)
 {
+	struct cam_periph *periph;
 	const char *action_string;
 	cam_status  status;
 	int	    frozen;
@@ -1725,7 +1727,8 @@ cam_periph_error(union ccb *ccb, cam_fla
 	int         openings;
 	u_int32_t   relsim_flags;
 	u_int32_t   timeout = 0;
-
+
+	periph = xpt_path_periph(ccb->ccb_h.path);
 	action_string = NULL;
 	status = ccb->ccb_h.status;
 	frozen = (status & CAM_DEV_QFRZN) != 0;
@@ -1787,9 +1790,9 @@ cam_periph_error(union ccb *ccb, cam_fla
 			xpt_print(ccb->ccb_h.path, "Data overrun\n");
 			printed++;
 		}
-		error = EIO;	/* we have to kill the command */
 		/* decrement the number of retries */
-		if (ccb->ccb_h.retry_count > 0) {
+		if (ccb->ccb_h.retry_count > 0 &&
+		    (periph->flags & CAM_PERIPH_INVALID) == 0) {
 			ccb->ccb_h.retry_count--;
 			error = ERESTART;
 		} else {
@@ -1808,7 +1811,8 @@ cam_periph_error(union ccb *ccb, cam_fla
 		struct cam_path *newpath;
 
 		if ((camflags & CAM_RETRY_SELTO) != 0) {
-			if (ccb->ccb_h.retry_count > 0) {
+			if (ccb->ccb_h.retry_count > 0 &&
+			    (periph->flags & CAM_PERIPH_INVALID) == 0) {
 
 				ccb->ccb_h.retry_count--;
 				error = ERESTART;
@@ -1826,10 +1830,11 @@ cam_periph_error(union ccb *ccb, cam_fla
 				timeout = periph_selto_delay;
 				break;
 			}
+			action_string = "Retries exhausted";
 		}
 		error = ENXIO;
 		/* Should we do more if we can't create the path?? */
-		if (xpt_create_path(&newpath, xpt_path_periph(ccb->ccb_h.path),
+		if (xpt_create_path(&newpath, periph,
 				    xpt_path_path_id(ccb->ccb_h.path),
 				    xpt_path_target_id(ccb->ccb_h.path),
 				    CAM_LUN_WILDCARD) != CAM_REQ_CMP)
@@ -1874,11 +1879,16 @@ cam_periph_error(union ccb *ccb, cam_fla
 		/* FALLTHROUGH */
 	case CAM_REQUEUE_REQ:
 		/* Unconditional requeue */
-		error = ERESTART;
 		if (bootverbose && printed == 0) {
 			xpt_print(ccb->ccb_h.path, "Request requeued\n");
 			printed++;
 		}
+		if ((periph->flags & CAM_PERIPH_INVALID) == 0)
+			error = ERESTART;
+		else {
+			action_string = "Retries exhausted";
+			error = EIO;
+		}
 		break;
 	case CAM_RESRC_UNAVAIL:
 		/* Wait a bit for the resource shortage to abate. */
@@ -1893,7 +1903,8 @@ cam_periph_error(union ccb *ccb, cam_fla
 		/* FALLTHROUGH */
 	default:
 		/* decrement the number of retries */
-		if (ccb->ccb_h.retry_count > 0) {
+		if (ccb->ccb_h.retry_count > 0 &&
+		    (periph->flags & CAM_PERIPH_INVALID) == 0) {
 			ccb->ccb_h.retry_count--;
 			error = ERESTART;
 			if (bootverbose && printed == 0) {
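
The same guard recurs in every hunk above: a command is retried only if it still has retries left and its periph has not been invalidated. As a rough illustration outside the kernel, that decision might be sketched as follows. All `sketch_`/`SKETCH_` names, the struct layouts, and the numeric constants are hypothetical stand-ins for this example, not the definitions from the CAM headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel constants; values are illustrative only. */
#define SKETCH_PERIPH_INVALID	0x08u
#define SKETCH_ERESTART		(-1)
#define SKETCH_EIO		5

/* Simplified, hypothetical versions of the periph and CCB header. */
struct sketch_periph {
	unsigned int flags;
};

struct sketch_ccb {
	unsigned int retry_count;
};

/*
 * Decide whether a failed command may be retried.
 * Consumes one retry and returns SKETCH_ERESTART only when retries remain
 * and the periph is still valid; otherwise reports exhaustion and fails
 * the command with SKETCH_EIO, leaving retry_count untouched.
 */
static int
sketch_retry_or_fail(struct sketch_ccb *ccb, const struct sketch_periph *periph,
    const char **action_string)
{
	assert(ccb != NULL && periph != NULL && action_string != NULL);

	if (ccb->retry_count > 0 &&
	    (periph->flags & SKETCH_PERIPH_INVALID) == 0) {
		ccb->retry_count--;
		return (SKETCH_ERESTART);
	}
	*action_string = "Retries exhausted";
	return (SKETCH_EIO);
}
```

With a valid periph and `retry_count` of 2, the first call returns `SKETCH_ERESTART` and leaves one retry. Once the invalid flag is set, the next call fails immediately with `SKETCH_EIO` even though a retry remains, which is the behavior the log describes.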