From: Dave Dolson
To: "'Justin T. Gibbs'", "'freebsd-scsi@freebsd.org'"
Date: Wed, 6 Aug 2003 17:59:46 -0400
Subject: RE: Swapping deadlock due to aic/scsi errors?

> > We have a reproducible bug characterized by the system
> > becoming unresponsive (but db may be entered).
> > System is based on FreeBSD 4.7 (i386)
> > Using the aic79xx scsi driver.
>
> If you are using the stock aic79xx driver found in 4.7, I would
> start by pulling in the latest 4.X aic79xx driver into your system.

Yes, we are using the latest RELENG_4 driver.

> > I would like to add some debugging to detect the lost command
> > and possibly retry it. Can someone suggest where the lost
> > command is supposed to be detected, and where the retry is
> > supposed to occur.
>
> The "lost command" is supposed to be detected by the timeout
> handler in the ahd driver. The timeout handler just forces
> a bus reset which should cause the command to be returned to
> the SCSI layer and then retried.
> It's not clear to me why
> this might not be happening, but the ahd driver was relatively
> green in 4.7 and you may just be tripping over a known (and
> later corrected) bug manifesting itself in an unusual way.

Are you referring to the timeout handler ahd_timeout()?
Are the commands retried from ahd_reset_channel()?
(It looks more like they're simply aborted.)

Aside: Am I correct in believing that ahd_execute_scb() is
called for every command to the drive?

David Dolson
(ddolson@sandvine.com, www.sandvine.com)