From owner-freebsd-scsi@FreeBSD.ORG  Sun Jun  1 09:27:23 2003
Return-Path: <owner-freebsd-scsi@FreeBSD.ORG>
Delivered-To: freebsd-scsi@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 564BF37B401
	for <freebsd-scsi@freebsd.org>; Sun,  1 Jun 2003 09:27:23 -0700 (PDT)
Received: from hub.org (hub.org [64.117.225.220])
	by mx1.FreeBSD.org (Postfix) with ESMTP id CED2743FA3
	for <freebsd-scsi@freebsd.org>; Sun,  1 Jun 2003 09:27:22 -0700 (PDT)
	(envelope-from scrappy@hub.org)
Received: from hub.org (unknown [64.117.225.220])
	by hub.org (Postfix) with ESMTP
	id 1B6AA6BA75E; Sun,  1 Jun 2003 13:27:21 -0300 (ADT)
Date: Sun, 1 Jun 2003 13:27:21 -0300 (ADT)
From: "Marc G. Fournier" <scrappy@hub.org>
To: freebsd-scsi@freebsd.org
Message-ID: <20030601131404.P6572@hub.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
cc: Scott Long <scott_long@btc.adaptec.com>
Subject: Critical bug in Adaptec(aac) driver ...
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
	<mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
	<mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 01 Jun 2003 16:27:23 -0000


As those on this list will have seen over the past few months, I have a
server that had (past tense) an Adaptec 2120s controller in her that was
giving a lot of grief ... about 3 weeks ago, the server it was in *really*
blew up ... one drive was reported as down (in a RAID5 array), and when we
tried to bring it back up, a second drive started to "fail" ... I got the
techs to shut her down, and literally rushed to the remote location to see
if there was anything that I could do to at least recover the data ...

When I got there to bring it back up, the server reported that a 3rd drive
had failed ... and within a few hours, a 4th drive failed ... the result
being that we lost all of the data on that server, which turned out to be
quite painful to recover ...

While down there, we replaced the Adaptec controller with an Intel one,
reformatted the exact same drives, in the exact same chassis, and she's
been running fine since ...

On my trip back, I had a chat with a friend who does development work in
the Linux world, and who had had that server before me, and
apparently there is a "known bug" in Linux that he says sounds exactly
like what I experienced (they hit it right in the middle of developing on
that box) and that there are apparently two Linux kernel patches that they
had to apply (after rebuilding from scratch) to correct the problem ...

The way he explained the problem to me, he made it sound like the kernel
driver was interacting with the BIOS and causing some corruption ... not
sure at what level, but since trying to swap in a new controller didn't
restore things, I'm suspecting at the hard drive level ... ?

Scott, while down there, I tried just about everything I could think of
... we replaced the SCSI cable, put the drives/controller into a second
identical chassis, swapped host controller cards themselves (I had brought
spares) ... and that server, as I mentioned, is currently running quite
happily with an Intel host controller in it :(  So, unless the same
"failure" was hitting two host controllers, hardware failure doesn't seem
to have been the cause ...