From owner-freebsd-scsi@FreeBSD.ORG Sun Jun 1 09:27:23 2003
Date: Sun, 1 Jun 2003 13:27:21 -0300 (ADT)
From: "Marc G. Fournier"
To: freebsd-scsi@freebsd.org
Message-ID: <20030601131404.P6572@hub.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
cc: Scott Long
Subject: Critical bug in Adaptec(aac) driver ...
List-Id: SCSI subsystem

As those on this list will have seen over the past few months, I have a server that had (past tense) an Adaptec 2120s controller in her that was giving a lot of grief ... about 3 weeks ago, the server it was in *really* blew up ... one drive (in a RAID5 array) was reported as down, and when we tried to bring it back up, a second drive started to "fail" ... I got the techs to shut her down, and literally rushed to the remote location to see if there was anything that I could do to at least recover the data ...

When I got there to bring it back up, the server reported that a 3rd drive had failed ... and within a few hours, a 4th drive failed ... the result being that we lost all of the data on that server, which turned out to be quite painful to recover ...

While down there, we replaced the Adaptec controller with an Intel one, reformatted the exact same drives, in the exact same chassis, and she's been running fine since ...

On my trip back, I had a chat with a friend who does development work in the Linux world, and who had that server before I did, and apparently there is a "known bug" in Linux that he says sounds exactly like what I experienced (they hit it right in the middle of developing on that box), and there are apparently two Linux kernel patches that they had to apply (after rebuilding from scratch) to correct the problem ...

The way he explained the problem to me, he made it sound like the kernel driver was interacting with the BIOS and causing some corruption ... not sure at what level, but since swapping in a new controller didn't restore things, I'm suspecting at the hard drive level ... ?

Scott, while down there, I tried just about everything I could think of ... we replaced the SCSI cable, put the drives/controller into a second identical chassis, swapped the host controller cards themselves (I had brought spares) ... and that server, as I mentioned, is currently running quite happily with an Intel host controller in it :( So, unless the same "failure" was hitting two host controllers, hardware failure doesn't seem to have been the cause ...