From owner-freebsd-questions@FreeBSD.ORG  Wed Jul 16 13:15:52 2003
Return-Path: <owner-freebsd-questions@FreeBSD.ORG>
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 5FBA937B401
	for <freebsd-questions@freebsd.org>;
	Wed, 16 Jul 2003 13:15:52 -0700 (PDT)
Received: from wrongcrowd.com (dsl231-036-178.sea1.dsl.speakeasy.net
	[216.231.36.178])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 98BE743F93
	for <freebsd-questions@freebsd.org>;
	Wed, 16 Jul 2003 13:15:49 -0700 (PDT)
	(envelope-from matt@wrongcrowd.com)
Received: from [192.168.1.99] (helo=thunderbird.wrongcrowd.com)
	by wrongcrowd.com with esmtp (Exim 3.34 #1)
	id 19csgj-0003ip-00
	for freebsd-questions@freebsd.org; Wed, 16 Jul 2003 13:15:45 -0700
Message-Id: <5.2.0.9.2.20030716124813.035e9e68@192.168.1.1>
X-Sender: matt@192.168.1.1
X-Mailer: QUALCOMM Windows Eudora Version 5.2.0.9
Date: Wed, 16 Jul 2003 13:16:26 -0700
To: freebsd-questions@freebsd.org
From: Matt Staroscik <matt@wrongcrowd.com>
In-Reply-To: <20030716054801.7A07737B404@hub.freebsd.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; format=flowed
Subject: Re: Adaptec 2400A RAID controller corrupting data (4.8)
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: User questions <freebsd-questions.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>,
	<mailto:freebsd-questions-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-questions>
List-Post: <mailto:freebsd-questions@freebsd.org>
List-Help: <mailto:freebsd-questions-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>,
	<mailto:freebsd-questions-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 16 Jul 2003 20:15:52 -0000

I am going to break this saga into 2 posts, one with the ugly details for 
those who are interested, and one short post with the essential questions 
and observations.

>I have that card with 6 60 gig drives and set the box up (freebsd 4.7?)
>and it would run for a day or so and just crash.  I also recall having
>similar panics when moving large amounts of data.  I've given up on
>using the box for any real work so it's just sitting doing nothing
>waiting... hoping for a solution... a glimmer of hope.  ;-)
>
>If you get it working please post.

Here is an update. While I have made progress, I am not 100% hopeful for a 
solution that is stable in the long term.

To make a long story short, I seem to have made the system much more stable 
by turning off soft updates. I was able to do a make buildworld, and then 
delete the contents of /usr/obj. Previously, either of those actions was 
sure to trigger a panic. Before I tried disabling soft updates I also did 
all of the following, some of which I readily admit is voodoo:

- Replaced the cables
- Jumpered the drives to Master instead of Cable Select
- Changed the RAID card's PCI slot
- Wiggled everything

I continued my test by cvsupping my source and doing another make 
buildworld. However, this time it bombed out while working on groff. I 
checked the file in an editor and it didn't look munged, so I am not sure 
if there is an error in the cvs tree, an innocent file transfer error, or a 
sign of deeper issues with my disk subsystem. I am going to thrash the 
machine with more builds but avoid CVS for now.

Unfortunately, turning off soft updates isn't a great solution, if indeed 
it IS a solution, which I am still testing. It definitely makes things 
slower. My buildworld went from about 23 minutes to 34 minutes this way. 
Removing the contents of /usr/obj took about 1 minute, whereas with soft 
updates it took only a few seconds (though it panicked afterwards).
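
For a sense of scale, here is a rough sketch of the buildworld slowdown. It uses only the two approximate timings quoted above; the variable names are mine.

```python
# Approximate buildworld timings from this post, in minutes.
with_softupdates = 23
without_softupdates = 34

# Extra wall-clock time and relative slowdown without soft updates.
extra_minutes = without_softupdates - with_softupdates
slowdown_pct = 100 * extra_minutes / with_softupdates

print(f"{extra_minutes} extra minutes, about {slowdown_pct:.0f}% longer")
```

Since the timings are "about" figures from single runs, treat the percentage as a ballpark and not a benchmark.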

Update: I created a custom kernel config (adding only device pcm and 
removing nothing) and successfully built it. I then installed it, rebooted, 
and tried to make installworld. Bomb city! getty dumped core before I even 
logged in and it got worse from there.

Then I tried deleting /usr/obj and I got the kernel panic again. :)

Observation: My last 2 panics (ffs_blkfree) reported these block numbers: 
54608, 54592. Those are awfully close. Could my trouble stem from a defect 
on a disk?
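
To put a number on "awfully close", here is a minimal sketch of the gap between the two reported blocks. The 1024-byte fragment size is an assumption on my part, not something I have confirmed for this filesystem (dumpfs should show the real value), and I am also assuming the panic reports the block numbers in fragment units.

```python
# Block numbers reported by my last two ffs_blkfree panics.
panic_blocks = [54608, 54592]

# ASSUMPTION: block numbers are counted in 1024-byte fragments.
# The actual fragment size of this filesystem is not verified here.
ASSUMED_FRAG_BYTES = 1024

# Distance between the two blocks, in fragments and in KiB.
gap = abs(panic_blocks[0] - panic_blocks[1])
gap_kib = gap * ASSUMED_FRAG_BYTES // 1024

print(f"gap: {gap} blocks, roughly {gap_kib} KiB under the assumed fragment size")
```

Under that assumption the two panics are only a few kilobytes apart on the logical volume. Because the array sits behind the RAID card, that does not by itself say which physical drive, if any, is at fault.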

Things I have yet to try:

- Removing the Maxtor 160s from the RAID and trying them individually on 
the motherboard controller.
- Applying a hammer to the system

Cheers,
Matt