From owner-freebsd-stable  Tue Sep 19  2:26:45 2000
Delivered-To: freebsd-stable@freebsd.org
Received: from atrn.bpa.nu (CPE-144-132-209-248.nsw.bigpond.net.au [144.132.209.248])
	by hub.freebsd.org (Postfix) with ESMTP
	id 665DC37B423; Tue, 19 Sep 2000 02:26:35 -0700 (PDT)
Received: from juju.bsn (juju.bsn [192.168.1.5])
	by atrn.bpa.nu (8.9.3/8.9.3) with ESMTP id UAA85708;
	Tue, 19 Sep 2000 20:27:16 +1000 (EST)
	(envelope-from andy@ska.bsn)
Received: (from andy@localhost)
	by juju.bsn (8.9.3/8.9.3) id UAA03568;
	Tue, 19 Sep 2000 20:26:24 +1100 (EST)
	(envelope-from andy)
Message-Id: <200009190926.UAA03568@juju.bsn>
Date: Tue, 19 Sep 2000 20:26:24 +1100 (EST)
From: Andy Newman <atrn@zeta.org.au>
Reply-To: atrn@zeta.org.au
Subject: Re: MFC of ahc driver updates (long-ish)
To: stable@FreeBSD.org
Cc: "Brandon D. Valentine" <bandix@looksharp.net>, gibbs@FreeBSD.org
In-Reply-To: <200009162021.OAA02336@pluto.plutotech.com>
MIME-Version: 1.0
Content-Type: TEXT/plain; CHARSET=US-ASCII
Sender: owner-freebsd-stable@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Justin Gibbs wrote:
> I'd be more than happy to make patches available relative to -stable
> (sys/dev/aic7xxx/... can simply be copied to a 4.X system and it should
> work with an added register define in sys/pci/pcireg.h and some minor
> changes to sys/conf/files), but I can only sanction the merge to stable
> if adequate testing occurs.

Count me in for testing then.  I currently have a very unstable system
with a 29160. I'm not exactly sure if it's the 29160 or vinum that is
causing the problems. There's a new controller coming but while the
29160 is there it may as well be used for some good.

System details are:

	Gigabyte 6vxe+ VIA chipset m/b, PIII 600E, 512MB,
	29160, Seagate 9GB Cheetah + 4x 72GB Cheetahs
	FreeBSD 4.1-STABLE (several versions from release up to today)
	vinum RAID 5 over the 4x 72GB drives, system on the 9GB drive

Initial install hit the firmware problems in the Cheetahs. A quick (ha!)
Windows install on a spare IDE drive and a call to Seagate support (very
helpful) fixed that. (The Windows install took the longest, of course,
and the Seagate s/w crashed after upgrading the firmware on a new 72GB
drive, which means I have to buy a new pair of underpants :) It actually
upgraded the firmware okay, it just crashed in the process. Did the same
for all the 72GB drives. Threw away the IDE disk. Rebooted.

The latest firmware appears to have cured any troubles with the 9GB
Cheetah (ST39204LW). With the vinum'd ST173404LWs, however, the system
panics under high random I/O load. Bulk sequential I/O seems fine: vinum
can init the array, dd can fill it with zeros, benchmarks run.  But a
"find . >/dev/null" from the root of the RAID 5 file system will surely
panic it (gee, I get 2am reboots for nothing :)
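
For anyone wanting to generate a similar load in a repeatable way, here
is a minimal sketch in Python of the kind of traversal "find ." does. It
is an illustration only, not the command used above; the function name
walk_tree is made up, and it assumes that touching every directory entry
and inode (without reading file data) is what matters for reproducing
the pattern.

```python
import os
import sys


def walk_tree(root):
    """Visit every entry under root, lstat it, and return how many were seen.

    This produces many small directory and inode reads scattered over
    the volume, with no file data read. The root itself is not counted.
    """
    count = 0
    stack = [root]
    while stack:
        path = stack.pop()
        try:
            with os.scandir(path) as entries:
                for entry in entries:
                    try:
                        # Force an inode lookup without following symlinks.
                        entry.stat(follow_symlinks=False)
                        is_dir = entry.is_dir(follow_symlinks=False)
                    except OSError:
                        continue  # entry vanished or is unreadable
                    count += 1
                    if is_dir:
                        stack.append(entry.path)
        except OSError:
            continue  # unreadable or missing directory: skip it
    return count


if __name__ == "__main__":
    # Hypothetical usage: pass the mount point of the array as argv[1].
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    print(walk_tree(root))
```

Running several copies at once from different starting directories would
be one way to push the I/O rate higher.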

I've tried various file system configurations with some differences in
behavior. It appeared that soft updates or async mounts were quicker to
panic than a noasync mount (didn't try sync, didn't seem to be much fun
with GBs of I/O).  I guess it's the probably higher rate of I/O ops
causing different patterns of buffer and control block usage, interrupt
activity, etc., with more chance of a mess-up in a more complex pattern.
I'm also suspicious of the buffer corruption problem Greg Lehey still
mentions on his vinum known bugs page. Is that still around? Or is the
page stale?

I've yet to get a good crash dump of the machine. I followed the
instructions in the handbook (config -g or a makeoptions DEBUG=-g,
dumpon & dumpdev in rc.conf, and everything has enough space) but get
no dumps. Many of the panics I've caused remotely, which isn't much use
either. One I did observe (just today) was curious ... multiple panics in
succession followed by a total reset (and I really wanted to catch that
one, sigh).

Interestingly, building a debug kernel helped.  It was damned difficult
to make the machine fall over with the debug kernel. (It just turns on
symbols, doesn't it? There are no code gen differences, are there? Or is
there #ifdef debug code that modifies timing sufficiently?)  With
multiple concurrent operations on the array (a make -j4 buildworld with
its /usr/obj on it, copying the array out to another machine, copying
1GB of stuff to it at the same time, and multiple finds all running
continuously) it stayed up (compared to dying with a single find
previously). Of course it died within five minutes of me doing things to
it remotely (which explains this mail :).

I'll try building a system with the newer aic7xxx stuff and let you
know how things go.  It's currently building world to the 9GB drive
with a noasync mount of the RAID array and a 2.5GB copy going to
it via (async) NFS.....(waits)... Copy worked okay (but the buildworld
continues).

Later.

