Date: Thu, 31 Mar 2005 16:00:21 -0500
From: "Don Bowman" <don@SANDVINE.com>
To: "Uwe Doering" <gemini@geminix.org>
cc: freebsd-stable@freebsd.org
Subject: RE: Adaptec 3210S, 4.9-STABLE, corruption when disk fails

From: Uwe Doering [mailto:gemini@geminix.org]
> Don Bowman wrote:
> > From: owner-freebsd-stable@freebsd.org
> >
> >>From: Uwe Doering [mailto:gemini@geminix.org]  ...
> >>
> >>>>Did you merge 1.3.2.3 as well?  This actually should have
> >>>>been one MFC
> >>
> >>Yes, merged from RELENG_4.
> >>
> >>I will post later if this happens again, but it will be quite a long
> >>time. The machine has 7 drives in it, and there are only
> >>3 left old enough that they might fail before I take it out of service
> >>(it originally had 7 1999-era IBM drives, now it has 4 2004-era
> >>Seagate drives and 3 of the old IBMs.
> >>The drives have been in continuous service, so they've led a pretty
> >>good life!)
> >>
> >>Thanks for the suggestion on the cam timeout, I've set that value.
> >
> > Another drive failed and the same thing happened.
> > After the failure, the raid worked in degraded mode just fine, but many
> > files had been corrupted during the failure.
> >
> > So I would suggest that this merge did not help, and the cam timeout
> > did not help either.
> >
> > This is very frustrating, again I rebuild my postgresql install from
> > backup :(
>
> This is indeed unfortunate.  Maybe the problem is in fact
> located neither in PostgreSQL nor in FreeBSD but in the
> controller itself.  Does it have the latest firmware?  The
> necessary files should be available on Adaptec's website, and
> you can use the 'raidutil' program under FreeBSD to upload
> the firmware to the controller.  I have to concede, however,
> that I never did this under FreeBSD myself.  If I recall
> correctly I did the upload via a DOS diskette the last time.
>
> If this doesn't help either you could ask Adaptec's support for help.
> You need to register the controller first, if memory serves.

The latest firmware and BIOS are in the controller (upgraded the
last time I had problems).

I have already tried Adaptec support, and the controller is registered.

The problem is definitely not in postgresql. Files go missing
in directories that are having new entries added (e.g. I lost
a 'PG_VERSION' file). Data within the postgresql files becomes
corrupt. Since the only application running is postgresql,
and it reads/writes/fsyncs the data, it's not unexpected that
it's the one that reaps the 'rewards' of the failure.

I have to believe this is either a bug in the controller,
or a problem in CAM or the asr driver.

--don