From owner-freebsd-stable@FreeBSD.ORG Fri Apr 1 16:27:49 2005
Date: Fri, 1 Apr 2005 11:27:47 -0500
Message-ID: <2BCEB9A37A4D354AA276774EE13FB8C23A6939@mailserver.sandvine.com>
From: "Don Bowman" <don@SANDVINE.com>
To: "Uwe Doering" <gemini@geminix.org>
Cc: freebsd-stable@freebsd.org
Subject: RE: Adaptec 3210S, 4.9-STABLE, corruption when disk fails
List-Id: Production branch of FreeBSD source code

From: Uwe Doering [mailto:gemini@geminix.org]
...
> As far as I understand this family of controllers, the OS
> drivers aren't involved at all in case of a disk drive
> failure. It's strictly the controller's business to deal
> with it internally. The OS just sits there and waits until
> the controller is done with the retries and either drops into
> degraded mode or recovers from the disk error.
>
> That's why I initially speculated that there might be a
> timeout somewhere in PostgreSQL or FreeBSD that leads to data
> loss if the controller is busy for too long.
>
> A somewhat radical way to at least make these failures as
> rare an event as possible would be to deliberately fail all
> remaining old disk drives, one after the other of course, in
> order to get rid of them. And if you are lucky the problem
> won't happen with newer drives anyway, in case the root cause
> is an incompatibility between the controller and the old drives.

Started that yesterday. I've got one 'old' one left. Sadly, the one
that failed the night before last was not one of the 'old' ones, so
this is no guarantee :)

From the raidutil -e log, I see this type of info. I'm not sure what
the 'unknown' events are. The 'CRC Failure' is probably the problem?
There's also Bad SCSI Status, Unit Attention, etc. Perhaps the driver
doesn't deal with these properly?

$ raidutil -e d0
03/31/2005 23:37:59 Level 1  Lock for Channel 0 : Started
03/31/2005 23:37:59 Level 1  Lock for Channel 1 : Started
03/31/2005 23:38:09 Level 1  Lock for Channel 0 : Stopped
03/31/2005 23:38:22 Level 1  Lock for Channel 1 : Stopped
03/31/2005 23:38:22 Level 4  HBA=0 BUS=0 ID=0 LUN=0
                             Status Change Optimal => Degraded - Drive Failed
03/31/2005 23:38:22 Level 1  Unknown Event : 56 10 00 08 EE 89 4C 42 00 00 00 00
03/31/2005 23:38:22 Level 1  CRC Failure
                             Number of dirty blocks = -1
                             FFFFFFFF D30A1F2A 00000000 00000000
                             00000000 00000000 00000000 00000000
                             00000000 00000000 00000000 00000000
                             00000000 00000000 00000000 00000000
03/31/2005 23:38:24 Level 3  HBA=0 BUS=0 ID=0 LUN=0
                             Bad SCSI Status - Check Condition
                             28 00 00 00 00 00 00 00 01 00 00 00
03/31/2005 23:38:24 Level 3  HBA=0 BUS=0 ID=0 LUN=0
                             Request Sense
                             70 00 06 00 00 00 00 0A 00 00 00 00 29 02 02 00 00 00
                             Unit Attention
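For what it's worth, those last two entries can be decoded by hand: the
bytes after "Check Condition" look like the failed CDB (opcode 0x28 is
READ(10)), and the Request Sense bytes are fixed-format (0x70) sense data.
A small Python sketch (purely illustrative, not part of raidutil) that
pulls out the sense key and ASC/ASCQ per the SCSI-2 fixed-format layout:

```python
def decode_sense(data: bytes):
    """Decode fixed-format (response code 0x70) SCSI sense data."""
    assert data[0] & 0x7F == 0x70, "only fixed-format sense handled here"
    sense_key = data[2] & 0x0F   # byte 2, low nibble
    asc = data[12]               # additional sense code
    ascq = data[13]              # additional sense code qualifier
    return sense_key, asc, ascq

# The Request Sense bytes from the log entry above:
sense = bytes.fromhex("70 00 06 00 00 00 00 0A 00 00 00 00 29 02 02 00 00 00")
key, asc, ascq = decode_sense(sense)
print(f"sense key 0x{key:X}, ASC/ASCQ {asc:02X}h/{ascq:02X}h")
```

This yields sense key 06h (Unit Attention) with ASC/ASCQ 29h/02h, which the
SCSI spec assigns to "SCSI bus reset occurred", consistent with the
controller resetting the bus after dropping the failed drive.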