From owner-freebsd-stable@FreeBSD.ORG  Mon Feb 28 23:27:11 2005
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id E6D1A16A4CE
	for <freebsd-stable@freebsd.org>;
	Mon, 28 Feb 2005 23:27:11 +0000 (GMT)
Received: from gen129.n001.c02.escapebox.net (gen129.n001.c02.escapebox.net
	[213.73.91.129])
	by mx1.FreeBSD.org (Postfix) with ESMTP id EE94043D2F
	for <freebsd-stable@freebsd.org>;
	Mon, 28 Feb 2005 23:27:10 +0000 (GMT)
	(envelope-from gemini@geminix.org)
Message-ID: <4223A8C9.5060702@geminix.org>
Date: Tue, 01 Mar 2005 00:27:05 +0100
From: Uwe Doering <gemini@geminix.org>
Organization: Private UNIX Site
User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.5) Gecko/20050130
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Don Bowman <don@SANDVINE.com>
References: <2BCEB9A37A4D354AA276774EE13FB8C224D34D@mailserver.sandvine.com>
In-Reply-To: <2BCEB9A37A4D354AA276774EE13FB8C224D34D@mailserver.sandvine.com>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Received: from gemini by geminix.org with asmtp (TLSv1:AES256-SHA:256)
	(Exim 3.36 #1)
	id 1D5uI8-000OgG-00; Tue, 01 Mar 2005 00:27:08 +0100
cc: freebsd-stable@freebsd.org
Subject: Re: Adaptec 3210S, 4.9-STABLE, corruption when disk fails
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Production branch of FreeBSD source code
	<freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 28 Feb 2005 23:27:12 -0000

Don Bowman wrote:
> I have a machine running:
> 
> $ uname -a
> FreeBSD machine.phaedrus.sandvine.com 4.9-STABLE FreeBSD 4.9-STABLE #0:
> Fri Mar 19 10:39:07 EST 2004
> user@machine.phaedrus.sandvine.com:/usr/src/sys/compile/LABDB  i386
> 
> It has an adaptec 3210S raid controller running a single raid-5, and
> runs postgresql 7.4.6 as its primary application.
> 
> 3 times now I have had a drive fail, and have had corrupted files in the
> postgresql cluster @ the same time.
> 
> The time is too closely correlated to be a coincidence. It passes fsck @
> the time that I got to it a couple of hours later, and the filesystem
> seems to be ok (with a failed drive, the raid in 'degrade' mode).
> 
> It appears that the drive failure and the postgresql failure occur @
> exactly the same time (monitoring with nagios, within 1hr accuracy). It
> would appear that for some file(s) bad data was returned.
> 
> Does anyone have any suggestions?

In my experience, in a situation like this, RAID controllers can block
the system for up to a couple of minutes while trying to revive a failed
disk drive by sending it bus reset commands and the like, until they
eventually give up and drop into degraded mode.  With sufficiently
patient applications this is no problem, but if a program runs into
internal timeouts during this period, bad things can happen.

My point is that, while the disk controller may trigger the problem, the
component that actually corrupts the database might be PostgreSQL itself.
Of course, I'm aware that it's going to be quite hard to tell for sure
which one is the culprit.
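
One way to gather evidence for the stall theory would be to log how long
small synchronous writes take and compare the timestamps of any slow
samples with the time of the next drive failure.  The following is only a
minimal sketch under my own assumptions: the function names, the probe
file, the 4 KB payload and the 1 second threshold are all made up for
illustration and have nothing to do with the controller or PostgreSQL.

```python
import os
import tempfile
import time


def timed_fsync_write(path, payload=b"x" * 4096):
    """Append payload to path, force it to disk, return elapsed seconds."""
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # blocks until the storage stack reports completion
    finally:
        os.close(fd)
    return time.monotonic() - start


def find_stalls(latencies, threshold):
    """Return (index, latency) pairs for samples slower than threshold."""
    return [(i, t) for i, t in enumerate(latencies) if t > threshold]


if __name__ == "__main__":
    # Hypothetical probe file and threshold; adjust for the real system.
    probe = os.path.join(tempfile.gettempdir(), "io_probe.dat")
    samples = [timed_fsync_write(probe) for _ in range(5)]
    print(find_stalls(samples, threshold=1.0))
    os.remove(probe)
```

Run periodically on the affected filesystem, a probe like this would show
whether writes really hang for minutes around a drive failure.  It cannot
tell by itself which component then corrupts the data.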

    Uwe
-- 
Uwe Doering         |  EscapeBox - Managed On-Demand UNIX Servers
gemini@geminix.org  |  http://www.escapebox.net