From owner-freebsd-questions@FreeBSD.ORG  Thu Nov 11 18:35:18 2010
Message-Id: <0C3B7D09-CF38-40D9-A483-F5860DE16652@gmail.com>
From: Michael Boers <michaelscotttech@gmail.com>
To: freebsd-questions@freebsd.org
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
Content-Transfer-Encoding: 7bit
Mime-Version: 1.0 (Apple Message framework v936)
Date: Thu, 11 Nov 2010 13:12:55 -0500
X-Mailer: Apple Mail (2.936)
Subject: zfs mirrors and high availability

I am running a 100% ZFS-based FreeBSD 8.0 system with four disks: two
ZFS-mirrored boot drives and two ZFS-mirrored data drives.  This
morning the server went down with the following errors in the log file:

Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): SYNCHRONIZE CACHE(10). CDB: 35 0 0 0 0 0 0 0 0 0
Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): CAM Status: SCSI Status Error
Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): SCSI Status: Check Condition
Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): ABORTED COMMAND asc:0,0
Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): No additional sense information
Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): Retries Exhausted
Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003c87a0:2838 timed out for ccb 0xffffff0103acc000 (req->ccb 0xffffff0103acc000)
Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003c5110:2839 timed out for ccb 0xffffff035cab0800 (req->ccb 0xffffff035cab0800)
Nov 11 10:05:53 caprica kernel: mpt0: attempting to abort req 0xffffff80003c87a0:2838 function 0
Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003bef30:2840 timed out for ccb 0xffffff0007986800 (req->ccb 0xffffff0007986800)
Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003c8560:2841 timed out for ccb 0xffffff032d985000 (req->ccb 0xffffff032d985000)
Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003bf320:2842 timed out for ccb 0xffffff0103af2000 (req->ccb 0xffffff0103af2000)
Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003cbda0:2843 timed out for ccb 0xffffff0103b0b000 (req->ccb 0xffffff0103b0b000)
Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003bfd40:2844 timed out for ccb 0xffffff00102bf800 (req->ccb 0xffffff00102bf800)
Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003cad50:2845 timed out for ccb 0xffffff01e6f33000 (req->ccb 0xffffff01e6f33000)
Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003caf00:2846 timed out for ccb 0xffffff01e6f24800 (req->ccb 0xffffff01e6f24800)
Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003ccd60:2847 timed out for ccb 0xffffff01308a4000 (req->ccb 0xffffff01308a4000)
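
For anyone triaging a similar log, here is a minimal Python sketch that
tallies which device reported CAM errors and how many requests timed out
on each controller.  The regular expressions and the summarize() helper
are illustrative assumptions based only on the lines shown above; they
are not part of any FreeBSD tool, and the sample is just three entries
copied from this excerpt.

```python
import re
from collections import Counter

# Device tag printed with CAM errors, e.g. "(da2:mpt0:0:3:0)".
CAM_RE = re.compile(r"\((da\d+):(\w+):\d+:\d+:\d+\)")
# Controller timeout entries, e.g. "mpt0: request 0x...:2838 timed out".
# \s+ also matches a line break, so mail-wrapped entries still count.
TIMEOUT_RE = re.compile(r"(mpt\d+): request 0x[0-9a-f]+:\d+\s+timed out")


def summarize(log_text):
    """Return (CAM error lines per disk, timed-out requests per controller)."""
    cam_errors = Counter(m.group(1) for m in CAM_RE.finditer(log_text))
    timeouts = Counter(m.group(1) for m in TIMEOUT_RE.finditer(log_text))
    return cam_errors, timeouts


# Three entries from the excerpt; the last one is deliberately left wrapped.
SAMPLE = """\
Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): Retries Exhausted
Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003c87a0:2838 timed out for ccb 0xffffff0103acc000 (req->ccb 0xffffff0103acc000)
Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003c5110:2839
timed out for ccb 0xffffff035cab0800 (req->ccb 0xffffff035cab0800)
"""

if __name__ == "__main__":
    cam, timeouts = summarize(SAMPLE)
    print(dict(cam), dict(timeouts))
```

Run against the full excerpt, this should point at da2 as the only disk
reporting CAM errors, with every timeout on mpt0.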

Why didn't ZFS stop talking to the disk that was clearly having
issues?  Are there sysctls or other tunables that I can set so that
ZFS marks a disk as failed more aggressively?  Is there a way that I
could have prevented the crash?

The system was "up" and pingable, but not accessible via ssh.  My guess
is that all disk-related requests were queued or stuck.

A few more notes on my setup:

Hardware: Dell PowerEdge 2970, 1 CPU, 16 GB RAM

   pool: Storage
  state: ONLINE
  scrub: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	Storage     ONLINE       0     0     0
	  mirror    ONLINE       0     0     0
	    da1     ONLINE       0     0     0
	    da3     ONLINE       0     0     0

errors: No known data errors

   pool: zboot
  state: ONLINE
  scrub: scrub in progress for 0h22m, 72.03% done, 0h8m to go
config:

	NAME           STATE     READ WRITE CKSUM
	zboot          ONLINE       0     0     0
	  mirror       ONLINE       0     0     0
	    gpt/disk0  ONLINE       0     0     0
	    gpt/disk1  ONLINE       0     0     0

--
Thanks!