From owner-freebsd-geom@FreeBSD.ORG  Mon Aug 12 13:07:07 2013
Date: Mon, 12 Aug 2013 14:07:02 +0100
From: Karl Pielorz <kpielorz_lst@tdx.co.uk>
To: freebsd-geom@FreeBSD.org
Subject: Onboard RAID panic / reboot after CAM timeout?
Message-ID: <4C7053FCE24480BF96DF525A@Mail-PC.tdx.co.uk>
X-Mailer: Mulberry/4.0.8 (Win32)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline


Hi,

I've got an amd64 '9.1-STABLE' box running with the system's 'onboard' RAID, 
i.e.

ahci0: <Intel ICH8 AHCI SATA controller> port 
0xf070-0xf077,0xf060-0xf063,0xf050-0xf057,0xf040-0xf043,0xf000-0xf01f mem 
0xdfa22000-0xdfa227ff irq 19 at device 31.2 on pci0
ahci0: AHCI v1.30 with 6 3Gbps ports, Port Multiplier not supported


This is set up, and has been running fine:

   Name   Status  Components
raid/r0  OPTIMAL  ada0 (ACTIVE (ACTIVE))
                  ada1 (ACTIVE (ACTIVE))


The other day the machine picked up a CAM timeout, and rebooted:

"
ahcich1: Timeout on slot 31 port 0
ahcich1: is 00000000 cs 00000000 ss 80000000 rs 80000000 tfd 40 serr 
00000000 cmd 0004df17
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 c0 4a e9 40 03 00 00 
00 00 00
(ada1:ahcich1:0:0:0): CAM status: Command timeout
(ada1:ahcich1:0:0:0): Retrying command
"
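
For anyone reading the log above, here is a minimal sketch of how its fields might be decoded. It rests on several assumptions that are not stated in the log itself: that CAM prints the 12 ACB bytes in the order command, features, lba_low, lba_mid, lba_high, device, lba_low_exp, lba_mid_exp, lba_high_exp, features_exp, sector_count, sector_count_exp; that for FPDMA (NCQ) commands the sector count is carried in the features bytes; that 'ss' is the port's SActive register as a per-slot bitmask; and that the low byte of 'tfd' is the ATA status register. The function names are hypothetical and exist only for illustration.

```python
# Sketch only: decode the ACB bytes and AHCI status words from the log.
# The field layout below is an assumption (see the paragraph above).

def decode_acb(acb_hex):
    """Decode a 12-byte ACB dump such as '61 08 c0 4a ...'."""
    b = [int(x, 16) for x in acb_hex.split()]
    if len(b) != 12:
        raise ValueError("expected 12 ACB bytes")
    # 48-bit LBA: low/mid/high bytes, then the three 'exp' bytes.
    lba = (b[2] | (b[3] << 8) | (b[4] << 16) |
           (b[6] << 24) | (b[7] << 32) | (b[8] << 40))
    # Assumed NCQ convention: sector count lives in features/features_exp.
    sectors = b[1] | (b[9] << 8)
    return {"command": b[0], "lba": lba, "sectors": sectors}

def outstanding_slots(mask):
    """Return the slot numbers whose bits are set in a 32-bit slot mask."""
    return [i for i in range(32) if (mask >> i) & 1]

def decode_tfd(tfd):
    """Split out the common ATA status bits from the low byte of tfd."""
    status = tfd & 0xFF
    return {
        "BSY": bool(status & 0x80),   # device busy
        "DRDY": bool(status & 0x40),  # device ready
        "DRQ": bool(status & 0x08),   # data request
        "ERR": bool(status & 0x01),   # error
    }

if __name__ == "__main__":
    print(decode_acb("61 08 c0 4a e9 40 03 00 00 00 00 00"))
    print(outstanding_slots(0x80000000))
    print(decode_tfd(0x40))
```

Read this way, the timed-out command would be an 8-sector write (command 0x61) at LBA 0x03e94ac0, with only slot 31 still outstanding and the drive reporting ready with neither BSY nor ERR set.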

By the time we'd gotten onto the box it had restarted, and had started 
rebuilding the RAID array. This completed OK - and it has been OK since.

Presumably the RAID layer should have either recovered from this, or at 
least just failed ada1 and continued?

Are there any known issues with CAM timeouts on graid'ed drives not being 
survivable?

Cheers,

-Karl