From: Karl Pielorz
Date: Mon, 12 Aug 2013 14:07:02 +0100
To: freebsd-geom@FreeBSD.org
Subject: Onboard RAID panic / reboot after CAM timeout?

Hi,

I've got an amd64 '9.1-STABLE' box running with the system's 'onboard' RAID, i.e.

ahci0: port 0xf070-0xf077,0xf060-0xf063,0xf050-0xf057,0xf040-0xf043,0xf000-0xf01f mem 0xdfa22000-0xdfa227ff irq 19 at device 31.2 on pci0
ahci0: AHCI v1.30 with 6 3Gbps ports, Port Multiplier not supported

This is set up, and has been running fine:

Name     Status   Components
raid/r0  OPTIMAL  ada0 (ACTIVE (ACTIVE))
                  ada1 (ACTIVE (ACTIVE))

The other day the machine picked up a CAM timeout, and rebooted:

"
ahcich1: Timeout on slot 31 port 0
ahcich1: is 00000000 cs 00000000 ss 80000000 rs 80000000 tfd 40 serr 00000000 cmd 0004df17
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 c0 4a e9 40 03 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: Command timeout
(ada1:ahcich1:0:0:0): Retrying command
"

By the time we'd gotten onto the box it had restarted, and had started rebuilding the RAID array. This completed OK - and it has been OK since.

Presumably RAID should have either recovered/handled this, or at least just failed ada1 and continued?

Are there any known issues with CAM timeouts on graid'ed drives not being survivable?

Cheers,

-Karl