From: Doug Ambrisko
Message-Id: <200501130030.j0D0UgxO089500@ambrisko.com>
In-Reply-To: <1433078378.20050111134014@byrnehq.com>
To: Tony Byrne
cc: freebsd-stable@freebsd.org
Date: Wed, 12 Jan 2005 16:30:42 -0800 (PST)
Subject: Re: MegaRAID 'Bad Slot' Kernel message and crash.
List-Id: Production branch of FreeBSD source code

Tony Byrne writes:
| Basically, after some amount of uptime the kernel will emit a "amr0:
| Bad slot x completed" message and pretty soon after this the box goes into a
| partially unresponsive state forcing us to reboot it. So far the only
| thing triggering the problem is the nightly jobs, where the amount of
| IO is higher than during the day.
|
| Before deployment, we tested the box with 5.3-STABLE and managed to
| trigger the problem twice. This forced us to try 4.10-STABLE which
| was fine in testing and for a number of weeks after deployment.
| However, just before new year we saw our first Bad Slot and crash under
| 4.10. Since then it has happened 3 more times. We have upgraded the firmware to
| the latest version available from Intel, and if anything this has made
| the problem worse.
|
| The machine had 3 disks configured as a single RAID5 array. A fourth
| disk is configured as a hot-standby. The card is equipped with 128Mb
| of battery-backed cache. Write-back caching is enabled on the card.
| Read-ahead caching is enabled in non-adaptive mode.
|
| Is anyone else using a SRCU42X RAID card and seeing similar
| problems to ours? What about other cards supported by the amr driver?

We run RAID 10 across 4 drives at work on Dell PE2850s, which have
amr-based RAID controllers, and no one has reported this problem to me
(and they do report problems to me). We run FreeBSD 4.10 and 5.3 on
them, both with and without our local mods. We have the most experience
with 4.10. Dell has its own firmware version (at least enough to call
it a PERC controller).

For now this is a "works for me".

Doug A.