From: Michael Boers
To: Jeremy Chadwick
Cc: freebsd-fs@freebsd.org
Subject: Re: zfs mirror recognizing disk failures
Date: Tue, 16 Nov 2010 10:29:40 -0500
Message-Id: <99CF1585-9D89-4F66-B85C-67EA30DD0BD9@gmail.com>
In-Reply-To: <20101116135838.GA91324@icarus.home.lan>

On Nov 16, 2010, at 8:58 AM, Jeremy Chadwick wrote:

> On Tue, Nov 16, 2010 at 08:32:35AM -0500, Michael Boers wrote:
>> To answer Jeremy's question of "what happened next?"
>>
>> The machine was not serving web requests
>> The machine was not responsive via ssh
>> The machine was pingable
>>
>> After waiting about 15 minutes, I used the IPMI protocol to power
>> down the machine.
>> When it came back up, I found the enclosed errors in the log.
>>
>> If I am following your comments correctly, the fault for this lies
>> in the mpt system not giving up, which could be either a driver or a
>> firmware issue. Is that correct?
>>
>> How do I protect against that?
>
> The fault, in my opinion -- and I urge others (especially those
> familiar with the driver) to correct me, because I am often wrong --
> lies with either the controller itself, or mpt(4), not truly "giving
> up" after repetitive errors. It could be a firmware bug/quirk, sure.
> It could be a lot of things, or a combination of things. I don't want
> to rule out anything.
>
> For example, at my workplace we use Solaris with Adaptec controllers,
> using a multitude of Fujitsu disks. Everything is SCSI-3.
> We regularly (at least once a week, usually more than that) see disk
> problems where either the disk falls off the bus unexpectedly, the
> drive itself "wedges" (resulting in the controller getting stuck in an
> infinite loop trying to talk to it) and won't unwedge without a full
> power-cycle (soft reset doesn't work), or in certain bad block
> circumstances the drive wedges long enough for the controller driver
> to break in a strange way (resulting in a system panic). Each
> situation appears to be different; there are definitely situations
> where the disk is responsible, others which look like the controller
> is responsible, and others which look like driver issues.
>
> I'm not familiar with (read: have not used) mpt(4) controllers, but if
> my memory serves me right, people post about problems with them from
> time to time on FreeBSD. Each incident has to be addressed separately.
>
> If you're asking for a workaround or "what should I do", the solution
> is to either change controllers (read: avoid mpt(4)), or figure out
> how/why the disk became wedged (or if it even did in the first place).
>
> Your original post contains no useful information about the hardware
> itself (mpt handles many controllers yet we know not what model, we
> know nothing about disk da2, etc.). You're going to need to provide
> this. Relevant dmesg output, camcontrol devlist, camcontrol inquiry,
> and smartctl -a output for the disk would be useful (assuming the
> controller supports passthrough).

Thanks for the detailed response; it has given me some things to think
about.

You are right, I had not posted much about the machine in question.
For those interested now, or who may run across this in the archives, I
provide it here (edited and partially reconstructed from backups of the
log files):

The machine is a Dell PowerEdge 2970 with SAS 6/iR Integrated, x6 Backplane

Aug 24 05:40:41 caprica kernel: FreeBSD 8.0-RELEASE #0: Fri Jan 29 14:17:29 EST 2010
Aug 24 05:40:41 caprica kernel: CPU: Quad-Core AMD Opteron(tm) Processor 2387 (2793.03-MHz K8-class CPU)
Aug 24 05:40:41 caprica kernel: real memory = 17179869184 (16384 MB)
Aug 24 05:40:41 caprica kernel: FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
Aug 24 05:40:41 caprica kernel: FreeBSD/SMP: 1 package(s) x 4 core(s)
Aug 24 05:40:41 caprica kernel: mpt0: port 0xec00-0xecff mem 0xe9fec000-0xe9feffff,0xe9ff0000-0xe9ffffff irq 37 at device 0.0 on pci7
Aug 24 05:40:41 caprica kernel: mpt0: [ITHREAD]
Aug 24 05:40:41 caprica kernel: mpt0: MPI Version=1.5.18.0
Aug 24 05:40:41 caprica kernel: mpt0: Capabilities: ( RAID-0 RAID-1E RAID-1 )
Aug 24 05:40:41 caprica kernel: mpt0: 0 Active Volumes (2 Max)
Aug 24 05:40:41 caprica kernel: mpt0: 0 Hidden Drive Members (14 Max)
Aug 24 05:40:41 caprica kernel: ZFS filesystem version 13
Aug 24 05:40:41 caprica kernel: ZFS storage pool version 13
Aug 24 05:40:41 caprica kernel: Timecounters tick every 1.000 msec
Aug 24 05:40:41 caprica kernel: da0: Fixed Direct Access SCSI-5 device
Aug 24 05:40:41 caprica kernel: da0: 300.000MB/s transfers
Aug 24 05:40:41 caprica kernel: da0: Command Queueing enabled
Aug 24 05:40:41 caprica kernel: da0: 152587MB (312500000 512 byte sectors: 255H 63S/T 19452C)
Aug 24 05:40:41 caprica kernel: da1 at mpt0 bus 0 target 1 lun 0
Aug 24 05:40:41 caprica kernel: da1: Fixed Direct Access SCSI-5 device
Aug 24 05:40:41 caprica kernel: da1: 300.000MB/s transfers
Aug 24 05:40:41 caprica kernel: da1: Command Queueing enabled
Aug 24 05:40:41 caprica kernel: da1: 476940MB (976773168 512 byte sectors: 255H 63S/T 60801C)
Aug 24 05:40:41 caprica kernel: ses0 at mpt0 bus 0 target 8 lun 0
Aug 24 05:40:41 caprica kernel: ses0: Fixed Enclosure Services SCSI-5 device
Aug 24 05:40:41 caprica kernel: ses0: 300.000MB/s transfers
Aug 24 05:40:41 caprica kernel: ses0: SCSI-3 SES Device

I added the mirror disks later:

Oct 15 10:47:21 caprica kernel: da2 at mpt0 bus 0 target 3 lun 0
Oct 15 10:47:21 caprica kernel: da2: Fixed Direct Access SCSI-5 device
Oct 15 10:47:21 caprica kernel: da2: 300.000MB/s transfers
Oct 15 10:47:21 caprica kernel: da2: Command Queueing enabled
Oct 15 10:47:21 caprica kernel: da2: 476940MB (976773168 512 byte sectors: 255H 63S/T 60801C)
Oct 15 10:47:21 caprica kernel: da3 at mpt0 bus 0 target 2 lun 0
Oct 15 10:47:21 caprica kernel: da3: Fixed Direct Access SCSI-5 device
Oct 15 10:47:21 caprica kernel: da3: 300.000MB/s transfers
Oct 15 10:47:21 caprica kernel: da3: Command Queueing enabled
Oct 15 10:47:21 caprica kernel: da3: 152587MB (312500000 512 byte sectors: 255H 63S/T 19452C)

I then started getting the occasional error on da3. (I did not realize
this until after the crash; I am now using swatch to check for mpt
errors.)

Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): WRITE(10). CDB: 2a 0 2 4 58 a2 0 0 80 0
Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): CAM Status: SCSI Status Error
Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): SCSI Status: Check Condition
Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): UNIT ATTENTION asc: 29,0
Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): Power on, reset, or bus device reset occurred
Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): Retrying Command (per Sense Data)

Camcontrol output (partially reconstructed, as the drives are currently
on my desk):

at scbus0 target 0 lun 0 (pass0,da0)
at scbus0 target 1 lun 0 (pass1,da1)
at scbus0 target 2 lun 0 (pass2,da2)
at scbus0 target 3 lun 0 (pass2,da3)
at scbus0 target 8 lun 0 (ses0,pass4)

This is all I can provide at this time. I appreciate all of the help
provided thus far and in future.
I am going to check into BIOS updates for the SAS 6/iR, and I am in the
process of moving to 8.1 for better mpt support.

Thanks again.

>
> Finally, be aware that trying to chase down a problem of this nature
> is often time-consuming. Sometimes it's not worth it at all, and the
> time is instead better spent replacing all of the hardware involved.
> If it happens again after that, change vendors or hardware controllers
> (or disks) used. That's just how it goes. I tend to stick to Intel
> ICHxx or ESB SATA controllers for this reason; they're well-tested on
> FreeBSD. And I don't use hardware RAID at all for many reasons
> (separate topic).
>
> --
> | Jeremy Chadwick                                   jdc@parodius.com |
> | Parodius Networking                       http://www.parodius.com/ |
> | UNIX Systems Administrator                  Mountain View, CA, USA |
> | Making life hard for others since 1977.              PGP: 4BD6C0CB |
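
P.S. for archive readers: the kind of log check I mention above (watching
for CAM errors on the mpt devices) could be roughed out along the lines
below. This is only a minimal sketch, not the swatch configuration I
actually run. The function name count_cam_errors and the regular
expression are made up for illustration, and the pattern is inferred
solely from the da3 log lines quoted in this message, so it may well miss
other message formats.

```python
import re
from collections import Counter

# Pattern inferred only from the log excerpt in this message, e.g.
# "(da3:mpt0:0:2:0): CAM Status: SCSI Status Error".
# It is an assumption, not a documented log format.
CAM_LINE = re.compile(
    r"\((?P<dev>da\d+):(?P<ctrl>mpt\d+):\d+:\d+:\d+\): CAM Status: (?P<status>.+)$"
)


def count_cam_errors(lines):
    """Count "CAM Status" report lines per (device, controller) pair."""
    counts = Counter()
    for line in lines:
        match = CAM_LINE.search(line.rstrip("\r\n"))
        if match:
            counts[(match["dev"], match["ctrl"])] += 1
    return counts


# Small demonstration using lines copied from the excerpt above.
demo_lines = [
    "Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): WRITE(10). CDB: 2a 0 2 4 58 a2 0 0 80 0",
    "Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): CAM Status: SCSI Status Error",
    "Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): SCSI Status: Check Condition",
]

for (dev, ctrl), n in sorted(count_cam_errors(demo_lines).items()):
    print(f"{dev} on {ctrl}: {n} CAM status report(s)")
```

To use it on a real system, pass an open log file (on FreeBSD, kernel
messages typically land in /var/log/messages) to count_cam_errors instead
of the demo list. Counting only the "CAM Status" line keeps each failed
command from being counted once per line of its multi-line report.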