Date: Mon, 10 Mar 2003 23:15:32 +0000
From: Scott Mitchell
To: freebsd-questions@freebsd.org
Subject: Strange crash, possibly vinum-related
Message-ID: <20030310231532.GD522@tuatara.fishballoon.org>

Hi all,

I wonder if anyone out there can shed any light on this:

A drive failed on one of our Vinum-powered RAID-5 arrays over the weekend.
This morning, we swapped out the offending drive (hot-swappable SCSI
hardware), disklabel-ed it and restarted the offending subdisk.
Everything seemed fine at this point, with vinum happily reviving the stale
subdisk. However, twenty minutes later, with the revive 29% complete, I got
this in /var/log/messages:

  Mar 10 11:39:50 kokako vinum[12708]: can't revive raid.p0.s0: Invalid argument

'vinum list' was also showing an error message, which I foolishly didn't
capture, something along the lines of 'the revive process died'. Lacking
any better ideas, I started the subdisk again. The revival seemed to pick
up where it left off.

Half an hour later, the box rebooted :-( I wasn't actually watching it at
the time, so I don't know if it finished reviving the subdisk or not.
There's no indication in the logs as to what happened, but the timing of
the reboot is consistent with it happening around the time the subdisk
would have come back to life.

Once the box came back up, I restarted the subdisk yet again (I had to
create the drive again first), with the RAID volume unmounted. This time
the process finished without complaints and things seem to be working as
well as ever since then.

Any ideas as to why it might have just spontaneously rebooted? I'm naively
assuming vinum is somehow involved, since the machine has always been
extremely stable, and this is the first disk failure we've had on it. It's
just as likely to be my own stupid fault though -- it occurred to me later
that I should probably have done a 'camcontrol rescan' after swapping the
drive. Nothing complained, but maybe I just got lucky because it was an
identical drive? Also, one of my colleagues was hammering the volume
pretty hard while the revive was going on, copying a large number of
smallish files onto it - maybe the load triggered some bad behaviour?

Thanks in advance,

	Scott

Here's the rest of the information requested by
http://www.vinumvm.org/vinum/how-to-debug.html:

The machine is a PIII-700 on an Intel 440GX board, 512MB RAM. Adaptec
aic7896/97 U2 SCSI controller, all the disks are 36GB IBM DDYS units,
10Krpm, SCA connectors.

(501) ~ $ uname -a
FreeBSD kokako 4.6-RELEASE-p1 FreeBSD 4.6-RELEASE-p1 #0: Fri Jun 28 13:39:16 BST 2002 rsm@kokako:/scratch/obj/usr/src/sys/KOKAKO i386

kokako# vinum list
5 drives:
D d0            State: up   Device /dev/da0a   Avail: 1/35003 MB (0%)
D d1            State: up   Device /dev/da1a   Avail: 1/35003 MB (0%)
D d2            State: up   Device /dev/da2a   Avail: 1/35003 MB (0%)
D d3            State: up   Device /dev/da3a   Avail: 1/35003 MB (0%)
D d4            State: up   Device /dev/da4a   Avail: 1/35003 MB (0%)

1 volumes:
V raid          State: up   Plexes:   1        Size: 136 GB

1 plexes:
P raid.p0    R5 State: up   Subdisks: 5        Size: 136 GB

5 subdisks:
S raid.p0.s0    State: up   PO:    0  B        Size:  34 GB
S raid.p0.s1    State: up   PO: 2003 kB        Size:  34 GB
S raid.p0.s2    State: up   PO: 4006 kB        Size:  34 GB
S raid.p0.s3    State: up   PO: 6009 kB        Size:  34 GB
S raid.p0.s4    State: up   PO: 8012 kB        Size:  34 GB

# Relevant bits of vinum_history:
# Why are some of these 'start' lines duplicated and out of order?

10 Mar 2003 11:22:09.337534 start raid.p0.s0
10 Mar 2003 11:22:17.505487 l -r raid
10 Mar 2003 11:23:05.442661 l -r raid
[...]
10 Mar 2003 11:22:09.337534 start raid.p0.s0
10 Mar 2003 11:45:29.401210 *** vinum started ***
10 Mar 2003 11:45:30.305911 l
10 Mar 2003 11:46:42.610802 start raid.p0.s0
10 Mar 2003 11:46:47.382081 l -r raid
10 Mar 2003 11:47:00.815044 l -r raid
[...]
[Reboot happened here]
10 Mar 2003 12:51:10.003180 *** vinum started ***
10 Mar 2003 12:51:13.544487 list -r raid
10 Mar 2003 12:51:25.151100 start raid.p0.s0
10 Mar 2003 12:51:30.581583 list
10 Mar 2003 12:52:32.837161 quit
10 Mar 2003 12:54:26.495344 *** vinum started ***
10 Mar 2003 12:54:26.495817 create temp.conf
drive d0 device /dev/da0a
10 Mar 2003 12:54:26.512027 *** Created devices ***
10 Mar 2003 12:54:53.493176 *** vinum started ***
10 Mar 2003 12:54:55.385954 l
10 Mar 2003 12:55:18.403984 start raid.p0.s0
10 Mar 2003 12:55:22.774523 l -r raid
10 Mar 2003 13:07:05.527444 l -r raid
[...]
10 Mar 2003 12:55:18.403984 start raid.p0.s0
10 Mar 2003 13:52:15.579181 l -r raid
10 Mar 2003 13:52:17.925615 l
10 Mar 2003 13:52:33.437222 l -v
10 Mar 2003 13:53:19.637191 l

-- 
===========================================================================
Scott Mitchell           | PGP Key ID | "Eagles may soar, but weasels
Cambridge, England       | 0x54B171B9 |  don't get sucked into jet engines"
scott at fishballoon.org | 0xAA775B8B |              -- Anon