Date:      Mon, 10 Mar 2003 23:15:32 +0000
From:      Scott Mitchell <scott+freebsd@fishballoon.org>
To:        freebsd-questions@freebsd.org
Subject:   Strange crash, possibly vinum-related
Message-ID:  <20030310231532.GD522@tuatara.fishballoon.org>

Hi all,

I wonder if anyone out there can shed any light on this:

A drive failed on one of our Vinum-powered RAID-5 arrays over the weekend.
This morning, we swapped out the failed drive (hot-swappable SCSI
hardware), disklabel-ed the replacement and restarted the stale subdisk.
Everything seemed fine at this point, with vinum happily reviving it.

However, twenty minutes later, with the revive 29% complete, I got this in
/var/log/messages:

Mar 10 11:39:50 kokako vinum[12708]: can't revive raid.p0.s0: Invalid argument

'vinum list' was also showing an error message, which I foolishly didn't
capture, something along the lines of 'the revive process died'.  Lacking
any better ideas, I started the subdisk again.  The revival seemed to pick
up where it left off.

Half an hour later, the box rebooted :-(  I wasn't actually watching it at
the time, so I don't know if it finished reviving the subdisk or not.
There's no indication in the logs as to what happened, but the timing of
the reboot is consistent with it happening around the time the subdisk
would have come back to life.

Once the box came back up, I restarted the subdisk yet again (I had to
create the drive again first), with the RAID volume unmounted.  This time
the process finished without complaints and things seem to be working as
well as ever since then.

Any ideas as to why it might have just spontaneously rebooted?  I'm naively
assuming vinum is somehow involved, since the machine has always been
extremely stable, and this is the first disk failure we've had on it.  It's
just as likely to be my own stupid fault though -- it occurred to me later
that I should probably have done a 'camcontrol rescan' after swapping the
drive.  Nothing complained, but maybe I just got lucky because it was an
identical drive?  Also, one of my colleagues was hammering the volume
pretty hard while the revive was going on, copying a large number of
smallish files onto it - maybe the load triggered some bad behaviour?

Thanks in advance,

	Scott


Here's the rest of the information requested by
http://www.vinumvm.org/vinum/how-to-debug.html:

The machine is a PIII-700 on an Intel 440GX board, 512MB RAM.  Adaptec
aic7896/97 U2 SCSI controller, all the disks are 36GB IBM DDYS units,
10Krpm, SCA connectors.

(501) ~ $ uname -a
FreeBSD kokako 4.6-RELEASE-p1 FreeBSD 4.6-RELEASE-p1 #0: Fri Jun 28 13:39:16 BST 2002     rsm@kokako:/scratch/obj/usr/src/sys/KOKAKO  i386


kokako# vinum list
5 drives:
D d0                    State: up       Device /dev/da0a        Avail: 1/35003 MB (0%)
D d1                    State: up       Device /dev/da1a        Avail: 1/35003 MB (0%)
D d2                    State: up       Device /dev/da2a        Avail: 1/35003 MB (0%)
D d3                    State: up       Device /dev/da3a        Avail: 1/35003 MB (0%)
D d4                    State: up       Device /dev/da4a        Avail: 1/35003 MB (0%)

1 volumes:
V raid                  State: up       Plexes:       1 Size:        136 GB

1 plexes:
P raid.p0            R5 State: up       Subdisks:     5 Size:        136 GB

5 subdisks:
S raid.p0.s0            State: up       PO:        0  B Size:         34 GB
S raid.p0.s1            State: up       PO:     2003 kB Size:         34 GB
S raid.p0.s2            State: up       PO:     4006 kB Size:         34 GB
S raid.p0.s3            State: up       PO:     6009 kB Size:         34 GB
S raid.p0.s4            State: up       PO:     8012 kB Size:         34 GB
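
As a sanity check on the sizes in the listing above, here is a rough Python
sketch of the RAID-5 capacity arithmetic: one subdisk's worth of space goes
to parity, so the usable size is (n - 1) times the subdisk size.  The drive
count and the 35003 MB / 1 MB figures come from the listing; the function
name is made up for illustration, and it is an assumption (not stated
anywhere in this message) that each subdisk occupies everything except the
1 MB shown as available and that vinum reports 1024-based, truncated units.

```python
def raid5_usable_mb(subdisk_mb, n_subdisks):
    """Usable capacity of a RAID-5 plex: one subdisk's worth holds parity."""
    if n_subdisks < 3:
        raise ValueError("RAID-5 needs at least three subdisks")
    return subdisk_mb * (n_subdisks - 1)


# Figures from 'vinum list': 35003 MB per drive, 1 MB still available.
# Assumption: the subdisk occupies the remaining 35002 MB.
subdisk_mb = 35003 - 1

usable_mb = raid5_usable_mb(subdisk_mb, 5)

# Assumption: sizes are shown in 1024-based units, truncated to whole GB.
usable_gb = usable_mb // 1024

print(usable_mb, "MB is about", usable_gb, "GB")
```

Under those assumptions the result agrees with the 136 GB reported for the
volume and the plex.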


# Relevant bits of vinum_history:
# Why are some of these 'start' lines duplicated and out of order?
10 Mar 2003 11:22:09.337534 start raid.p0.s0 
10 Mar 2003 11:22:17.505487 l -r raid 
10 Mar 2003 11:23:05.442661 l -r raid 
[...]
10 Mar 2003 11:22:09.337534 start raid.p0.s0 
10 Mar 2003 11:45:29.401210 *** vinum started ***
10 Mar 2003 11:45:30.305911 l 
10 Mar 2003 11:46:42.610802 start raid.p0.s0 
10 Mar 2003 11:46:47.382081 l -r raid 
10 Mar 2003 11:47:00.815044 l -r raid 
[...]
[Reboot happened here]
10 Mar 2003 12:51:10.003180 *** vinum started ***
10 Mar 2003 12:51:13.544487 list -r raid 
10 Mar 2003 12:51:25.151100 start raid.p0.s0 
10 Mar 2003 12:51:30.581583 list 
10 Mar 2003 12:52:32.837161 quit 
10 Mar 2003 12:54:26.495344 *** vinum started ***
10 Mar 2003 12:54:26.495817 create temp.conf 
drive d0 device /dev/da0a
10 Mar 2003 12:54:26.512027 *** Created devices ***
10 Mar 2003 12:54:53.493176 *** vinum started ***
10 Mar 2003 12:54:55.385954 l 
10 Mar 2003 12:55:18.403984 start raid.p0.s0 
10 Mar 2003 12:55:22.774523 l -r raid 
10 Mar 2003 13:07:05.527444 l -r raid 
[...]
10 Mar 2003 12:55:18.403984 start raid.p0.s0 
10 Mar 2003 13:52:15.579181 l -r raid 
10 Mar 2003 13:52:17.925615 l 
10 Mar 2003 13:52:33.437222 l -v 
10 Mar 2003 13:53:19.637191 l 
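
For anyone wanting to check their own vinum_history for the same oddity,
the following is a small Python sketch that flags repeated or
out-of-sequence entries.  It assumes the "DD Mon YYYY HH:MM:SS.uuuuuu
command" layout seen in the excerpt above; the function names are invented
for illustration, and it only detects the anomalies, it does not explain
why vinum writes them.

```python
from datetime import datetime

# Assumption: English month abbreviations, as in the excerpt (C/English locale).
TIMESTAMP_FORMAT = "%d %b %Y %H:%M:%S.%f"


def parse_history(text):
    """Return (timestamp, command) pairs; lines without a timestamp are skipped."""
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 4)
        if len(parts) < 5:
            continue  # e.g. "[...]" or a config line such as "drive d0 device ..."
        try:
            stamp = datetime.strptime(" ".join(parts[:4]), TIMESTAMP_FORMAT)
        except ValueError:
            continue
        entries.append((stamp, parts[4].strip()))
    return entries


def find_anomalies(entries):
    """Flag exact repeats and entries older than the newest one seen so far."""
    seen = set()
    latest = None
    anomalies = []
    for stamp, command in entries:
        if (stamp, command) in seen:
            anomalies.append(("duplicate", stamp, command))
        elif latest is not None and stamp < latest:
            anomalies.append(("out-of-order", stamp, command))
        seen.add((stamp, command))
        if latest is None or stamp > latest:
            latest = stamp
    return anomalies


sample = """\
10 Mar 2003 11:22:09.337534 start raid.p0.s0
10 Mar 2003 11:22:17.505487 l -r raid
10 Mar 2003 11:23:05.442661 l -r raid
[...]
10 Mar 2003 11:22:09.337534 start raid.p0.s0
10 Mar 2003 11:45:29.401210 *** vinum started ***
"""

for kind, stamp, command in find_anomalies(parse_history(sample)):
    print(kind, stamp, command)
```

Run against the first few lines of the excerpt, this reports the second
'start raid.p0.s0' entry as a duplicate of the first.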

-- 
===========================================================================
Scott Mitchell           | PGP Key ID | "Eagles may soar, but weasels
Cambridge, England       | 0x54B171B9 |  don't get sucked into jet engines"
scott at fishballoon.org | 0xAA775B8B |      -- Anon
