Date:      Fri, 26 Oct 2001 09:14:55 +0930
From:      Greg Lehey <grog@FreeBSD.org>
To:        Ben Eisenbraun <bene@nitrogen.nexthop.net>
Cc:        freebsd-questions@FreeBSD.org
Subject:   Re: recovery of corrupt vinum plexes?
Message-ID:  <20011026091455.A4706@wantadilla.lemis.com>
In-Reply-To: <20011025023359.D64298@nitrogen.nexthop.net>; from bene@nitrogen.nexthop.net on Thu, Oct 25, 2001 at 02:33:59AM -0400
References:  <20011023044950.A43848@nitrogen.nexthop.net> <20011023183023.M27668@wantadilla.lemis.com> <20011023055005.A44324@nitrogen.nexthop.net> <20011025103000.A25441@wantadilla.lemis.com> <20011025023359.D64298@nitrogen.nexthop.net>

On Thursday, 25 October 2001 at  2:33:59 -0400, Ben Eisenbraun wrote:
> On Thu, Oct 25, 2001 at 10:30:00AM +0930, Greg Lehey wrote:
>> On Tuesday, 23 October 2001 at  5:50:05 -0400, Ben Eisenbraun wrote:
> <snip a db> trace>
>
>> Hmm.  That could have been just about anything, probably a corrupt
>> request structure.  Without a dump it's difficult to say very much,
>> but in view of the fact that the drives have gone away, it's possible
>> that it was trying to talk to them anyway.  I'd like to see a dump of
>> this.
>
> I think I can reproduce this if I frob with it enough, but this
> kernel wasn't built with debug symbols, and the system sources have
> been recently updated, so I'm not sure if this would be useful.  Do
> you have any suggestions?

I'd need the debug output.  I won't find it otherwise.

>> OK, let's hope that only the Vinum labels are corrupted.  You have a
>> fair chance that the data section hasn't been overwritten, since
>> there's a copy of the config information (128 kB) between the label
>> and the data.  In that case, you should be able to recreate the
>> objects with this config file:
>>
>> device max3 device /dev/ad4s1e
>> device max4 device /dev/ad6s1e
>
> "device" in the first column produced an error.  Looking at the config,
> I figured you meant "drive max# device /dev/etc", so I swapped "drive"
> for the first "device" in each line.  Here's the output of the 'create',
> 'start', 'list' and 'list -v' commands.
>
> vinum -> list
> 8 drives:
> D max3                  State: up       Device /dev/ad4e        Avail: 0/57239 MB (0%)
> D max4                  State: up       Device /dev/ad6e        Avail: 0/57239 MB (0%)
> D max1                  State: up       Device /dev/ad0s1e      Avail: 0/19529 MB (0%)
> D max2                  State: up       Device /dev/ad2s1e      Avail: 0/19529 MB (0%)
> D wd1                   State: up       Device /dev/ad8s1e      Avail: 0/57239 MB (0%)
> D wd2                   State: up       Device /dev/ad10s1e     Avail: 0/57239 MB (0%)
>
> 2 volumes:
> V stripe-mirror         State: up       Plexes:       2 Size:        111 GB
> V var-mirror            State: up       Plexes:       2 Size:         19 GB
>
> 4 plexes:
> P stripe-mirror.p0    S State: corrupt  Subdisks:     2 Size:        111 GB
> P stripe-mirror.p1    S State: corrupt  Subdisks:     2 Size:        111 GB
> P var-mirror.p0       C State: up       Subdisks:     1 Size:         19 GB
> P var-mirror.p1       C State: up       Subdisks:     1 Size:         19 GB
>
> 6 subdisks:
> S stripe-mirror.p0.s0   State: crashed  PO:        0  B Size:         55 GB
> S stripe-mirror.p0.s1   State: up       PO:      512 kB Size:         55 GB
> S stripe-mirror.p1.s0   State: crashed  PO:        0  B Size:         55 GB
> S stripe-mirror.p1.s1   State: up       PO:      512 kB Size:         55 GB
> S var-mirror.p0.s0      State: up       PO:        0  B Size:         19 GB
> S var-mirror.p1.s0      State: up       PO:        0  B Size:         19 GB

Good!

> For some reason, the drives came back at ad[46]e not ad[46]s1e.

This is not so good.  It suggests that the partition table has
changed.  Are you sure you haven't run fdisk or disklabel on these
drives?

> I was thinking about recent changes to the system config, since
> this setup had run reliably for several months off the same
> sources, and it was rebooted about 1 week before this crash
> happened.  Some new sysctl's took effect after that reboot:
>
> kern.ipc.somaxconn=2048
> net.inet.icmp.drop_redirect=1
> net.inet.icmp.log_redirect=1
> net.inet.tcp.sendspace=32768
> net.inet.tcp.recvspace=32768
> vfs.vmiodirenable=1

No, it has nothing to do with them.

> Also, 2-3 days before the problems started occurring, I had
> created some additional swap space on both ad4 and ad6.  Those
> are the only lowlevel changes made to the system since it was built
> from the running sources.

This could be an issue: adding the swap partitions meant changing the
disklabel on both drives.

> FreeBSD  4.4-RC FreeBSD 4.4-RC #0: Tue Aug 21 20:53:12 EDT 2001
> root@whiskey.klatsch.org:/usr/obj/usr/src/sys/WHISKEY  i386
>
> root@ [10:21pm][~]>>disklabel /dev/ad4s1
> <snip>
> 8 partitions:
> #        size   offset    fstype   [fsize bsize bps/cpg]
>   b:  2097152 117226242      swap                       # (Cyl. 7296*- 7427*)
>   c: 120053682        0    unused        0     0        # (Cyl.    0 - 7472*)
>   e: 117226242        0     vinum                       # (Cyl.    0 - 7296*)
>
> root@ [10:23pm][~]>>disklabel /dev/ad6s1
> <snip>
> 8 partitions:
> #        size   offset    fstype   [fsize bsize bps/cpg]
>   b:  2097152 117226242      swap                       # (Cyl. 7296*- 7427*)
>   c: 120053682        0    unused        0     0        # (Cyl.    0 - 7472*)
>   e: 117226242        0     vinum                       # (Cyl.    0 - 7296*)

That looks OK, however.  You didn't run fdisk?
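
For reference, the arithmetic behind "looks OK" can be checked
mechanically.  This is only a minimal sketch in Python: the partition
numbers are copied from the disklabel output quoted above, the
512-byte sector size is an assumption, and the helper names are made
up for illustration.

```python
# Partition entries copied from the disklabel output quoted above
# (identical for ad4s1 and ad6s1): name -> (size, offset), in sectors.
PARTITIONS = {
    "b": (2097152, 117226242),   # swap
    "c": (120053682, 0),         # whole slice
    "e": (117226242, 0),         # vinum
}

SECTOR_BYTES = 512  # assumption: standard 512-byte sectors


def end(name):
    """First sector past the end of the named partition."""
    size, offset = PARTITIONS[name]
    return offset + size


def overlaps(a, b):
    """True if partitions a and b share at least one sector."""
    return PARTITIONS[a][1] < end(b) and PARTITIONS[b][1] < end(a)


def size_mb(name):
    """Partition size in whole megabytes (2**20 bytes)."""
    return PARTITIONS[name][0] * SECTOR_BYTES // 2**20


print("b overlaps e:", overlaps("b", "e"))
print("b fits in slice:", end("b") <= end("c"))
print("e size in MB:", size_mb("e"))
```

With these numbers the swap partition starts exactly where the vinum
partition ends, so the two don't overlap, and the size of the e
partition comes out at the 57239 MB that vinum reports for max3 and
max4.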

> I'm happy to try anything that would assist you in tracking down the
> problem, and I can arrange for console access if it would be
> helpful.

OK, *assuming* that the data is still there in the same place, you
should be able to do:

 vinum -> setstate up stripe-mirror.p0  stripe-mirror.p0.s1 stripe-mirror.p1  stripe-mirror.p1.s1

This will set them into the "up" state.  Next do an fsck -n on each of
the plexes (/dev/vinum/plex/stripe-mirror.p0 and
/dev/vinum/plex/stripe-mirror.p1).  Whatever you do, make sure you
don't write anything back to the disk: with -n, fsck answers "no" to
every question and doesn't open the file system for writing.  See how
bad fsck thinks the damage is.  Don't send me the whole output, just
the beginning.
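
If you'd rather script the check, something along these lines would
do.  It's a rough sketch in Python, not a tested procedure: the helper
names and the 30-line cutoff are arbitrary choices for illustration,
and it assumes fsck is on the PATH and that you run it as root.

```python
import subprocess

# Plex device nodes named in the text above.
PLEXES = [
    "/dev/vinum/plex/stripe-mirror.p0",
    "/dev/vinum/plex/stripe-mirror.p1",
]


def fsck_command(device):
    """Build a read-only fsck invocation; -n answers "no" to every question."""
    return ["fsck", "-n", device]


def head(text, max_lines=30):
    """Return at most the first max_lines lines of text."""
    return "\n".join(text.splitlines()[:max_lines])


def check_plexes(max_lines=30):
    """Run fsck -n on each plex and print the start of its output."""
    for device in PLEXES:
        result = subprocess.run(
            fsck_command(device),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        print("==", device, "exit status", result.returncode)
        print(head(result.stdout, max_lines))
```

Nothing runs until you call check_plexes().  Since the command is
always built with -n, fsck only reports what it finds and never
repairs anything.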

Greg
--
When replying to this message, please copy the original recipients.
If you don't, I may ignore the reply.
For more information, see http://www.lemis.com/questions.html
See complete headers for address and phone numbers

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-questions" in the body of the message



