Date:      Fri, 11 Jun 1999 14:55:58 +0100 (BST)
From:      Kiril Mitev <kiril@ideaglobal.com>
To:        grog@folly.lemis.com (Greg Lehey)
Cc:        kiril@ideaglobal.com, Cy.Schubert@uumail.gov.bc.ca, freebsd-stable@FreeBSD.ORG
Subject:   Re: vinum disk has gone AWOL, help!
Message-ID:  <199906111355.OAA17035@ideaglobal.com>
In-Reply-To: <19990610162145.22313@folly.lemis.com> from "Greg Lehey" at Jun 10, 99 04:21:45 pm

OK :-) I'll try to give the "big picture", see below for details...

> 
> On Friday,  4 June 1999 at 21:58:07 +0100, Kiril Mitev wrote:
> >> In message <199906041920.UAA17798@ideaglobal.com>, Kiril Mitev writes:
> >>> (Sorry if this is more appropriate for -questions...)
> >>>
> >>>
> >>> after my last reboot, which was NOT a panic or anything like that
> >>> my vinum volume sort of disappeared...
> >>>
> >>> vinum itself is still happy, and a listing shows all my
> >>> bits & pieces, up to the volume level as OK.
> >>>
> >> If that fails, try fsck -b <another number> <filesystem>.  To get
> >> the other block numbers for alternate superblocks, use newfs -N,
> >> which will go through all of the motions of creating a filesystem,
> >> e.g. print the superblock numbers, without actually creating a
> >> filesystem.  If this fails, you're pretty much hosed.
> >
> > That looks like it might actually fix it, unless there is
> > more corruption somewhere
> 
> This sounds funny.  If you have any evidence that it was caused by
> Vinum, I'd be very interested to see it.  But I suspect that the cause
> is elsewhere, and Vinum has little to do with it.
> 
> Greg
> --

The evidence (as they say) is "purely circumstantial"...

Let me explain the h/w setup first. The box in question has a 
dual P2 MB with the built-in Adaptec 7890 2xUW SCSI
(7895? something like that)

The "primary" scsi channel has a UW side and an "old-fashioned" side;
the "secondary" channel has only a UW bus, for a total of 3 connectors
on the MB.

The boot-critical stuff (/,/usr,/var) is located on an IDE disk, due 
to previous painful experiences with SCSI weirdness. Obviously this is 
the boot disk, and the Adaptec is configured NOT to setup BIOS disk 
devices.


scsi 1 UW has 3 nice 9gb disks hanging off it, with scsi id's of
0,1,2. these disks are located inside the PC itself, with disk id 2
being the closest to the controller, and disk id 0 furthest away, last
on that bus and terminated.
these disks map to da0,da1 and da2 in F-BSD

scsi 2 UW has 2 more 4gb disks on it, with scsi id's of 4 and 5, disk5 
is last on the bus and terminated.
these disks map to da3 and da4

/,/usr,/var are on a single IDE disk.
da0 is mounted on /home as a normal disk
da1,2,3,4 are bunched together into a 25gb vinum volume

with me so far ? OK...
the original problem appeared when I tried to plug in a scsi tape 
to backup my stuff, since I was intending to upgrade from 3.1 to 3.2

the tape drive, scsi id 6 was connected to the "slow" connector 
on the primary scsi, and there was an active terminator after the tape
on that bus.

on the very first reboot the scsi disks went into a "device in timeout,
device not in timeout" loop which got nowhere, so i had to hit the 
good ole reset button and start playing with the box.

(remember that the tape drive shares a bus with 1 non-vinum and 2 vinum
disks? good)

my first thought was that the termination got screwed, so i spent
quite a few hours testing the various termination options on the
devices and on the controller - same result every time (well, almost
every time. in some configurations, the controller lost the disks
partially or completely)

just for the fun of it, i put the tape drive on a different scsi id - 
no change

further investigation showed that the eternal timeout loop occurs 
on the 2 vinum disks that shared the bus with the tape. the test
pattern was (more or less) like this...


boot -s, fsck /home - no problem, reboot
boot -s, start vinum, fsck vinum volume - timeout & hang, reboot
boot -s, start vinum, dd if=/dev/da1 of=/dev/null count=10 (a vinum disk)
         - timeout/hang, reboot
boot -s, DONT start vinum, dd from the non-vinum disk - no problem, reboot
boot -s, DONT start vinum, dd from vinum disk - no problem, reboot
boot -s, start vinum, dd from non-vinum disk - no problem, reboot

(start vinum means running the following commands:
 cd /etc/
 . ./rc.conf
 vinum read $vinum_drives
)
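
for anyone who wants to repeat the pattern, here is a rough dry-run
sketch of it. it only prints the steps and touches no hardware. the
device names (da0 = the non-vinum disk, da1 = a vinum disk) come from
the setup described earlier; the function names read_test and test_case
are made up for this illustration.

```shell
#!/bin/sh
# Dry-run sketch of the test pattern: prints the steps, runs none of them.

# Print the read test used against a raw disk ($1 = device node).
read_test() {
    echo "dd if=$1 of=/dev/null count=10"
}

# Print one test case: $1 = yes/no (start vinum first), $2 = disk to read.
test_case() {
    if [ "$1" = yes ]; then
        # the "start vinum" commands exactly as listed in this message
        echo "cd /etc/ && . ./rc.conf && vinum read \$vinum_drives"
    fi
    read_test "$2"
}

# Walk the four combinations; each one starts from a fresh "boot -s".
for vinum in no yes; do
    for disk in /dev/da0 /dev/da1; do
        echo "--- boot -s, vinum started: $vinum, disk: $disk"
        test_case "$vinum" "$disk"
    done
done
```

on my box, only the "vinum started: yes, disk: /dev/da1" combination
gave the timeout and hang.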

tested the above with a non-SMP kernel - identical results...

I finally gave up and did the backup by partially copying from the 
vinum disk to the non-vinum one with the tape unplugged, rebooting
into single-user mode with the tape plugged in and backing up.
{repeat 6 times :-)))) } 
it wasn't as bad as it sounds, since a single
tape would not have held the full disk anyway...

updated the OS without a hitch (kudos to the developers, btw), so 
i did not need those backups after all :-))

^^^^^ that was 3.1 ^^^^^^^
vvvvv this is 3.2 vvvvvv

a couple of weeks later I decided to test a cd-burner on the box,
so i plugged one in - in exactly the same spot both in terms of 
physical connections and scsi id's as i had the tape drive.

i was quite apprehensive that the same sort of thing might happen
so i made sure i booted into single-user mode first, which went ok,
played with the burner a bit, then decided to reboot into multi
user to see what happens...

as soon as the vinum volume came up for fsck, I got the scsi errors and
that very nasty fsck error, panicked and took out the burner.

the next reboot did not have the scsi timeouts. neither did it have
a valid partition, which caused a lot of stress until someone pointed out
to me (very politely, though i did not deserve it in retrospect :-))))
that I can use an alternative super block for fsck ....
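
for the archives, a rough sketch of that recovery as a dry run: it only
prints the fsck attempts, one per candidate superblock. the device
/dev/vinum/myvol and the numbers 32 and 8288 are placeholders, not
values from my box; the real numbers are whatever newfs -N prints, as
described in the quoted advice above. fsck_attempts is a name made up
for this illustration.

```shell
#!/bin/sh
# Dry-run sketch: prints fsck attempts, runs nothing.
# /dev/vinum/myvol and the block numbers below are placeholders;
# use the backup superblock numbers that newfs -N reports.

# $1 = filesystem device, remaining args = candidate superblock numbers
fsck_attempts() {
    dev=$1
    shift
    for sb in "$@"; do
        echo "fsck -b $sb $dev"
    done
}

fsck_attempts /dev/vinum/myvol 32 8288
```

the idea is to try the candidates one at a time and stop at the first
one fsck accepts.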

happy end of story.

but to answer your question - no, I cannot guarantee that it was
vinum's fault, but I _think_ I have eliminated all other variables

i am also (understandably, i hope) rather hesitant to do any more
playing around with the hardware.... but i can run a few tests if you
can tell me what to do 


Kiril




To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-stable" in the body of the message



