Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 5 Jan 2001 17:04:47 +0100
From:      Brad Knowles <blk@skynet.be>
To:        Greg Lehey <grog@lemis.com>
Cc:        Andy De Petter <andy.depetter@ops.skynet.be>, FreeBSD-STABLE Mailing List <freebsd-stable@freebsd.org>
Subject:   Problems with corrupted vinum devices...
Message-ID:  <v0422080ab67b97176227@[172.17.1.121]>

next in thread | raw e-mail | index | archive | help
Folks,

	I've read the man page, the vinum debugging page at 
<http://www.vinumvm.org/vinum/how-to-debug.html>, searched the 
archives of the stable mailing list, and still haven't found anything 
that would appear to address the situation we've just started 
experiencing.


	Many months ago, we had a Diablo USENET news spool/reader server 
that was up and operating fine, taking a slave feed from another news 
spool server, until such time as I was able to get separate 
reader-only servers working.

	I had separate /usr/news/history and /usr/news/spool filesystems 
set up on their own vinum RAID 1+0 volumes (striped and mirrored 
across slices on each of the four logical units that were exported 
from the drive array), as well as a set of four spool directories on 
their own "raw" slices of those same logical units.

	The hardware configuration is a Dell PowerEdge 1300 with dual 
450Mhz processors with 1GB of ECC RAM, running 4.1-STABLE (I last did 
a make update on Aug  8 16:27, according to the logfiles I keep), 
with an IBM DDRS-39130D DC2A drive as the boot disk.  No changes 
(that I know of) have been made to the system sources since this 
time, including vinum.


	Unfortunately, we ran into problems with the mail system, and had 
to "borrow" half of the drives in the external drive array that was 
attached to the news spool server.  I copied the data from the spool 
directories on the two logical units that would be disappearing to 
the other pair of spool directories that would be remaining, and then 
I broke the mirrors and pulled out half the drives and moved them 
over to the other drive array we have on the mail system.

	I was very careful to pull out drives associated with only the 
two logical units I had copied the data from, and that this would not 
cause any stripes to be broken.


	Fast forward to the next millenium, and we finally have the 
replacements for the disks in place, and we're starting the process 
of trying to revive this machine.  Even though we created the slices 
and the partitions correctly, we still weren't able to get vinum to 
recognize the restored disk devices correctly, and I wasn't aware of 
the page at <http://www.vinumvm.org/vinum/replacing-drive.html>.

	So, instead, I just copied all the data from the two filesystems 
that had been mirrored and striped, and then did a "resetconfig".  I 
then re-created the vinum configuration with exactly the same 
parameters as I had used the first time, and was able to get vinum to 
init and bring online the "restored" disk devices and rebuild the 
mirrors, etc....

	We then put filesystems on everything, and copied all the files 
back to their correct locations.  So far as we could tell, this 
seemed to have worked perfectly, and we were done.


	However, at some point within the next couple of hours (we're not 
quite sure when), the machine crashed, and vinum failed to come up on 
the reboot.  We had to comment out the /etc/fstab entries that 
referred to the vinum volumes, and since then have been trying to 
debug the problems as to why vinum can't find or start any devices at 
all.

	Following the instructions at 
<http://www.vinumvm.org/vinum/how-to-debug.html#config-problems>, I 
tried pulling the file log information as specified.  The exact 
command I used was (line-wrapped for readability):

		$ for i in /dev/da2s1e /dev/da2s2e /dev/da3s1e \
		/dev/da3s2e /dev/da4s1e /dev/da4s2e /dev/da5s1e \
		/dev/da5s2e; do (dd if=$i skip=8 count=6|tr -d \
		'\000-\011\200-\377'; echo) >> brad.log; done

	The first two disk devices (da2 & da3) produced gibberish.  The 
last two disk devices (da4 & da5) produced reasonably intelligible 
output, but instead of "IN VINO" as the first seven characters, I got 
"NO VINO" instead.

	A compressed version of the log file available is via my personal 
anonymous ftp space -- see 
<ftp://ftp.shub-internet.org/pub/shub/brad.log.gz>.  Yes, I know this 
file is trivially small, but if I compress it then I can make sure 
that no one attempts to transfer it in ASCII mode, and that if they 
do then they get what they deserve.  ;-)


	From the time when I did the resetconfig until now, here is the 
contents of /var/log/vinum_history:

  5 Jan 2001 12:36:58.476142 resetconfig
  5 Jan 2001 12:37:04.663644 *** Created devices ***
  5 Jan 2001 12:37:10.622895 quit
  5 Jan 2001 12:37:19.576839 *** vinum started ***
  5 Jan 2001 12:37:19.577725 create -f /etc/vinum.conf
drive d1 device /dev/da2s1e
drive d2 device /dev/da3s1e
drive d3 device /dev/da4s1e
drive d4 device /dev/da5s1e
drive d5 device /dev/da2s2e
drive d6 device /dev/da3s2e
drive d7 device /dev/da4s2e
drive d8 device /dev/da5s2e

volume history
         plex org striped 32m
                 sd length 0 drive d1
                 sd length 0 drive d2
         plex org striped 32m
                 sd length 0 drive d3
                 sd length 0 drive d4

volume spool
         plex org striped 32m
                 sd length 0 drive d8
                 sd length 0 drive d7
         plex org striped 32m
                 sd length 0 drive d6
                 sd length 0 drive d5
  5 Jan 2001 12:37:19.659734 *** Created devices ***
  5 Jan 2001 12:37:42.812629 *** vinum started ***
  5 Jan 2001 12:37:44.711438 ls -v
  5 Jan 2001 12:37:53.114029 start d6
  5 Jan 2001 12:37:59.311274 start d5
  5 Jan 2001 12:38:07.602677 ls -v
  5 Jan 2001 12:38:13.014144 start spool.p1.s1
  5 Jan 2001 12:38:26.887609 exit
  5 Jan 2001 12:38:28.403601 quit
  5 Jan 2001 12:41:32.279534 *** vinum started ***
  5 Jan 2001 12:41:33.542050 ls -v
  5 Jan 2001 12:41:46.965537 initialize spool.p1.s0
  5 Jan 2001 12:41:49.561185 init spool.p1.s0
  5 Jan 2001 12:41:54.304397 ls -v
  5 Jan 2001 12:42:11.275776 init history.p1.s0
  5 Jan 2001 12:42:12.599953 init history.p1.s1
  5 Jan 2001 12:42:16.010292 ls -v
  5 Jan 2001 12:48:07.540908 *** vinum started ***
  5 Jan 2001 12:48:08.955230 ls -v
  5 Jan 2001 12:52:45.324897 ls -v
  5 Jan 2001 12:38:13.014144 start spool.p1.s1
  5 Jan 2001 13:00:18.560459 *** vinum started ***
  5 Jan 2001 13:00:20.138779 ls -v
  5 Jan 2001 13:00:47.388334 *** vinum started ***
  5 Jan 2001 13:00:48.383848 ls -v
  5 Jan 2001 16:32:55.861137 *** vinum started ***
  5 Jan 2001 16:32:55.869648 start
  5 Jan 2001 15:35:02.271176 *** vinum started ***
  5 Jan 2001 15:35:03.369950 ls
  5 Jan 2001 15:35:04.178968 ls -v
  5 Jan 2001 15:36:18.528618 *** vinum started ***
  5 Jan 2001 15:36:21.619565 start
  5 Jan 2001 15:36:46.720739 ls -v
  5 Jan 2001 15:36:49.123128 ls -V
  5 Jan 2001 15:37:00.724413 ls -p
  5 Jan 2001 15:37:03.958826 ls -P
  5 Jan 2001 15:39:24.212310 *** vinum started ***
  5 Jan 2001 15:39:24.213161 start
  5 Jan 2001 15:53:52.092618 *** vinum started ***
  5 Jan 2001 15:53:53.778854 start
  5 Jan 2001 15:54:01.699490 quit
  5 Jan 2001 16:07:19.846860 *** vinum started ***
  5 Jan 2001 16:07:19.847727 makedev
  5 Jan 2001 16:07:19.848473 *** Created devices ***
  5 Jan 2001 16:07:25.074358 *** vinum started ***
  5 Jan 2001 16:07:26.330688 start
  5 Jan 2001 16:08:27.987415 *** vinum started ***
  5 Jan 2001 16:08:27.988301 read /dev/da2s1e

	From the time of the reboot ending with the time vinum was unable 
to read the last device, in /var/log/messages I find:

Jan  5 15:53:34 baloo /BALOO: SMP: AP CPU #1 Launched!
Jan  5 15:53:34 baloo /BALOO: Waiting 5 seconds for SCSI devices to settle
Jan  5 15:53:34 baloo /BALOO: da2 at ahc0 bus 0 target 6 lun 0
Jan  5 15:53:34 baloo /BALOO: da2: <HITACHI DF400 0000> Fixed Direct 
Access SCSI-2 device
Jan  5 15:53:34 baloo /BALOO: da2: 80.000MB/s transfers (40.000MHz, 
offset 15, 16bit), Tagged Queueing Enabled
Jan  5 15:53:34 baloo /BALOO: da2: 68235MB (139745280 512 byte 
sectors: 255H 63S/T 8698C)
Jan  5 15:53:34 baloo /BALOO: da4 at ahc1 bus 0 target 5 lun 0
Jan  5 15:53:34 baloo /BALOO: da4: <HITACHI DF400 0000> Fixed Direct 
Access SCSI-2 device
Jan  5 15:53:34 baloo /BALOO: da4: 80.000MB/s transfers (40.000MHz, 
offset 15, 16bit), Tagged Queueing Enabled
Jan  5 15:53:34 baloo /BALOO: da4: 68235MB (139745280 512 byte 
sectors: 255H 63S/T 8698C)
Jan  5 15:53:34 baloo /BALOO: da3 at ahc0 bus 0 target 6 lun 1
Jan  5 15:53:34 baloo /BALOO: da3: <HITACHI DF400 0000> Fixed Direct 
Access SCSI-2 device
Jan  5 15:53:34 baloo /BALOO: da3: 80.000MB/s transfers (40.000MHz, 
offset 15, 16bit), Tagged Queueing Enabled
Jan  5 15:53:34 baloo /BALOO: da3: 68235MB (139745280 512 byte 
sectors: 255H 63S/T 8698C)
Jan  5 15:53:34 baloo /BALOO: da5 at ahc1 bus 0 target 5 lun 1
Jan  5 15:53:34 baloo /BALOO: da5: <HITACHI DF400 0000> Fixed Direct 
Access SCSI-2 device
Jan  5 15:53:34 baloo /BALOO: da5: 80.000MB/s transfers (40.000MHz, 
offset 15, 16bit), Tagged Queueing Enabled
Jan  5 15:53:34 baloo /BALOO: da5: 68235MB (139745280 512 byte 
sectors: 255H 63S/T 8698C)
Jan  5 15:53:34 baloo /BALOO: Mounting root from ufs:/dev/da1s1a
Jan  5 15:53:34 baloo /BALOO: da1 at ahc2 bus 0 target 1 lun 0
Jan  5 15:53:34 baloo /BALOO: da1: <IBM DDRS-39130D DC2A> Fixed 
Direct Access SCSI-2 device
Jan  5 15:53:34 baloo /BALOO: da1: 80.000MB/s transfers (40.000MHz, 
offset 15, 16bit), Tagged Queueing Enabled
Jan  5 15:53:34 baloo /BALOO: da1: 8715MB (17850000 512 byte sectors: 
255H 63S/T 1111C)
Jan  5 15:53:34 baloo /BALOO: vinum: loaded
Jan  5 15:53:34 baloo /BALOO: vinum: reading configuration from /dev/da5s2e
Jan  5 15:53:34 baloo /BALOO: vinum_scandisk: /dev/da5s2e is down
Jan  5 15:53:34 baloo /BALOO: vinum: Can't read device /dev/da5s2e, error 5
Jan  5 15:53:34 baloo /BALOO: vinum: updating configuration from /dev/da2s2e
Jan  5 15:53:34 baloo /BALOO: vinum_scandisk: /dev/da2s2e is down
Jan  5 15:53:34 baloo /BALOO: vinum: Can't read device /dev/da2s2e, error 5
Jan  5 15:53:34 baloo /BALOO: vinum: updating configuration from /dev/da3s1e
Jan  5 15:53:34 baloo /BALOO: vinum_scandisk: /dev/da3s1e is down
Jan  5 15:53:34 baloo /BALOO: vinum: Can't read device /dev/da3s1e, error 5
Jan  5 15:53:34 baloo /BALOO: vinum: updating configuration from /dev/da3s2e
Jan  5 15:53:35 baloo /BALOO: vinum_scandisk: /dev/da3s2e is down
Jan  5 15:53:35 baloo /BALOO: vinum: Can't read device /dev/da3s2e, error 5
Jan  5 15:53:35 baloo /BALOO: vinum: updating configuration from /dev/da4s1e
Jan  5 15:53:35 baloo /BALOO: vinum_scandisk: /dev/da4s1e is down
Jan  5 15:53:35 baloo /BALOO: vinum: Can't read device /dev/da4s1e, error 5
Jan  5 15:53:35 baloo /BALOO: vinum: updating configuration from /dev/da4s2e
Jan  5 15:53:35 baloo /BALOO: vinum_scandisk: /dev/da4s2e is down
Jan  5 15:53:35 baloo /BALOO: vinum: Can't read device /dev/da4s2e, error 5
Jan  5 15:53:35 baloo /BALOO: vinum: updating configuration from /dev/da5s1e
Jan  5 15:53:35 baloo /BALOO: vinum_scandisk: /dev/da5s1e is down
Jan  5 15:53:35 baloo /BALOO: vinum: Can't read device /dev/da5s1e, error 5
Jan  5 15:53:35 baloo /BALOO: vinum: updating configuration from /dev/da2s1e
Jan  5 15:53:35 baloo /BALOO: vinum_scandisk: /dev/da2s1e is down
Jan  5 15:53:35 baloo /BALOO: vinum: Can't read device /dev/da2s1e, error 5
Jan  5 15:53:35 baloo /BALOO: vinum: couldn't read configuration


	Unfortunately, I'm at a loss as to understand where to go next in 
terms of debugging this matter.  Any and all help you can provide is 
appreciated.

	Thanks!

--
   These are my opinions -- not to be taken as official Skynet policy
======================================================================
Brad Knowles, <blk@skynet.be>                || Belgacom Skynet SA/NV
Systems Architect, Mail/News/FTP/Proxy Admin || Rue Colonel Bourg, 124
Phone/Fax: +32-2-706.13.11/12.49             || B-1140 Brussels
http://www.skynet.be                         || Belgium

"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety."
     -Benjamin Franklin, Historical Review of Pennsylvania.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-stable" in the body of the message




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?v0422080ab67b97176227>