From owner-freebsd-stable Fri Jan 5 8: 5:24 2001 From owner-freebsd-stable@FreeBSD.ORG Fri Jan 5 08:05:17 2001 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from apoq.skynet.be (apoq.skynet.be [195.238.2.35]) by hub.freebsd.org (Postfix) with ESMTP id 6097437B402 for ; Fri, 5 Jan 2001 08:05:17 -0800 (PST) Received: from [172.17.1.121] (warp-core.skynet.be [195.238.2.25]) by apoq.skynet.be (Postfix) with ESMTP id 21FDD9C9B; Fri, 5 Jan 2001 17:05:03 +0100 (MET) Mime-Version: 1.0 X-Sender: blk@pop.skynet.be Message-Id: Date: Fri, 5 Jan 2001 17:04:47 +0100 To: Greg Lehey From: Brad Knowles Subject: Problems with corrupted vinum devices... Cc: Andy De Petter , FreeBSD-STABLE Mailing List Content-Type: text/plain; charset="us-ascii" ; format="flowed" Sender: owner-freebsd-stable@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG Folks, I've read the man page, the vinum debugging page at , searched the archives of the stable mailing list, and still haven't found anything that would appear to address the situation we've just started experiencing. Many months ago, we had a Diablo USENET news spool/reader server that was up and operating fine, taking a slave feed from another news spool server, until such time as I was able to get separate reader-only servers working. I had separate /usr/news/history and /usr/news/spool filesystems set up on their own vinum RAID 1+0 volumes (striped and mirrored across slices on each of the four logical units that were exported from the drive array), as well as a set of four spool directories on their own "raw" slices of those same logical units. The hardware configuration is a Dell PowerEdge 1300 with dual 450Mhz processors with 1GB of ECC RAM, running 4.1-STABLE (I last did a make update on Aug 8 16:27, according to the logfiles I keep), with an IBM DDRS-39130D DC2A drive as the boot disk. No changes (that I know of) have been made to the system sources since this time, including vinum. Unfortunately, we ran into problems with the mail system, and had to "borrow" half of the drives in the external drive array that was attached to the news spool server. I copied the data from the spool directories on the two logical units that would be disappearing to the other pair of spool directories that would be remaining, and then I broke the mirrors and pulled out half the drives and moved them over to the other drive array we have on the mail system. I was very careful to pull out drives associated with only the two logical units I had copied the data from, and that this would not cause any stripes to be broken. Fast forward to the next millenium, and we finally have the replacements for the disks in place, and we're starting the process of trying to revive this machine. Even though we created the slices and the partitions correctly, we still weren't able to get vinum to recognize the restored disk devices correctly, and I wasn't aware of the page at . So, instead, I just copied all the data from the two filesystems that had been mirrored and striped, and then did a "resetconfig". I then re-created the vinum configuration with exactly the same parameters as I had used the first time, and was able to get vinum to init and bring online the "restored" disk devices and rebuild the mirrors, etc.... We then put filesystems on everything, and copied all the files back to their correct locations. So far as we could tell, this seemed to have worked perfectly, and we were done. However, at some point within the next couple of hours (we're not quite sure when), the machine crashed, and vinum failed to come up on the reboot. We had to comment out the /etc/fstab entries that referred to the vinum volumes, and since then have been trying to debug the problems as to why vinum can't find or start any devices at all. Following the instructions at , I tried pulling the file log information as specified. The exact command I used was (line-wrapped for readability): $ for i in /dev/da2s1e /dev/da2s2e /dev/da3s1e \ /dev/da3s2e /dev/da4s1e /dev/da4s2e /dev/da5s1e \ /dev/da5s2e; do (dd if=$i skip=8 count=6|tr -d \ '\000-\011\200-\377'; echo) >> brad.log; done The first two disk devices (da2 & da3) produced gibberish. The last two disk devices (da4 & da5) produced reasonably intelligible output, but instead of "IN VINO" as the first seven characters, I got "NO VINO" instead. A compressed version of the log file available is via my personal anonymous ftp space -- see . Yes, I know this file is trivially small, but if I compress it then I can make sure that no one attempts to transfer it in ASCII mode, and that if they do then they get what they deserve. ;-) From the time when I did the resetconfig until now, here is the contents of /var/log/vinum_history: 5 Jan 2001 12:36:58.476142 resetconfig 5 Jan 2001 12:37:04.663644 *** Created devices *** 5 Jan 2001 12:37:10.622895 quit 5 Jan 2001 12:37:19.576839 *** vinum started *** 5 Jan 2001 12:37:19.577725 create -f /etc/vinum.conf drive d1 device /dev/da2s1e drive d2 device /dev/da3s1e drive d3 device /dev/da4s1e drive d4 device /dev/da5s1e drive d5 device /dev/da2s2e drive d6 device /dev/da3s2e drive d7 device /dev/da4s2e drive d8 device /dev/da5s2e volume history plex org striped 32m sd length 0 drive d1 sd length 0 drive d2 plex org striped 32m sd length 0 drive d3 sd length 0 drive d4 volume spool plex org striped 32m sd length 0 drive d8 sd length 0 drive d7 plex org striped 32m sd length 0 drive d6 sd length 0 drive d5 5 Jan 2001 12:37:19.659734 *** Created devices *** 5 Jan 2001 12:37:42.812629 *** vinum started *** 5 Jan 2001 12:37:44.711438 ls -v 5 Jan 2001 12:37:53.114029 start d6 5 Jan 2001 12:37:59.311274 start d5 5 Jan 2001 12:38:07.602677 ls -v 5 Jan 2001 12:38:13.014144 start spool.p1.s1 5 Jan 2001 12:38:26.887609 exit 5 Jan 2001 12:38:28.403601 quit 5 Jan 2001 12:41:32.279534 *** vinum started *** 5 Jan 2001 12:41:33.542050 ls -v 5 Jan 2001 12:41:46.965537 initialize spool.p1.s0 5 Jan 2001 12:41:49.561185 init spool.p1.s0 5 Jan 2001 12:41:54.304397 ls -v 5 Jan 2001 12:42:11.275776 init history.p1.s0 5 Jan 2001 12:42:12.599953 init history.p1.s1 5 Jan 2001 12:42:16.010292 ls -v 5 Jan 2001 12:48:07.540908 *** vinum started *** 5 Jan 2001 12:48:08.955230 ls -v 5 Jan 2001 12:52:45.324897 ls -v 5 Jan 2001 12:38:13.014144 start spool.p1.s1 5 Jan 2001 13:00:18.560459 *** vinum started *** 5 Jan 2001 13:00:20.138779 ls -v 5 Jan 2001 13:00:47.388334 *** vinum started *** 5 Jan 2001 13:00:48.383848 ls -v 5 Jan 2001 16:32:55.861137 *** vinum started *** 5 Jan 2001 16:32:55.869648 start 5 Jan 2001 15:35:02.271176 *** vinum started *** 5 Jan 2001 15:35:03.369950 ls 5 Jan 2001 15:35:04.178968 ls -v 5 Jan 2001 15:36:18.528618 *** vinum started *** 5 Jan 2001 15:36:21.619565 start 5 Jan 2001 15:36:46.720739 ls -v 5 Jan 2001 15:36:49.123128 ls -V 5 Jan 2001 15:37:00.724413 ls -p 5 Jan 2001 15:37:03.958826 ls -P 5 Jan 2001 15:39:24.212310 *** vinum started *** 5 Jan 2001 15:39:24.213161 start 5 Jan 2001 15:53:52.092618 *** vinum started *** 5 Jan 2001 15:53:53.778854 start 5 Jan 2001 15:54:01.699490 quit 5 Jan 2001 16:07:19.846860 *** vinum started *** 5 Jan 2001 16:07:19.847727 makedev 5 Jan 2001 16:07:19.848473 *** Created devices *** 5 Jan 2001 16:07:25.074358 *** vinum started *** 5 Jan 2001 16:07:26.330688 start 5 Jan 2001 16:08:27.987415 *** vinum started *** 5 Jan 2001 16:08:27.988301 read /dev/da2s1e From the time of the reboot ending with the time vinum was unable to read the last device, in /var/log/messages I find: Jan 5 15:53:34 baloo /BALOO: SMP: AP CPU #1 Launched! Jan 5 15:53:34 baloo /BALOO: Waiting 5 seconds for SCSI devices to settle Jan 5 15:53:34 baloo /BALOO: da2 at ahc0 bus 0 target 6 lun 0 Jan 5 15:53:34 baloo /BALOO: da2: Fixed Direct Access SCSI-2 device Jan 5 15:53:34 baloo /BALOO: da2: 80.000MB/s transfers (40.000MHz, offset 15, 16bit), Tagged Queueing Enabled Jan 5 15:53:34 baloo /BALOO: da2: 68235MB (139745280 512 byte sectors: 255H 63S/T 8698C) Jan 5 15:53:34 baloo /BALOO: da4 at ahc1 bus 0 target 5 lun 0 Jan 5 15:53:34 baloo /BALOO: da4: Fixed Direct Access SCSI-2 device Jan 5 15:53:34 baloo /BALOO: da4: 80.000MB/s transfers (40.000MHz, offset 15, 16bit), Tagged Queueing Enabled Jan 5 15:53:34 baloo /BALOO: da4: 68235MB (139745280 512 byte sectors: 255H 63S/T 8698C) Jan 5 15:53:34 baloo /BALOO: da3 at ahc0 bus 0 target 6 lun 1 Jan 5 15:53:34 baloo /BALOO: da3: Fixed Direct Access SCSI-2 device Jan 5 15:53:34 baloo /BALOO: da3: 80.000MB/s transfers (40.000MHz, offset 15, 16bit), Tagged Queueing Enabled Jan 5 15:53:34 baloo /BALOO: da3: 68235MB (139745280 512 byte sectors: 255H 63S/T 8698C) Jan 5 15:53:34 baloo /BALOO: da5 at ahc1 bus 0 target 5 lun 1 Jan 5 15:53:34 baloo /BALOO: da5: Fixed Direct Access SCSI-2 device Jan 5 15:53:34 baloo /BALOO: da5: 80.000MB/s transfers (40.000MHz, offset 15, 16bit), Tagged Queueing Enabled Jan 5 15:53:34 baloo /BALOO: da5: 68235MB (139745280 512 byte sectors: 255H 63S/T 8698C) Jan 5 15:53:34 baloo /BALOO: Mounting root from ufs:/dev/da1s1a Jan 5 15:53:34 baloo /BALOO: da1 at ahc2 bus 0 target 1 lun 0 Jan 5 15:53:34 baloo /BALOO: da1: Fixed Direct Access SCSI-2 device Jan 5 15:53:34 baloo /BALOO: da1: 80.000MB/s transfers (40.000MHz, offset 15, 16bit), Tagged Queueing Enabled Jan 5 15:53:34 baloo /BALOO: da1: 8715MB (17850000 512 byte sectors: 255H 63S/T 1111C) Jan 5 15:53:34 baloo /BALOO: vinum: loaded Jan 5 15:53:34 baloo /BALOO: vinum: reading configuration from /dev/da5s2e Jan 5 15:53:34 baloo /BALOO: vinum_scandisk: /dev/da5s2e is down Jan 5 15:53:34 baloo /BALOO: vinum: Can't read device /dev/da5s2e, error 5 Jan 5 15:53:34 baloo /BALOO: vinum: updating configuration from /dev/da2s2e Jan 5 15:53:34 baloo /BALOO: vinum_scandisk: /dev/da2s2e is down Jan 5 15:53:34 baloo /BALOO: vinum: Can't read device /dev/da2s2e, error 5 Jan 5 15:53:34 baloo /BALOO: vinum: updating configuration from /dev/da3s1e Jan 5 15:53:34 baloo /BALOO: vinum_scandisk: /dev/da3s1e is down Jan 5 15:53:34 baloo /BALOO: vinum: Can't read device /dev/da3s1e, error 5 Jan 5 15:53:34 baloo /BALOO: vinum: updating configuration from /dev/da3s2e Jan 5 15:53:35 baloo /BALOO: vinum_scandisk: /dev/da3s2e is down Jan 5 15:53:35 baloo /BALOO: vinum: Can't read device /dev/da3s2e, error 5 Jan 5 15:53:35 baloo /BALOO: vinum: updating configuration from /dev/da4s1e Jan 5 15:53:35 baloo /BALOO: vinum_scandisk: /dev/da4s1e is down Jan 5 15:53:35 baloo /BALOO: vinum: Can't read device /dev/da4s1e, error 5 Jan 5 15:53:35 baloo /BALOO: vinum: updating configuration from /dev/da4s2e Jan 5 15:53:35 baloo /BALOO: vinum_scandisk: /dev/da4s2e is down Jan 5 15:53:35 baloo /BALOO: vinum: Can't read device /dev/da4s2e, error 5 Jan 5 15:53:35 baloo /BALOO: vinum: updating configuration from /dev/da5s1e Jan 5 15:53:35 baloo /BALOO: vinum_scandisk: /dev/da5s1e is down Jan 5 15:53:35 baloo /BALOO: vinum: Can't read device /dev/da5s1e, error 5 Jan 5 15:53:35 baloo /BALOO: vinum: updating configuration from /dev/da2s1e Jan 5 15:53:35 baloo /BALOO: vinum_scandisk: /dev/da2s1e is down Jan 5 15:53:35 baloo /BALOO: vinum: Can't read device /dev/da2s1e, error 5 Jan 5 15:53:35 baloo /BALOO: vinum: couldn't read configuration Unfortunately, I'm at a loss as to understand where to go next in terms of debugging this matter. Any and all help you can provide is appreciated. Thanks! -- These are my opinions -- not to be taken as official Skynet policy ====================================================================== Brad Knowles, || Belgacom Skynet SA/NV Systems Architect, Mail/News/FTP/Proxy Admin || Rue Colonel Bourg, 124 Phone/Fax: +32-2-706.13.11/12.49 || B-1140 Brussels http://www.skynet.be || Belgium "They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, Historical Review of Pennsylvania. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-stable" in the body of the message