Date: Mon, 20 Mar 2000 01:31:32 -0500 (EST)
From: cjohnson@camelot.com
To: FreeBSD-gnats-submit@freebsd.org
Subject: kern/17499: Can't revive VINUM RAID5
Message-ID: <20000320063132.3D74212C30@galahad.camelot.com>
>Number:         17499
>Category:       kern
>Synopsis:       Can't revive VINUM RAID5
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sun Mar 19 22:40:01 PST 2000
>Closed-Date:
>Last-Modified:
>Originator:     Christopher T. Johnson
>Release:        FreeBSD 4.0-RELEASE i386
>Organization:   Paladin Software
>Environment:
SMP FreeBSD 4.0-RELEASE, 128 Mbytes of memory, dual 233 MHz Pentium MMX.

su-2.03# camcontrol devlist
<QUANTUM FIREBALL ST4.3S 0F0C>    at scbus0 target 0 lun 0 (pass0,da0)
<EXABYTE EXB-8200 251K>           at scbus0 target 4 lun 0 (pass1,sa0)
<TOSHIBA CD-ROM XM-5701TA 0167>   at scbus0 target 5 lun 0 (pass2,cd0)
<IMS CDD2000/00 1.26>             at scbus0 target 6 lun 0 (pass3,cd1)
<FUJITSU M2954Q-512 0177>         at scbus1 target 0 lun 0 (pass4,da1)
<SEAGATE ST39140W 1281>           at scbus1 target 1 lun 0 (pass5,da2)
<SEAGATE ST43400N 1022>           at scbus1 target 2 lun 0 (pass6,da3)
<SEAGATE ST43400N 1028>           at scbus1 target 3 lun 0 (pass7,da4)
<QUANTUM FIREBALL_TM2110S 300X>   at scbus1 target 5 lun 0 (pass8,da5)
<SEAGATE ST43400N 1028>           at scbus1 target 6 lun 0 (pass9,da6)

su-2.03# vinum list
5 drives:
D drv3   State: up   Device /dev/da1s1e   Avail: 1372/4149 MB (33%)
D drv1   State: up   Device /dev/da3s1a   Avail: 0/2777 MB (0%)
D drv2   State: up   Device /dev/da4s1a   Avail: 0/2777 MB (0%)
D drv4   State: up   Device /dev/da5s1e   Avail: 0/2014 MB (0%)
D drv5   State: up   Device /dev/da6s1e   Avail: 0/2776 MB (0%)

2 volumes:
V quick    State: up   Plexes: 1   Size: 2013 MB
V myraid   State: up   Plexes: 1   Size: 8329 MB

2 plexes:
P quick.p0  C    State: up         Subdisks: 1   Size: 2013 MB
P myraid.p0 R5   State: degraded   Subdisks: 4   Size: 8329 MB

5 subdisks:
S quick.p0.s0    State: up      PO: 0 B      Size: 2013 MB
S myraid.p0.s0   State: up      PO: 0 B      Size: 2776 MB
S myraid.p0.s1   State: up      PO: 128 kB   Size: 2776 MB
S myraid.p0.s2   State: R 57%   PO: 256 kB   Size: 2776 MB
        *** Revive process for myraid.p0.s2 has died ***
S myraid.p0.s3   State: up      PO: 384 kB   Size: 2776 MB

>Description:
After replacing a bad drive in my RAID5 I have been unable to get the subdisk to revive.

After reinstalling the disk drive I used "fdisk -I da1" to put a slice map on the drive, and when "disklabel -w -r da1s1 auto" failed I used sysinstall to put a disklabel on the drive. Then I used "disklabel -e da1s1" to add an "e" partition of type vinum (da1s1e). With that done, I unmounted all my vinum volumes and did "vinum stop" followed by "vinum start". Once vinum discovered the new drive, I used "vinum create" to set "drive name drv3 device /dev/da1s1e". At this point drv3 came up and I did "start myraid.p0.s2" to start the revive. (The whole sequence is recapped below.)

After about 30 minutes vinum reported:

    can't revive myraid.p0.s2: Invalid argument

When I tried to restart the revive with another "start myraid.p0.s2", the start locked up trying to get a "vlock" or "vrlock" (according to ctrl-T). I was unable to interrupt vinum with ctrl-C, suspend it with ctrl-Z, or kill it with SIGKILL.

After "shutdown -r" plus pressing the reset button, I did a low-level format on the drive, reinstalled the slice and partition maps, started vinum again, and had the same failure mode. This time a "vinum stop" and "vinum start" got things running after the failure. The drive was then tested with:

    dd if=/dev/zero of=/dev/da1 bs=1024k

This generated NO errors. Reinstalled the slice/partitions, restarted vinum, and started the revive. Same failure at roughly the same point.
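To recap, the sequence used to prepare the replacement drive each time was roughly the following. This is a sketch reconstructed from the description above, not a verified script: the config-line syntax follows vinum(8), feeding "create" a temporary file (rather than typing at vinum's interactive prompt) is my assumption, and the mount points are hypothetical.

    # Sketch of the replacement-drive preparation steps described above.
    fdisk -I da1                   # write a dedicated slice table
    disklabel -w -r da1s1 auto     # this failed; sysinstall wrote the label instead
    disklabel -e da1s1             # add an "e" partition with fstype vinum
    umount /myraid /quick          # hypothetical mount points; unmount all vinum volumes
    vinum stop
    vinum start
    echo "drive drv3 device /dev/da1s1e" > /tmp/drv3.conf
    vinum create /tmp/drv3.conf    # bring the new drive into the configuration
    vinum start myraid.p0.s2       # begin reviving the stale subdisk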
Tried again with a different block size and had the same failure at roughly the same point.

After reading the source for a while I checked whether debugging was compiled in, and when I found it was, I turned on all the debug flags except "big drive" and "drop into the debugger": "debug 635(?)", then "start -w myraid.p0.s2". After verifying that it was logging, I aborted the revive, set "debug 0", and started reviving again. When we were close to the failure point I stopped the revive, set the debug flags once more, and restarted the revive.
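In outline, the capture procedure was as follows. This is a sketch: the "debug" command exists only when vinum is built with debugging, the flag value is the "635(?)" recalled above, and running the commands from the shell command line rather than at vinum's interactive prompt is my assumption.

    vinum debug 635                # all debug flags except "big drive" / "enter debugger"
    vinum start -w myraid.p0.s2    # -w: wait for the revive to complete
    tail -f /var/log/messages      # confirm the per-request logging is arriving
    vinum debug 0                  # then run quietly until close to the failure point
    vinum debug 635                # and turn logging back on for the final blocks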
This looks correct: dev 91,1 is /dev/vinum/myraid, the offset looks good, and the length is correct also. The 13,131104 looks "ok" but I don't know if it is "right". The offsets look good and the sd numbers are good. We read three "good" drives and write to the "reviving" drive (in a degraded RAID5 the three reads are XORed together to reconstruct the missing subdisk's data). Everything is cool here:

Mar 19 19:18:21 galahad /kernel: Read dev 91.1, offset 0x96c780, length 65536
Mar 19 19:18:21 galahad /kernel: Read dev 13.131104, sd 2, offset 0x324280, devoffset 0x324389, length 65536
Mar 19 19:18:21 galahad /kernel: Read dev 13.131124, sd 4, offset 0x324280, devoffset 0x324389, length 65536
Mar 19 19:18:21 galahad /kernel: Read dev 13.131096, sd 1, offset 0x324280, devoffset 0x324389, length 65536
Mar 19 19:18:22 galahad /kernel: Write dev 13.131084, sd 3, offset 0x324280, devoffset 0x324389, length 65536

The next block looks good too; the offset goes up by 64K, which is right:

Mar 19 19:18:22 galahad /kernel: Request: 0xc1025c80
Mar 19 19:18:22 galahad /kernel: Read dev 91.1, offset 0x96ca00, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131096, sd 1, offset 0x324300, devoffset 0x324409, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131124, sd 4, offset 0x324300, devoffset 0x324409, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131104, sd 2, offset 0x324300, devoffset 0x324409, length 65536
Mar 19 19:18:22 galahad /kernel: Write dev 13.131084, sd 3, offset 0x324300, devoffset 0x324409, length 65536

And again, everything looks good:

Mar 19 19:18:22 galahad /kernel: Request: 0xc1025c80
Mar 19 19:18:22 galahad /kernel: Read dev 91.1, offset 0x96ca80, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131096, sd 1, offset 0x324380, devoffset 0x324489, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131124, sd 4, offset 0x324380, devoffset 0x324489, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131104, sd 2, offset 0x324380, devoffset 0x324489, length 65536
Mar 19 19:18:22 galahad /kernel: Write dev 13.131084, sd 3, offset 0x324380, devoffset 0x324489, length 65536

And again:

Mar 19 19:18:22 galahad /kernel: Request: 0xc1025c80
Mar 19 19:18:22 galahad /kernel: Read dev 91.1, offset 0x96ce00, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131124, sd 4, offset 0x324400, devoffset 0x324509, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131096, sd 1, offset 0x324400, devoffset 0x324509, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131104, sd 2, offset 0x324400, devoffset 0x324509, length 65536
Mar 19 19:18:22 galahad /kernel: Write dev 13.131084, sd 3, offset 0x324400, devoffset 0x324509, length 65536

One more good block:

Mar 19 19:18:22 galahad /kernel: Request: 0xc1025c80
Mar 19 19:18:22 galahad /kernel: Read dev 91.1, offset 0x96ce80, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131124, sd 4, offset 0x324480, devoffset 0x324589, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131096, sd 1, offset 0x324480, devoffset 0x324589, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131104, sd 2, offset 0x324480, devoffset 0x324589, length 65536
Mar 19 19:18:22 galahad /kernel: Write dev 13.131084, sd 3, offset 0x324480, devoffset 0x324589, length 65536

Then it goes wrong. Where is the "Request: .*" log message? Where is the 91.1 read request? 0x96cf00 looks like the next block to fix in the PLEX, but why are we reading it for the SUBDISK? And where did this LENGTH come from!

Mar 19 19:18:22 galahad /kernel: Read dev 13.131104, sd 2, offset 0x96cf00, devoffset 0x96d009, length 2146172928
Mar 19 19:18:22 galahad /kernel: Read dev 13.131084, sd 3, offset 0x96cf00, devoffset 0x96d009, length 2146172928
Mar 19 19:18:22 galahad /kernel: Read dev 13.131124, sd 4, offset 0x96cf00, devoffset 0x96d009, length 2146172928

Mar 19 19:17:53 galahad vinum[963]: reviving myraid.p0.s2
Mar 19 19:18:15 galahad vinum[963]: can't revive myraid.p0.s2: Invalid argument
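One note on that length, which is my own arithmetic rather than anything vinum printed: 2146172928 is 0x7fec0000, only 0x140000 bytes short of 2^31, and an exact multiple of the 64K revive block. That makes it smell like a signed 32-bit quantity that has been clamped or wrapped somewhere, rather than any plausible transfer size; this is speculation on my part.

    # Quick arithmetic on the bogus length (plain sh + bc, nothing vinum-specific):
    printf '%x\n' 2146172928          # -> 7fec0000
    echo '2^31 - 2146172928' | bc     # -> 1310720, i.e. 0x140000 bytes below 2^31
    echo '2146172928 / 65536' | bc    # -> 32748, an exact count of 64K revive blocks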
>How-To-Repeat:
I don't know if this will repeat anywhere else, but I can offer ssh access to a vouched-for player who wants to test on galahad.
>Fix:
>Release-Note:
>Audit-Trail:
>Unformatted: