Date: Sat, 31 Mar 2007 13:11:16 -0600
From: "Rick C. Petty" <rick-freebsd@kiwi-computer.com>
To: freebsd-geom@freebsd.org
Subject: gvinum rebuildparity breakage
Message-ID: <20070331191116.GA87710@keira.kiwi-computer.com>
Due to ata driver problems, one of the disks holding part of my RAID5 volumes was dropped (see below). gvinum did the right thing: it reported the drive lost, and all the relevant subdisks in my RAID5 plexes were changed to degraded.

Problem 1). The ata driver would not bring up the dropped drive. Numerous attempts at "atacontrol reinit" and detach/attach sequences would not bring the drive back. I've reported this problem in the past with this Promise PDC40718 SATA300 controller card and have not seen or heard of a resolution. My guess is the ata folks blame it on the Promise chip; fine.

Problem 2). The geom subsystem continued to spit out error messages yet failed to allow the volumes to be unmounted. I've yet to witness geom do anything gracefully, such as allow the mountpoints to be forcibly unmounted (without panicking) or perform a shutdown without dirty buffers. Fine.

Problems 1 & 2 required a reboot; so much for my 115-day uptime. I'll accept this, because I expect gvinum to do the Right Thing... which brings me to:

Problem 3). I have four volumes with degraded RAID5 plexes, as is expected from the dropped drive. However, the drive is up now but gvinum won't let me rebuild the parity:

# gvinum rebuildparity music.p0
gvinum: plex music.p0 is not completely accessible

What a strange message, unless it's referring to the stale subdisk, but it won't let me do anything about that either:

# gvinum start music.p0.s0
gvinum: can't start: cannot start 'music.p0.s0' - not yet supported

I'm not keen to try forcibly setting the state with "gvinum setstate -f", because the man page states this is for diagnostic purposes only, and the man page doesn't give a list of possible states anyway. The documentation has been lacking since the 5.x nerf of vinum/gvinum, so the man page and "gvinum help" are quite useless (Problem #4). I'd rather not experiment with this command since the volumes are important, so I'm left without options.

So, how do I rebuild my RAID5 volumes? This used to work fine in pre-geom vinum, and I'm pretty sure it worked in gvinum at one point. I'm running 6.1-STABLE (RELENG_6 as of 2006-Jul-21) with a GENERIC kernel.
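For reference, this is the sequence I would have expected to work, going by the old vinum(8) documentation; I haven't actually run anything past the first two commands, and whether gvinum on 6.1 honors it is exactly what's in question here:

# gvinum start music.p0.s0        (revive the stale subdisk from the up subdisks; fails as shown above)
# gvinum checkparity music.p0     (verify parity once the subdisk is up again)
# gvinum rebuildparity music.p0   (rewrite any bad parity that checkparity finds)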
Here is my vinum configuration:

# gvinum l
8 drives:
D k0                    State: up       /dev/ad18       A: 92253/305245 MB (30%)
D k3                    State: up       /dev/ad16       A: 0/305245 MB (0%)
D k1                    State: up       /dev/ad14       A: 92253/305245 MB (30%)
D k2                    State: up       /dev/ad12       A: 0/305245 MB (0%)
D k4                    State: up       /dev/ad10       A: 77036/286188 MB (26%)
D k7                    State: up       /dev/ad8        A: 77036/286188 MB (26%)
D k5                    State: up       /dev/ad6        A: 77036/286188 MB (26%)
D k6                    State: up       /dev/ad4        A: 77036/286188 MB (26%)

10 volumes:
V extra                 State: up       Plexes:       1 Size:        524 GB
V 3rd-party             State: up       Plexes:       2 Size:         16 GB
V home                  State: up       Plexes:       2 Size:         32 GB
V projects              State: up       Plexes:       2 Size:       4096 MB
V bsd                   State: up       Plexes:       2 Size:        128 GB
V media                 State: up       Plexes:       2 Size:         64 GB
V picts                 State: up       Plexes:       1 Size:        768 MB
V music                 State: up       Plexes:       1 Size:        120 GB
V flac                  State: up       Plexes:       1 Size:        192 GB
V video                 State: up       Plexes:       1 Size:        300 GB

15 plexes:
P extra.p0            S State: up       Subdisks:     2 Size:        524 GB
P 3rd-party.p0        C State: up       Subdisks:     1 Size:         16 GB
P 3rd-party.p1        C State: up       Subdisks:     1 Size:         16 GB
P home.p1             C State: up       Subdisks:     1 Size:         32 GB
P home.p0             C State: up       Subdisks:     1 Size:         32 GB
P projects.p1         C State: up       Subdisks:     1 Size:       4096 MB
P projects.p0         C State: up       Subdisks:     1 Size:       4096 MB
P bsd.p1              C State: up       Subdisks:     1 Size:        128 GB
P bsd.p0              C State: up       Subdisks:     1 Size:        128 GB
P media.p1            C State: up       Subdisks:     1 Size:         64 GB
P media.p0            C State: up       Subdisks:     1 Size:         64 GB
P picts.p0           R5 State: degraded Subdisks:     4 Size:        768 MB
P music.p0           R5 State: degraded Subdisks:     4 Size:        120 GB
P flac.p0            R5 State: degraded Subdisks:     4 Size:        192 GB
P video.p0           R5 State: degraded Subdisks:     4 Size:        300 GB

28 subdisks:
S extra.p0.s1           State: up       D: k3           Size:        262 GB
S extra.p0.s0           State: up       D: k2           Size:        262 GB
S 3rd-party.p0.s0       State: up       D: k0           Size:         16 GB
S 3rd-party.p1.s0       State: up       D: k1           Size:         16 GB
S home.p1.s0            State: up       D: k3           Size:         32 GB
S home.p0.s0            State: up       D: k2           Size:         32 GB
S projects.p1.s0        State: up       D: k3           Size:       4096 MB
S projects.p0.s0        State: up       D: k2           Size:       4096 MB
S bsd.p1.s0             State: up       D: k1           Size:        128 GB
S bsd.p0.s0             State: up       D: k0           Size:        128 GB
S media.p1.s0           State: up       D: k1           Size:         64 GB
S media.p0.s0           State: up       D: k0           Size:         64 GB
S picts.p0.s0           State: stale    D: k4           Size:        256 MB
S picts.p0.s1           State: up       D: k5           Size:        256 MB
S picts.p0.s2           State: up       D: k6           Size:        256 MB
S picts.p0.s3           State: up       D: k7           Size:        256 MB
S music.p0.s0           State: stale    D: k4           Size:         40 GB
S music.p0.s1           State: up       D: k5           Size:         40 GB
S music.p0.s2           State: up       D: k6           Size:         40 GB
S music.p0.s3           State: up       D: k7           Size:         40 GB
S flac.p0.s3            State: up       D: k7           Size:         64 GB
S flac.p0.s2            State: up       D: k6           Size:         64 GB
S flac.p0.s1            State: up       D: k5           Size:         64 GB
S flac.p0.s0            State: stale    D: k4           Size:         64 GB
S video.p0.s3           State: up       D: k7           Size:        100 GB
S video.p0.s2           State: up       D: k6           Size:        100 GB
S video.p0.s1           State: up       D: k5           Size:        100 GB
S video.p0.s0           State: stale    D: k4           Size:        100 GB
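For anyone trying to reproduce this layout, the music volume corresponds to a vinum create-file stanza roughly like the one below. The drive-to-device pairings are taken from the listing above, but the 512k stripe size is a guess for illustration, not taken from my real config:

drive k4 device /dev/ad10
drive k5 device /dev/ad6
drive k6 device /dev/ad4
drive k7 device /dev/ad8
volume music
  plex org raid5 512k          # stripe size assumed
    sd length 40g drive k4
    sd length 40g drive k5
    sd length 40g drive k6
    sd length 40g drive k7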
Although not relevant for Problem #3, here is a history of the controller card failure. First one of the raid5 drives had a fatal read failure, causing that drive to be dropped:

Mar 31 04:25:38 kay kernel: ad10: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=267860265
Mar 31 04:25:47 kay kernel: ad10: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Mar 31 04:25:51 kay kernel: ad10: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Mar 31 04:25:55 kay kernel: ad10: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Mar 31 04:25:59 kay kernel: ad10: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Mar 31 04:26:03 kay kernel: ad10: WARNING - SET_MULTI taskqueue timeout - completing request directly
Mar 31 04:26:03 kay kernel: ad10: TIMEOUT - READ_DMA retrying (0 retries left) LBA=267860265
Mar 31 04:26:05 kay kernel: ad10: FAILURE - READ_DMA status=ff<BUSY,READY,DMA_READY,DSC,DRQ,CORRECTABLE,INDEX,ERROR> error=ff<ICRC,UNCORRECTABLE,MEDIA_CHANGED,NID_NOT_FOUND,MEDIA_CHANGE_REQEST,ABORTED,NO_MEDIA,ILLEGAL_LENGTH> LBA=267860265
Mar 31 04:26:05 kay kernel: GEOM_VINUM: subdisk picts.p0.s0 state change: up -> down
Mar 31 04:26:05 kay kernel: GEOM_VINUM: plex picts.p0 state change: up -> degraded
Mar 31 04:26:05 kay kernel: GEOM_VINUM: subdisk music.p0.s0 state change: up -> down
Mar 31 04:26:05 kay kernel: GEOM_VINUM: plex music.p0 state change: up -> degraded
Mar 31 04:26:05 kay kernel: GEOM_VINUM: subdisk flac.p0.s0 state change: up -> down
Mar 31 04:26:05 kay kernel: GEOM_VINUM: plex flac.p0 state change: up -> degraded
Mar 31 04:26:05 kay kernel: GEOM_VINUM: subdisk video.p0.s0 state change: up -> down
Mar 31 04:26:05 kay kernel: GEOM_VINUM: plex video.p0 state change: up -> degraded
Mar 31 04:26:05 kay kernel: g_vfs_done():gvinum/video[WRITE(offset=75620171776, length=131072)]error = 5
Mar 31 04:26:05 kay kernel: g_vfs_done():gvinum/video[WRITE(offset=75620302848, length=131072)]error = 6
Mar 31 04:26:05 kay kernel: g_vfs_done():gvinum/video[WRITE(offset=75621220352, length=131072)]error = 6
Mar 31 04:26:05 kay kernel: g_vfs_done():gvinum/video[WRITE(offset=75620433920, length=131072)]error = 6
... [continued similar messages]
Mar 31 04:26:10 kay kernel: ad10: TIMEOUT - READ_DMA retrying (1 retry left) LBA=268002601
Mar 31 04:26:10 kay kernel: GEOM_VINUM: lost drive 'k4'

A few hundred of the ENXIO (error = 6) kernel messages happen. Notice the ata subsystem doesn't report that ad10 was ever detached. It was just unresponsive, and a reinit blew away the drive for good.
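For the record, the reinit attempts looked roughly like this; the channel name is my reading of the usual ad-number-to-channel mapping (ad10 = channel 5, master), so confirm with "atacontrol list" before trusting it:

# atacontrol list              (find the channel ad10 is attached to)
# atacontrol detach ata5       (ata5 assumed from the mapping above)
# atacontrol attach ata5
# atacontrol reinit ata5       (the step that killed the drive instead of reviving it)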
Later, a drive not part of the RAID5 but part of some mirrors and a slice decides to wig out, causing the ata subsystem to drop the drive forever until a reboot (Problem #1):

Mar 31 04:53:50 kay kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Mar 31 04:53:54 kay kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Mar 31 04:53:58 kay kernel: ad12: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Mar 31 04:54:02 kay kernel: ad12: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Mar 31 04:54:06 kay kernel: ad12: WARNING - SET_MULTI taskqueue timeout - completing request directly
Mar 31 04:54:06 kay kernel: ad12: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=497537109
Mar 31 04:54:21 kay kernel: ad12: FAILURE - device detached
Mar 31 04:54:21 kay kernel: subdisk12: detached
Mar 31 04:54:21 kay kernel: ad12: detached

Which in turn causes the expected:

Mar 31 04:54:21 kay kernel: GEOM_VINUM: subdisk projects.p0.s0 state change: up -> down
Mar 31 04:54:21 kay kernel: GEOM_VINUM: plex projects.p0 state change: up -> down
Mar 31 04:54:21 kay kernel: GEOM_VINUM: subdisk home.p0.s0 state change: up -> down
Mar 31 04:54:21 kay kernel: GEOM_VINUM: plex home.p0 state change: up -> down
Mar 31 04:54:21 kay kernel: GEOM_VINUM: subdisk extra.p0.s0 state change: up -> down
Mar 31 04:54:21 kay kernel: GEOM_VINUM: plex extra.p0 state change: up -> down
Mar 31 04:54:21 kay kernel: g_vfs_done():gvinum/extra[READ(offset=432168124416, length=131072)]error = 6
Mar 31 04:54:21 kay kernel: g_vfs_done():gvinum/extra[READ(offset=432168124416, length=65536)]error = 6
Mar 31 04:54:21 kay kernel: g_vfs_done():gvinum/extra[READ(offset=432168189952, length=131072)]error = 6
Mar 31 04:54:21 kay kernel: GEOM_VINUM: g_access failed on drive k2, errno 6
Mar 31 04:54:21 kay kernel: g_vfs_done():gvinum/extra[READ(offset=95073779712, length=2048)]error = 6
Mar 31 04:54:21 kay last message repeated 4 times
Mar 31 04:54:21 kay kernel: GEOM_VINUM: lost drive 'k2'

Followed by the failure of the geom subsystem to ever properly deal with the problematic filesystem until forcibly rebooted (Problem 2):

Mar 31 04:54:50 kay kernel: g_vfs_done():gvinum/extra[WRITE(offset=94997381120, length=16384)]error = 6
Mar 31 04:55:19 kay kernel: g_vfs_done():gvinum/extra[WRITE(offset=94997381120, length=16384)]error = 6
Mar 31 04:57:15 kay last message repeated 4 times
Mar 31 05:06:55 kay last message repeated 20 times
Mar 31 05:17:05 kay last message repeated 21 times
Mar 31 05:27:15 kay last message repeated 21 times
Mar 31 05:36:56 kay last message repeated 20 times
Mar 31 05:47:06 kay last message repeated 21 times
Mar 31 05:57:15 kay last message repeated 21 times
Mar 31 06:06:56 kay last message repeated 20 times
Mar 31 06:17:06 kay last message repeated 21 times
Mar 31 06:27:16 kay last message repeated 21 times
Mar 31 06:36:57 kay last message repeated 20 times
Mar 31 06:47:06 kay last message repeated 21 times
Mar 31 06:57:16 kay last message repeated 21 times
Mar 31 07:06:27 kay last message repeated 19 times
Mar 31 07:06:34 kay kernel: g_vfs_done():gvinum/extra[READ(offset=95073779712, length=2048)]error = 6
Mar 31 07:06:57 kay kernel: g_vfs_done():gvinum/extra[WRITE(offset=94997381120, length=16384)]error = 6
Mar 31 07:07:26 kay kernel: g_vfs_done():gvinum/extra[WRITE(offset=94997381120, length=16384)]error = 6
Mar 31 07:09:51 kay last message repeated 5 times
Mar 31 07:19:32 kay last message repeated 20 times
Mar 31 07:29:42 kay last message repeated 21 times
Mar 31 07:39:23 kay last message repeated 20 times
Mar 31 07:49:34 kay last message repeated 21 times
Mar 31 07:59:43 kay last message repeated 21 times
Mar 31 08:09:23 kay last message repeated 20 times
Mar 31 08:19:33 kay last message repeated 21 times
Mar 31 08:29:43 kay last message repeated 21 times
Mar 31 08:39:25 kay last message repeated 20 times
Mar 31 08:49:35 kay last message repeated 21 times
Mar 31 08:59:45 kay last message repeated 21 times
Mar 31 09:09:26 kay last message repeated 20 times
Mar 31 09:19:35 kay last message repeated 21 times
Mar 31 09:29:46 kay last message repeated 21 times
Mar 31 09:39:27 kay last message repeated 20 times
Mar 31 09:49:37 kay last message repeated 21 times
Mar 31 09:59:47 kay last message repeated 21 times
Mar 31 10:09:28 kay last message repeated 20 times
Mar 31 10:19:39 kay last message repeated 21 times
Mar 31 10:29:49 kay last message repeated 21 times
Mar 31 10:39:30 kay last message repeated 20 times
Mar 31 10:49:40 kay last message repeated 21 times
Mar 31 10:59:51 kay last message repeated 21 times
Mar 31 11:09:32 kay last message repeated 20 times
Mar 31 11:19:42 kay last message repeated 21 times
Mar 31 11:29:23 kay last message repeated 20 times
Mar 31 11:39:32 kay last message repeated 21 times
Mar 31 11:49:42 kay last message repeated 21 times
Mar 31 11:54:23 kay last message repeated 9 times
Mar 31 11:54:27 kay syslogd: exiting on signal 15

-- 
Rick C. Petty