Date: Sat, 31 Mar 2007 13:11:16 -0600
From: "Rick C. Petty" <rick-freebsd@kiwi-computer.com>
To: freebsd-geom@freebsd.org
Subject: gvinum rebuildparity breakage
Message-ID: <20070331191116.GA87710@keira.kiwi-computer.com>
Due to ata driver problems, one of the disks holding part of my RAID5 volumes was dropped (see below). gvinum did the right thing: it reported the drive lost, and all the relevant subdisks in my RAID5 plexes were changed to degraded.

Problem 1). The ata driver would not bring up the dropped drive. Numerous attempts at "atacontrol reinit" and detach/attach sequences would not bring the drive back. I've reported this problem in the past with this Promise PDC40718 SATA300 controller card and have not seen or heard of a resolution. My guess is the ata folks blame it on the Promise chip; fine.

Problem 2). The geom subsystem continued to spit out error messages yet failed to allow the volumes to be unmounted. I've yet to witness geom do anything gracefully, such as allow the mountpoints to be forcibly unmounted (without panicking) or perform a shutdown without dirty buffers. Fine.

Problems 1 & 2 required a reboot; so much for my 115-day uptime. I'll accept this, because I expect gvinum to do the Right Thing... which brings me to:

Problem 3). I have four volumes with degraded RAID5 plexes, as is expected from the dropped drive. However, the drive is up now but gvinum won't let me rebuild the parity:

# gvinum rebuildparity music.p0
gvinum: plex music.p0 is not completely accessible

What a strange message, unless it's referring to the stale subdisk, but it won't let me do anything about that either:

# gvinum start music.p0.s0
gvinum: can't start: cannot start 'music.p0.s0' - not yet supported

I'm not keen to try forcibly setting the state with "gvinum setstate -f", because the man page states this is for diagnostic purposes only, and the man page doesn't give a list of possible states anyway. The documentation has been lacking since the 5.x nerf of vinum/gvinum, so the man page and "gvinum help" are quite useless (Problem #4). I'd rather not experiment with this command since the volumes are important, so I'm left without options.

So, how do I rebuild my RAID5 volumes? This used to work fine in pre-geom vinum, and I'm pretty sure it worked in gvinum at one point. I'm running 6.1-STABLE (RELENG_6 as of 2006-Jul-21) with a GENERIC kernel.
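For reference, this is the sequence I would have expected to work, going by the old vinum(8) documentation; I haven't actually run anything past the first two commands, and whether gvinum on 6.1 honors it is exactly what's in question here:

# gvinum start music.p0.s0        (revive the stale subdisk from the up subdisks; fails as shown above)
# gvinum checkparity music.p0     (verify parity once the subdisk is up again)
# gvinum rebuildparity music.p0   (rewrite any bad parity that checkparity finds)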
Here is my vinum configuration:

# gvinum l
8 drives:
D k0                    State: up       /dev/ad18       A: 92253/305245 MB (30%)
D k3                    State: up       /dev/ad16       A: 0/305245 MB (0%)
D k1                    State: up       /dev/ad14       A: 92253/305245 MB (30%)
D k2                    State: up       /dev/ad12       A: 0/305245 MB (0%)
D k4                    State: up       /dev/ad10       A: 77036/286188 MB (26%)
D k7                    State: up       /dev/ad8        A: 77036/286188 MB (26%)
D k5                    State: up       /dev/ad6        A: 77036/286188 MB (26%)
D k6                    State: up       /dev/ad4        A: 77036/286188 MB (26%)

10 volumes:
V extra                 State: up       Plexes:       1 Size:        524 GB
V 3rd-party             State: up       Plexes:       2 Size:         16 GB
V home                  State: up       Plexes:       2 Size:         32 GB
V projects              State: up       Plexes:       2 Size:       4096 MB
V bsd                   State: up       Plexes:       2 Size:        128 GB
V media                 State: up       Plexes:       2 Size:         64 GB
V picts                 State: up       Plexes:       1 Size:        768 MB
V music                 State: up       Plexes:       1 Size:        120 GB
V flac                  State: up       Plexes:       1 Size:        192 GB
V video                 State: up       Plexes:       1 Size:        300 GB

15 plexes:
P extra.p0            S State: up       Subdisks:     2 Size:        524 GB
P 3rd-party.p0        C State: up       Subdisks:     1 Size:         16 GB
P 3rd-party.p1        C State: up       Subdisks:     1 Size:         16 GB
P home.p1             C State: up       Subdisks:     1 Size:         32 GB
P home.p0             C State: up       Subdisks:     1 Size:         32 GB
P projects.p1         C State: up       Subdisks:     1 Size:       4096 MB
P projects.p0         C State: up       Subdisks:     1 Size:       4096 MB
P bsd.p1              C State: up       Subdisks:     1 Size:        128 GB
P bsd.p0              C State: up       Subdisks:     1 Size:        128 GB
P media.p1            C State: up       Subdisks:     1 Size:         64 GB
P media.p0            C State: up       Subdisks:     1 Size:         64 GB
P picts.p0           R5 State: degraded Subdisks:     4 Size:        768 MB
P music.p0           R5 State: degraded Subdisks:     4 Size:        120 GB
P flac.p0            R5 State: degraded Subdisks:     4 Size:        192 GB
P video.p0           R5 State: degraded Subdisks:     4 Size:        300 GB

28 subdisks:
S extra.p0.s1           State: up       D: k3           Size:        262 GB
S extra.p0.s0           State: up       D: k2           Size:        262 GB
S 3rd-party.p0.s0       State: up       D: k0           Size:         16 GB
S 3rd-party.p1.s0       State: up       D: k1           Size:         16 GB
S home.p1.s0            State: up       D: k3           Size:         32 GB
S home.p0.s0            State: up       D: k2           Size:         32 GB
S projects.p1.s0        State: up       D: k3           Size:       4096 MB
S projects.p0.s0        State: up       D: k2           Size:       4096 MB
S bsd.p1.s0             State: up       D: k1           Size:        128 GB
S bsd.p0.s0             State: up       D: k0           Size:        128 GB
S media.p1.s0           State: up       D: k1           Size:         64 GB
S media.p0.s0           State: up       D: k0           Size:         64 GB
S picts.p0.s0           State: stale    D: k4           Size:        256 MB
S picts.p0.s1           State: up       D: k5           Size:        256 MB
S picts.p0.s2           State: up       D: k6           Size:        256 MB
S picts.p0.s3           State: up       D: k7           Size:        256 MB
S music.p0.s0           State: stale    D: k4           Size:         40 GB
S music.p0.s1           State: up       D: k5           Size:         40 GB
S music.p0.s2           State: up       D: k6           Size:         40 GB
S music.p0.s3           State: up       D: k7           Size:         40 GB
S flac.p0.s3            State: up       D: k7           Size:         64 GB
S flac.p0.s2            State: up       D: k6           Size:         64 GB
S flac.p0.s1            State: up       D: k5           Size:         64 GB
S flac.p0.s0            State: stale    D: k4           Size:         64 GB
S video.p0.s3           State: up       D: k7           Size:        100 GB
S video.p0.s2           State: up       D: k6           Size:        100 GB
S video.p0.s1           State: up       D: k5           Size:        100 GB
S video.p0.s0           State: stale    D: k4           Size:        100 GB
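For anyone trying to reproduce this layout, the music volume corresponds to a vinum create-file stanza roughly like the one below. The drive-to-device pairings are taken from the listing above, but the 512k stripe size is a guess for illustration, not taken from my real config:

drive k4 device /dev/ad10
drive k5 device /dev/ad6
drive k6 device /dev/ad4
drive k7 device /dev/ad8
volume music
  plex org raid5 512k          # stripe size assumed
    sd length 40g drive k4
    sd length 40g drive k5
    sd length 40g drive k6
    sd length 40g drive k7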
Although not relevant for Problem #3, here is a history of the controller card failure. First one of the raid5 drives had a fatal read failure, causing that drive to be dropped:

Mar 31 04:25:38 kay kernel: ad10: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=267860265
Mar 31 04:25:47 kay kernel: ad10: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Mar 31 04:25:51 kay kernel: ad10: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Mar 31 04:25:55 kay kernel: ad10: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Mar 31 04:25:59 kay kernel: ad10: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Mar 31 04:26:03 kay kernel: ad10: WARNING - SET_MULTI taskqueue timeout - completing request directly
Mar 31 04:26:03 kay kernel: ad10: TIMEOUT - READ_DMA retrying (0 retries left) LBA=267860265
Mar 31 04:26:05 kay kernel: ad10: FAILURE - READ_DMA status=ff<BUSY,READY,DMA_READY,DSC,DRQ,CORRECTABLE,INDEX,ERROR> error=ff<ICRC,UNCORRECTABLE,MEDIA_CHANGED,NID_NOT_FOUND,MEDIA_CHANGE_REQEST,ABORTED,NO_MEDIA,ILLEGAL_LENGTH> LBA=267860265
Mar 31 04:26:05 kay kernel: GEOM_VINUM: subdisk picts.p0.s0 state change: up -> down
Mar 31 04:26:05 kay kernel: GEOM_VINUM: plex picts.p0 state change: up -> degraded
Mar 31 04:26:05 kay kernel: GEOM_VINUM: subdisk music.p0.s0 state change: up -> down
Mar 31 04:26:05 kay kernel: GEOM_VINUM: plex music.p0 state change: up -> degraded
Mar 31 04:26:05 kay kernel: GEOM_VINUM: subdisk flac.p0.s0 state change: up -> down
Mar 31 04:26:05 kay kernel: GEOM_VINUM: plex flac.p0 state change: up -> degraded
Mar 31 04:26:05 kay kernel: GEOM_VINUM: subdisk video.p0.s0 state change: up -> down
Mar 31 04:26:05 kay kernel: GEOM_VINUM: plex video.p0 state change: up -> degraded
Mar 31 04:26:05 kay kernel: g_vfs_done():gvinum/video[WRITE(offset=75620171776, length=131072)]error = 5
Mar 31 04:26:05 kay kernel: g_vfs_done():gvinum/video[WRITE(offset=75620302848, length=131072)]error = 6
Mar 31 04:26:05 kay kernel: g_vfs_done():gvinum/video[WRITE(offset=75621220352, length=131072)]error = 6
Mar 31 04:26:05 kay kernel: g_vfs_done():gvinum/video[WRITE(offset=75620433920, length=131072)]error = 6
... [continued similar messages]
Mar 31 04:26:10 kay kernel: ad10: TIMEOUT - READ_DMA retrying (1 retry left) LBA=268002601
Mar 31 04:26:10 kay kernel: GEOM_VINUM: lost drive 'k4'

A few hundred of the ENXIO (error = 6) kernel messages happen. Notice the ata subsystem doesn't report that ad10 was ever detached. It was just unresponsive, and a reinit blew away the drive for good.
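For the record, the reinit attempts looked roughly like this; the channel name is my reading of the usual ad-number-to-channel mapping (ad10 = channel 5, master), so confirm with "atacontrol list" before trusting it:

# atacontrol list              (find the channel ad10 is attached to)
# atacontrol detach ata5       (ata5 assumed from the mapping above)
# atacontrol attach ata5
# atacontrol reinit ata5       (the step that killed the drive instead of reviving it)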
Later, a drive not part of the RAID5 but part of some mirrors and a slice decides to wig out, causing the ata subsystem to drop the drive forever until a reboot (Problem #1):

Mar 31 04:53:50 kay kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Mar 31 04:53:54 kay kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Mar 31 04:53:58 kay kernel: ad12: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Mar 31 04:54:02 kay kernel: ad12: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Mar 31 04:54:06 kay kernel: ad12: WARNING - SET_MULTI taskqueue timeout - completing request directly
Mar 31 04:54:06 kay kernel: ad12: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=497537109
Mar 31 04:54:21 kay kernel: ad12: FAILURE - device detached
Mar 31 04:54:21 kay kernel: subdisk12: detached
Mar 31 04:54:21 kay kernel: ad12: detached

Which in turn causes the expected:

Mar 31 04:54:21 kay kernel: GEOM_VINUM: subdisk projects.p0.s0 state change: up -> down
Mar 31 04:54:21 kay kernel: GEOM_VINUM: plex projects.p0 state change: up -> down
Mar 31 04:54:21 kay kernel: GEOM_VINUM: subdisk home.p0.s0 state change: up -> down
Mar 31 04:54:21 kay kernel: GEOM_VINUM: plex home.p0 state change: up -> down
Mar 31 04:54:21 kay kernel: GEOM_VINUM: subdisk extra.p0.s0 state change: up -> down
Mar 31 04:54:21 kay kernel: GEOM_VINUM: plex extra.p0 state change: up -> down
Mar 31 04:54:21 kay kernel: g_vfs_done():gvinum/extra[READ(offset=432168124416, length=131072)]error = 6
Mar 31 04:54:21 kay kernel: g_vfs_done():gvinum/extra[READ(offset=432168124416, length=65536)]error = 6
Mar 31 04:54:21 kay kernel: g_vfs_done():gvinum/extra[READ(offset=432168189952, length=131072)]error = 6
Mar 31 04:54:21 kay kernel: GEOM_VINUM: g_access failed on drive k2, errno 6
Mar 31 04:54:21 kay kernel: g_vfs_done():gvinum/extra[READ(offset=95073779712, length=2048)]error = 6
Mar 31 04:54:21 kay last message repeated 4 times
Mar 31 04:54:21 kay kernel: GEOM_VINUM: lost drive 'k2'

Followed by the failure of the geom subsystem to ever properly deal with the problematic filesystem until forcibly rebooted (Problem 2):

Mar 31 04:54:50 kay kernel: g_vfs_done():gvinum/extra[WRITE(offset=94997381120, length=16384)]error = 6
Mar 31 04:55:19 kay kernel: g_vfs_done():gvinum/extra[WRITE(offset=94997381120, length=16384)]error = 6
Mar 31 04:57:15 kay last message repeated 4 times
Mar 31 05:06:55 kay last message repeated 20 times
Mar 31 05:17:05 kay last message repeated 21 times
Mar 31 05:27:15 kay last message repeated 21 times
Mar 31 05:36:56 kay last message repeated 20 times
Mar 31 05:47:06 kay last message repeated 21 times
Mar 31 05:57:15 kay last message repeated 21 times
Mar 31 06:06:56 kay last message repeated 20 times
Mar 31 06:17:06 kay last message repeated 21 times
Mar 31 06:27:16 kay last message repeated 21 times
Mar 31 06:36:57 kay last message repeated 20 times
Mar 31 06:47:06 kay last message repeated 21 times
Mar 31 06:57:16 kay last message repeated 21 times
Mar 31 07:06:27 kay last message repeated 19 times
Mar 31 07:06:34 kay kernel: g_vfs_done():gvinum/extra[READ(offset=95073779712, length=2048)]error = 6
Mar 31 07:06:57 kay kernel: g_vfs_done():gvinum/extra[WRITE(offset=94997381120, length=16384)]error = 6
Mar 31 07:07:26 kay kernel: g_vfs_done():gvinum/extra[WRITE(offset=94997381120, length=16384)]error = 6
Mar 31 07:09:51 kay last message repeated 5 times
Mar 31 07:19:32 kay last message repeated 20 times
Mar 31 07:29:42 kay last message repeated 21 times
Mar 31 07:39:23 kay last message repeated 20 times
Mar 31 07:49:34 kay last message repeated 21 times
Mar 31 07:59:43 kay last message repeated 21 times
Mar 31 08:09:23 kay last message repeated 20 times
Mar 31 08:19:33 kay last message repeated 21 times
Mar 31 08:29:43 kay last message repeated 21 times
Mar 31 08:39:25 kay last message repeated 20 times
Mar 31 08:49:35 kay last message repeated 21 times
Mar 31 08:59:45 kay last message repeated 21 times
Mar 31 09:09:26 kay last message repeated 20 times
Mar 31 09:19:35 kay last message repeated 21 times
Mar 31 09:29:46 kay last message repeated 21 times
Mar 31 09:39:27 kay last message repeated 20 times
Mar 31 09:49:37 kay last message repeated 21 times
Mar 31 09:59:47 kay last message repeated 21 times
Mar 31 10:09:28 kay last message repeated 20 times
Mar 31 10:19:39 kay last message repeated 21 times
Mar 31 10:29:49 kay last message repeated 21 times
Mar 31 10:39:30 kay last message repeated 20 times
Mar 31 10:49:40 kay last message repeated 21 times
Mar 31 10:59:51 kay last message repeated 21 times
Mar 31 11:09:32 kay last message repeated 20 times
Mar 31 11:19:42 kay last message repeated 21 times
Mar 31 11:29:23 kay last message repeated 20 times
Mar 31 11:39:32 kay last message repeated 21 times
Mar 31 11:49:42 kay last message repeated 21 times
Mar 31 11:54:23 kay last message repeated 9 times
Mar 31 11:54:27 kay syslogd: exiting on signal 15

-- 
Rick C. Petty