Date:      Mon, 20 Jun 2022 17:08:02 -0600
From:      Alan Somers <asomers@freebsd.org>
To:        John Doherty <bsdlists@jld3.net>
Cc:        freebsd-fs <freebsd-fs@freebsd.org>
Subject:   Re: "spare-X" device remains after resilvering
Message-ID:  <CAOtMX2gPZk3kW5_P_-MGZROrURqr+f3rqEDM6XooLO2v=DbgDA@mail.gmail.com>
In-Reply-To: <768F3745-D7FF-48C8-BA28-ABEB49BAFAA8@jld3.net>
References:  <34A91D31-1883-40AE-82F3-57B783532ED7@jld3.net> <CAOtMX2iv3g-pA=XciiFCoH-6y+=RKeJ61TnOvJm2bPNoc_WwEg@mail.gmail.com> <768F3745-D7FF-48C8-BA28-ABEB49BAFAA8@jld3.net>

On Mon, Jun 20, 2022 at 4:54 PM John Doherty <bsdlists@jld3.net> wrote:
>
> On Mon 2022-06-20 03:40 PM MDT -0600, <asomers@freebsd.org> wrote:
>
> > On Mon, Jun 20, 2022 at 7:42 AM John Doherty <bsdlists@jld3.net>
> > wrote:
> >>
> >> Hi, I have a zpool that currently looks like this (some lines elided for
> >> brevity; all omitted devices are online and apparently fine):
> >>
> >>    pool: zp1
> >>   state: DEGRADED
> >> status: One or more devices has been taken offline by the administrator.
> >>         Sufficient replicas exist for the pool to continue functioning
> >>         in a degraded state.
> >> action: Online the device using 'zpool online' or replace the device with
> >>         'zpool replace'.
> >>    scan: resilvered 1.76T in 1 days 00:38:14 with 0 errors on Sun Jun 19
> >>          22:31:46 2022
> >> config:
> >>
> >>          NAME                       STATE     READ WRITE CKSUM
> >>          zp1                        DEGRADED     0     0     0
> >>            raidz2-0                 ONLINE       0     0     0
> >>              gpt/disk0              ONLINE       0     0     0
> >>              gpt/disk1              ONLINE       0     0     0
> >>              ...
> >>              gpt/disk9              ONLINE       0     0     0
> >>            raidz2-1                 ONLINE       0     0     0
> >>              gpt/disk10             ONLINE       0     0     0
> >>              ...
> >>              gpt/disk19             ONLINE       0     0     0
> >>            raidz2-2                 ONLINE       0     0     0
> >>              gpt/disk20             ONLINE       0     0     0
> >>              ...
> >>              gpt/disk29             ONLINE       0     0     0
> >>            raidz2-3                 DEGRADED     0     0     0
> >>              gpt/disk30             ONLINE       0     0     0
> >>              3343132967577870793    OFFLINE      0     0     0  was /dev/gpt/disk31
> >>              ...
> >>              spare-9                DEGRADED     0     0     0
> >>                6960108738988598438  OFFLINE      0     0     0  was /dev/gpt/disk39
> >>                gpt/disk41           ONLINE       0     0     0
> >>          spares
> >>            16713572025248921080     INUSE     was /dev/gpt/disk41
> >>            gpt/disk42               AVAIL
> >>            gpt/disk43               AVAIL
> >>            gpt/disk44               AVAIL
> >>
> >> My question is why the "spare-9" device still exists after the
> >> resilvering completed. Based on past experience, my expectation was that
> >> it would exist for the duration of the resilvering and after that, only
> >> the "gpt/disk41" device would appear in the output of "zpool status."
> >>
> >> I also expected that when the resilvering completed, the
> >> "was /dev/gpt/disk41" device would be removed from the list of spares.
> >>
> >> I took the "was /dev/gpt/disk31" device offline deliberately because it
> >> was causing a lot of "CAM status: SCSI Status Error" errors. Next step
> >> for this pool is to replace that with one of the available spares but
> >> I'd like to get things looking a little cleaner before doing that.
> >>
> >> I don't have much in the way of ideas here. One thought was to export
> >> the pool and then do "zpool import zp1 -d /dev/gpt" and see if that
> >> cleaned things up.
> >>
> >> This system is running 12.2-RELEASE-p4, which I know is a little out of
> >> date. I'm going to update it to 13.1-RELEASE soon but the more immediate
> >> need is to get this zpool in good shape.
> >>
> >> Any insights or advice much appreciated. Happy to provide any further
> >> info that might be helpful. Thanks.
> >
> > This is expected behavior.  I take it you were expecting
> > 6960108738988598438 to be removed from the configuration, replaced by
> > gpt/disk41, and gpt/disk41 to disappear from the spare list?  That
> > didn't happen because ZFS considers anything in the spare list to be a
> > permanent spare.  It will never automatically remove a disk from the
> > spare list.  Instead, ZFS expects you to provide it with a permanent
> > replacement for the failed disk.  Once resilvering to the permanent
> > replacement is complete, it will automatically detach the spare.
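> >
> > For example, once a new physical disk is installed and labeled (the
> > gpt/disk45 label below is just a placeholder for whatever you call it),
> > something like
> >
> > zpool replace zp1 6960108738988598438 gpt/disk45
> >
> > should resilver onto the new disk, and when that finishes the spare-9
> > vdev should collapse on its own, returning gpt/disk41 to AVAIL.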
> >
> > OTOH, if you really want gpt/disk41 to be the permanent replacement, I
> > think you can accomplish that with some combination of the following
> > commands:
> >
> > zpool detach zp1 6960108738988598438
> > zpool remove zp1 gpt/disk41
>
> Ah, OK, I did not understand that spares worked that way.
>
> I don't think I can detach anything because this is all raidz2 and
> detach only works with components of mirrors.
>
> But experimenting with a zpool created from files, I can see that spares
> work as you describe, e.g.:
>
> # zpool status zpX
>    pool: zpX
>   state: ONLINE
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         zpX           ONLINE       0     0     0
>           raidz2-0    ONLINE       0     0     0
>             /vd/vd00  ONLINE       0     0     0
>             /vd/vd01  ONLINE       0     0     0
>             /vd/vd02  ONLINE       0     0     0
>             /vd/vd03  ONLINE       0     0     0
>             /vd/vd04  ONLINE       0     0     0
>             /vd/vd05  ONLINE       0     0     0
>             /vd/vd06  ONLINE       0     0     0
>             /vd/vd07  ONLINE       0     0     0
>         spares
>           /vd/vd08    AVAIL
>
> errors: No known data errors
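>
> (For reference, a file-backed test pool like this can be created with
> something along these lines; the sizes and the /vd paths are arbitrary:)
>
> # mkdir -p /vd
> # for i in 0 1 2 3 4 5 6 7 8; do truncate -s 128m /vd/vd0$i; done
> # zpool create zpX raidz2 /vd/vd00 /vd/vd01 /vd/vd02 /vd/vd03 \
>     /vd/vd04 /vd/vd05 /vd/vd06 /vd/vd07 spare /vd/vd08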
>
> Then:
>
> # zpool offline zpX /vd/vd00
> # zpool replace zpX /vd/vd00 /vd/vd08
> # zpool status zpX
>    pool: zpX
>   state: DEGRADED
> status: One or more devices has been taken offline by the administrator.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Online the device using 'zpool online' or replace the device with
>         'zpool replace'.
>    scan: resilvered 456K in 00:00:00 with 0 errors on Mon Jun 20 16:18:46 2022
> config:
>
>         NAME            STATE     READ WRITE CKSUM
>         zpX             DEGRADED     0     0     0
>           raidz2-0      DEGRADED     0     0     0
>             spare-0     DEGRADED     0     0     0
>               /vd/vd00  OFFLINE      0     0     0
>               /vd/vd08  ONLINE       0     0     0
>             /vd/vd01    ONLINE       0     0     0
>             /vd/vd02    ONLINE       0     0     0
>             /vd/vd03    ONLINE       0     0     0
>             /vd/vd04    ONLINE       0     0     0
>             /vd/vd05    ONLINE       0     0     0
>             /vd/vd06    ONLINE       0     0     0
>             /vd/vd07    ONLINE       0     0     0
>         spares
>           /vd/vd08      INUSE     currently in use
>
> errors: No known data errors
>
> To get that pool out of the degraded state, I must replace the offline
> device with something other than a configured spare, like this:
>
> # zpool replace zpX /vd/vd00 /vd/vd09
> # zpool status zpX
>    pool: zpX
>   state: ONLINE
>    scan: resilvered 516K in 00:00:00 with 0 errors on Mon Jun 20 16:20:36 2022
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         zpX           ONLINE       0     0     0
>           raidz2-0    ONLINE       0     0     0
>             /vd/vd09  ONLINE       0     0     0
>             /vd/vd01  ONLINE       0     0     0
>             /vd/vd02  ONLINE       0     0     0
>             /vd/vd03  ONLINE       0     0     0
>             /vd/vd04  ONLINE       0     0     0
>             /vd/vd05  ONLINE       0     0     0
>             /vd/vd06  ONLINE       0     0     0
>             /vd/vd07  ONLINE       0     0     0
>         spares
>           /vd/vd08    AVAIL
>
> errors: No known data errors
>
> After that, there is no remnant of /vd/vd00 and /vd/vd08 has gone back
> as an available spare.
>
> So with my real zpool, I should be able to remove one of the available
> spares and replace the offline device with that. When it finishes
> resilvering, there should be no more remnant of what was gpt/disk39 and
> gpt/disk41 should go back as an available spare.
>
> Or alternatively, physically remove and replace the offline disk and
> then do "zpool replace zp1 6960108738988598438 <new disk>".
>
> Seems like either of those will get the whole pool back to "online"
> status once I also replace the other offline disk with something other
> than a configured spare.
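>
> For the first of those, I presume it would be something like this, using
> gpt/disk42 (currently an AVAIL spare) as the permanent replacement:
>
> # zpool remove zp1 gpt/disk42
> # zpool replace zp1 6960108738988598438 gpt/disk42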
>
> This was all a misunderstanding on my part of how spares work. Thanks!

Ahh, but you can detach in this case, because the "spare-9" vdev is
itself a type of mirror.  Try that command.  I think it will do what
you want, with no extra resilvering required.
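
That is, something along the lines of:

zpool detach zp1 6960108738988598438

Detaching the original (offline) member of the spare-9 vdev should promote
gpt/disk41 to a permanent member of raidz2-3 and drop it from the spares
list automatically.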


