Date: Mon, 20 Jun 2022 16:50:32 -0600
From: "John Doherty" <bsdlists@jld3.net>
To: "Alan Somers" <asomers@freebsd.org>
Cc: freebsd-fs <freebsd-fs@freebsd.org>
Subject: Re: "spare-X" device remains after resilvering
Message-ID: <768F3745-D7FF-48C8-BA28-ABEB49BAFAA8@jld3.net>
In-Reply-To: <CAOtMX2iv3g-pA=XciiFCoH-6y+=RKeJ61TnOvJm2bPNoc_WwEg@mail.gmail.com>
References: <34A91D31-1883-40AE-82F3-57B783532ED7@jld3.net> <CAOtMX2iv3g-pA=XciiFCoH-6y+=RKeJ61TnOvJm2bPNoc_WwEg@mail.gmail.com>
On Mon 2022-06-20 03:40 PM MDT -0600, <asomers@freebsd.org> wrote:

> On Mon, Jun 20, 2022 at 7:42 AM John Doherty <bsdlists@jld3.net> wrote:
>>
>> Hi, I have a zpool that currently looks like this (some lines elided
>> for brevity; all omitted devices are online and apparently fine):
>>
>>   pool: zp1
>>  state: DEGRADED
>> status: One or more devices has been taken offline by the administrator.
>>         Sufficient replicas exist for the pool to continue functioning
>>         in a degraded state.
>> action: Online the device using 'zpool online' or replace the device
>>         with 'zpool replace'.
>>   scan: resilvered 1.76T in 1 days 00:38:14 with 0 errors on
>>         Sun Jun 19 22:31:46 2022
>> config:
>>
>>         NAME                       STATE     READ WRITE CKSUM
>>         zp1                        DEGRADED     0     0     0
>>           raidz2-0                 ONLINE       0     0     0
>>             gpt/disk0              ONLINE       0     0     0
>>             gpt/disk1              ONLINE       0     0     0
>>             ...
>>             gpt/disk9              ONLINE       0     0     0
>>           raidz2-1                 ONLINE       0     0     0
>>             gpt/disk10             ONLINE       0     0     0
>>             ...
>>             gpt/disk19             ONLINE       0     0     0
>>           raidz2-2                 ONLINE       0     0     0
>>             gpt/disk20             ONLINE       0     0     0
>>             ...
>>             gpt/disk29             ONLINE       0     0     0
>>           raidz2-3                 DEGRADED     0     0     0
>>             gpt/disk30             ONLINE       0     0     0
>>             3343132967577870793    OFFLINE      0     0     0  was /dev/gpt/disk31
>>             ...
>>             spare-9                DEGRADED     0     0     0
>>               6960108738988598438  OFFLINE      0     0     0  was /dev/gpt/disk39
>>               gpt/disk41           ONLINE       0     0     0
>>         spares
>>           16713572025248921080     INUSE     was /dev/gpt/disk41
>>           gpt/disk42               AVAIL
>>           gpt/disk43               AVAIL
>>           gpt/disk44               AVAIL
>>
>> My question is why the "spare-9" device still exists after the
>> resilvering completed. Based on past experience, my expectation was
>> that it would exist for the duration of the resilvering and after
>> that, only the "gpt/disk41" device would appear in the output of
>> "zpool status."
>>
>> I also expected that when the resilvering completed, the "was
>> /dev/gpt/disk41" device would be removed from the list of spares.
>>
>> I took the "was /dev/gpt/disk31" device offline deliberately because
>> it was causing a lot of "CAM status: SCSI Status Error" errors. Next
>> step for this pool is to replace that with one of the available
>> spares but I'd like to get things looking a little cleaner before
>> doing that.
>>
>> I don't have much in the way of ideas here. One thought was to export
>> the pool and then do "zpool import zp1 -d /dev/gpt" and see if that
>> cleaned things up.
>>
>> This system is running 12.2-RELEASE-p4, which I know is a little out
>> of date. I'm going to update it to 13.1-RELEASE soon but the more
>> immediate need is to get this zpool in good shape.
>>
>> Any insights or advice much appreciated. Happy to provide any further
>> info that might be helpful. Thanks.
>
> This is expected behavior. I take it that you were expecting for
> 6960108738988598438 to be removed from the configuration, replaced by
> gpt/disk41, and for gpt/disk41 to disappear from the spare list? That
> didn't happen because ZFS considers anything in the spare list to be a
> permanent spare. It will never automatically remove a disk from the
> spare list. Instead, zfs is expecting for you to provide it with a
> permanent replacement for the failed disk. Once resilvering to the
> permanent replacement is complete, then it will automatically detach
> the spare.
>
> OTOH, if you really want gpt/disk41 to be the permanent replacement, I
> think you can accomplish that with some combination of the following
> commands:
>
> zpool detach zp1 6960108738988598438
> zpool remove zp1 gpt/disk41

Ah, OK, I did not understand that spares worked that way.
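If I follow that correctly, the "normal" workflow would have been to
hand ZFS a permanent, non-spare replacement and let it detach the spare
on its own. Just sketching it out for my own notes (gpt/disk45 here is
a hypothetical new disk, not something actually in this system):

# zpool replace zp1 6960108738988598438 gpt/disk45

and once that resilver completed, gpt/disk41 should have been detached
automatically and returned to AVAIL in the spares list.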
I don't think I can detach anything because this is all raidz2 and
detach only works with components of mirrors. But experimenting with a
zpool created from files, I can see that spares work as you describe,
e.g.:

# zpool status zpX
  pool: zpX
 state: ONLINE
config:

        NAME          STATE     READ WRITE CKSUM
        zpX           ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
            /vd/vd00  ONLINE       0     0     0
            /vd/vd01  ONLINE       0     0     0
            /vd/vd02  ONLINE       0     0     0
            /vd/vd03  ONLINE       0     0     0
            /vd/vd04  ONLINE       0     0     0
            /vd/vd05  ONLINE       0     0     0
            /vd/vd06  ONLINE       0     0     0
            /vd/vd07  ONLINE       0     0     0
        spares
          /vd/vd08    AVAIL

errors: No known data errors

Then:

# zpool offline zpX /vd/vd00
# zpool replace zpX /vd/vd00 /vd/vd08
# zpool status zpX
  pool: zpX
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning
        in a degraded state.
action: Online the device using 'zpool online' or replace the device
        with 'zpool replace'.
  scan: resilvered 456K in 00:00:00 with 0 errors on Mon Jun 20 16:18:46 2022
config:

        NAME            STATE     READ WRITE CKSUM
        zpX             DEGRADED     0     0     0
          raidz2-0      DEGRADED     0     0     0
            spare-0     DEGRADED     0     0     0
              /vd/vd00  OFFLINE      0     0     0
              /vd/vd08  ONLINE       0     0     0
            /vd/vd01    ONLINE       0     0     0
            /vd/vd02    ONLINE       0     0     0
            /vd/vd03    ONLINE       0     0     0
            /vd/vd04    ONLINE       0     0     0
            /vd/vd05    ONLINE       0     0     0
            /vd/vd06    ONLINE       0     0     0
            /vd/vd07    ONLINE       0     0     0
        spares
          /vd/vd08      INUSE     currently in use

errors: No known data errors

To get that pool out of the degraded state, I must replace the offline
device with something other than a configured spare, like this:

# zpool replace zpX /vd/vd00 /vd/vd09
# zpool status zpX
  pool: zpX
 state: ONLINE
  scan: resilvered 516K in 00:00:00 with 0 errors on Mon Jun 20 16:20:36 2022
config:

        NAME          STATE     READ WRITE CKSUM
        zpX           ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
            /vd/vd09  ONLINE       0     0     0
            /vd/vd01  ONLINE       0     0     0
            /vd/vd02  ONLINE       0     0     0
            /vd/vd03  ONLINE       0     0     0
            /vd/vd04  ONLINE       0     0     0
            /vd/vd05  ONLINE       0     0     0
            /vd/vd06  ONLINE       0     0     0
            /vd/vd07  ONLINE       0     0     0
        spares
          /vd/vd08    AVAIL

errors: No known data errors

After that, there is no remnant of /vd/vd00, and /vd/vd08 has gone back
to being an available spare.

So with my real zpool, I should be able to remove one of the available
spares and replace the offline device with that. When it finishes
resilvering, there should be no more remnant of what was gpt/disk39,
and gpt/disk41 should go back to being an available spare.

Or alternatively, physically remove and replace the offline disk and
then do "zpool replace zp1 6960108738988598438 <new disk>".

Seems like either of those will get the whole pool back to "online"
status once I also replace the other offline disk with something other
than a configured spare.

This was all a misunderstanding on my part of how spares work. Thanks!
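P.S. Writing down the sequence I have in mind for zp1, mostly so I can
sanity-check it before running anything (a sketch only; <new disk> is
whichever replacement disk I end up using, and gpt/disk42 is just the
first of the currently available spares):

# zpool remove zp1 gpt/disk42
# zpool replace zp1 6960108738988598438 gpt/disk42

or, after physically swapping in a new disk instead:

# zpool replace zp1 6960108738988598438 <new disk>

and then, to deal with the other offline device (was /dev/gpt/disk31),
replace it with something other than a configured spare:

# zpool replace zp1 3343132967577870793 <new disk>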