Date: Fri, 06 Sep 2024 16:22:44 -0500
From: Wes Morgan <morganw@gmail.com>
To: Chris Ross <cross+freebsd@distal.com>
Cc: freebsd-fs@freebsd.org
Subject: Re: Unable to replace drive in raidz1
Message-ID: <50B791D8-F0CC-431E-93B8-834D57AB3C14@gmail.com>
In-Reply-To: <E93A9CA8-6705-4C26-9F33-B620A365F4BD@distal.com>
References: <5ED5CB56-2E2A-4D83-8CDA-6D6A0719ED19@distal.com>
 <AC67D073-D476-41F5-AC53-F671430BB493@distal.com>
 <CAOtMX2h52d0vtceuwcDk2dzkH-fZW32inhk-dfjLMJxetVXKYg@mail.gmail.com>
 <CB79EC2B-E793-4561-95E7-D1CEEEFC1D72@distal.com>
 <CAOtMX2i_zFYuOnEK_aVkpO_M8uJCvGYW%2BSzLn3OED4n5fKFoEA@mail.gmail.com>
 <6A20ABDA-9BEA-4526-94C1-5768AA564C13@distal.com>
 <CAOtMX2jfcd43sBpHraWA=5e_Ka=hMD654m-5=boguPPbYXE4yw@mail.gmail.com>
 <0CF1E2D7-6C82-4A8B-82C3-A5BF1ED939CF@distal.com>
 <CAOtMX2hRJvt9uhctKvXO4R2tUNq9zeCEx6NZmc7Vk7fH=HO8eA@mail.gmail.com>
 <29003A7C-745D-4A06-8558-AE64310813EA@distal.com>
 <42346193-AD06-4D26-B0C6-4392953D21A3@gmail.com>
 <E6C615C1-E9D2-4F0D-8DC2-710BAAF10954@distal.com>
 <E85B00B1-7205-486D-800C-E6837780E819@gmail.com>
 <E93A9CA8-6705-4C26-9F33-B620A365F4BD@distal.com>

On September 6, 2024 2:34:36 PM CDT, Chris Ross <cross+freebsd@distal.com> wrote:
>
>> On Sep 6, 2024, at 15:16, Wes Morgan <morganw@gmail.com> wrote:
>>
>> You probably don't want that. You will have to use the glabel dev, which will not be the same size as your other devices. IIRC you have no control over what device node the system finds first for the pool. Even if you use GPT labels, the daXpY device will still exist.
>
>Right. But if I don’t _use_ those device names, it won’t matter. If I use /dev/label/foo, or /dev/gpt/foo, I’ll just always use those. I just did that with the ufs disk I have since it moved names, now it’s “/dev/ufs/drive12” in /etc/fstab et al.

The labels are helpful for fstab, but zfs doesn't need fstab. In the early days of zfs on freebsd the unpartitioned device was recommended; maybe that's not accurate any longer, but I still follow it for a pool that contains vdevs with multiple devices (raidz).

If you use, e.g., da0 in a pool, you cannot later replace it with a labeled device of the same size; it won't have enough sectors.

>I want to have some sort of label. I’d rather not have to add a partitioning scheme to the disk if I know I’m just going to use the whole disk just to get a label, but I suppose if I have to I can. Though I’d have to do it one disk at a time. :-)

ZFS will absolutely find the device if it is readable. The label on every device contains enough metadata to describe the entire vdev (and the pool I believe), including the missing devices. It's very good at finding them. The labelclear command was added because it was a pain to get zfs to give up on a disk that has been repurposed. You really don't need the labels, but if you have trouble figuring out which disk is which, that may be the only way for you to be sure.

>>
>>> The former da3 is off-line, out of the chassis. I replaced a disk in a full chassis, having them both online at the same time is not possible. That drive in ZFS’s mind is only faulted because I tried “zpool offline -f” on it to see if that helped.
>>
>> It sounds like you have replaced the wrong device. Check the "zpool history" to see what you did.
>>
>> In your earlier message, three devices were shown in each raidz, when what you should be seeing is that one raidz has an offline device identified by guid and maybe "was /dev/da3" that is being replaced, along with the replacement device. I don't see any of that.
>
>History attached. There is no replacement device (sub-vdev) until after the “zpool replace” starts, which it won’t.
>
>>> I didn’t initiate a replace until after the disks were physically changed. Although in this conversation realize that things likely got confused by the replacement in the kernel’s mind of da3 with what used to be da4. :-/
>>
>> This is why your zpool history will be helpful. What did you actually try to replace, and what did you mean to replace?
>
>All of my history since the last previous boot in May.
>
>2024-09-05.09:40:14 zpool offline tank da3
>2024-09-05.14:26:44 zpool import -c /etc/zfs/zpool.cache -a -N
>2024-09-05.14:32:45 zpool import -c /etc/zfs/zpool.cache -a -N
>2024-09-05.14:52:18 zpool offline tank da3
>2024-09-05.14:53:51 zpool offline tank da3
>2024-09-05.14:59:43 zpool offline -f tank da3
>2024-09-05.15:02:53 zpool clear tank
>2024-09-05.15:07:41 zpool online tank da3
>2024-09-05.15:10:00 zpool add tank spare da10
>2024-09-05.15:10:20 zpool offline -f tank da3
>2024-09-05.15:35:23 zpool remove tank da10
>2024-09-05.15:54:35 zpool scrub tank
>2024-09-05.16:01:12 zpool set autoreplace=on tank
>2024-09-05.16:01:24 zpool set autoexpand=on tank
>2024-09-05.16:02:16 zpool add -o ashift=9 tank spare da10
>2024-09-06.10:10:20 zpool remove tank da10
>
>So, I offline’d the disk-to-be-replaced at 09:40 yesterday, then I shut the system down, removed that physical device replacing it with a larger disk, and rebooted. I suspect the “offline”s after that are me experimenting when it was telling me it couldn’t start the replace action I was asking for.

This is probably where you made your mistake. Rebooting shifted another device into da3. When you tried to offline it, you were probably either targeting a device in a different raidz or one that wasn't in the pool. The output of those original offline commands would have been informative. You could also check dmesg and map the serial numbers to device assignments to figure out what device moved to da3.

>The scrub I started yesterday just because the replace says something about an operation in progress, so I did that. It completed with no issues, but nothing changed w.r.t. my current problem.
>
>I’m pretty sure the problem here is that the old da3 went away, and a new da3 came online as a member of raidz1-1. The new disk I added came online as da10, for some reason. I had to resolve the issue of the UFS disk which used to be da10 now being da9, but that was easy enough. Just unexpected.

Sounds about right. In another message it seemed like the pool had started an autoreplace. So I assume you have zfsd enabled? That is what issues the replace command. Strange that it is not anywhere in the pool history. There should be syslog entries for any actions it took.

In your case, it appears that you had two missing devices - the original "da3" that was physically removed, and the new da3 that you forced offline. You added da10 as a spare, when what you needed to do was a replace. Spare devices do not auto-replace without zfsd running and autoreplace set to on.

This should all be reported in zpool status. In your original message, there is no sign of a replacement in progress or a spare device, assuming you didn't omit something. If the pool is only showing that a single device is missing, and that device is to be replaced by da10, zero out the first and last sectors (I think a zfs label is 128k?) to wipe out any labels and use the replace command, not spare, e.g. "zpool replace tank da3 da10", or use the missing guid as suggested elsewhere. This should work based on the information provided.
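
To make the "zero out the first and last sectors" step concrete, here is a minimal Python sketch of which byte ranges that would need to cover. The label size quoted above is a guess; as far as I know the on-disk format uses 256 KiB per vdev label, with two copies at the front of the device and two at the end, the trailing pair aligned down to a 256 KiB boundary. Treat those numbers as assumptions. The names label_regions and wipe_label_regions are made up for illustration, and the demo writes to a scratch file, not a real disk. On a live system "zpool labelclear" is the supported way to do this.

```python
import os
import tempfile

LABEL_SIZE = 256 * 1024   # assumption: one ZFS vdev label is 256 KiB
LABELS_PER_END = 2        # assumption: two labels at the front, two at the back


def label_regions(device_size):
    """Return [(offset, length), ...] covering the front and back label areas."""
    span = LABEL_SIZE * LABELS_PER_END
    if device_size < 2 * span:
        raise ValueError("device too small to hold four labels")
    # The trailing labels are assumed to sit below the last 256 KiB boundary,
    # so wipe from there through the very end of the device.
    aligned = (device_size // LABEL_SIZE) * LABEL_SIZE
    tail_start = aligned - span
    return [(0, span), (tail_start, device_size - tail_start)]


def wipe_label_regions(path, device_size):
    """Overwrite the assumed label areas of `path` with zeros."""
    with open(path, "r+b") as dev:
        for offset, length in label_regions(device_size):
            dev.seek(offset)
            dev.write(b"\0" * length)


# Demo on a scratch file standing in for a disk (hypothetical size).
size = 3 * 1024 * 1024 + 1000
with tempfile.NamedTemporaryFile(delete=False) as scratch:
    scratch.write(b"\xff" * size)
wipe_label_regions(scratch.name, size)
with open(scratch.name, "rb") as f:
    data = f.read()
os.unlink(scratch.name)
print(label_regions(size))
```

For a real disk the size would have to come from something like diskinfo, since a device node does not report its size the way a regular file does. Double-check the device name before writing to it.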