Subject: Re: Unable to replace drive in raidz1
From: Chris Ross
Date: Fri, 6 Sep 2024 17:48:57 -0400
To: Wes Morgan
Cc: FreeBSD Filesystems
List-Archive: https://lists.freebsd.org/archives/freebsd-fs

> On Sep 6, 2024, at 17:22, Wes Morgan wrote:
>
> The labels are helpful for fstab, but zfs doesn't need fstab. In the
> early days of ZFS on FreeBSD the unpartitioned device was recommended;
> maybe that's not accurate any longer, but I still follow it for a pool
> that contains vdevs with multiple devices (raidz).
>
> If you use, e.g., da0 in a pool, you cannot later replace it with a
> labeled device of the same size; it won't have enough sectors.

The problem is shown here. da3 was in a pool. Then, when the system
rebooted, da3 was the kernel's name for a different device in a
different pool.
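(For anyone trying to untangle the same thing: a quick way to see which
physical drive the kernel currently calls da3 is to compare serial
numbers. A minimal sketch, assuming the device names from this thread;
geom reports the drive serial on its "ident:" line, which is the same
string that shows up in the diskid/DISK-* label:)

% sysctl kern.disks     # current kernel device names
% geom disk list da3    # "ident:" line = drive serial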
Had I known then how to interact with the guid (status -g), I likely
would've been fine.

>> So, I offline'd the disk-to-be-replaced at 09:40 yesterday, then I
>> shut the system down, removed that physical device, replacing it
>> with a larger disk, and rebooted. I suspect the "offline"s after
>> that are me experimenting when it was telling me it couldn't start
>> the replace action I was asking for.
>
> This is probably where you made your mistake. Rebooting shifted
> another device into da3. When you tried to offline it, you were
> probably either targeting a device in a different raidz or one that
> wasn't in the pool. The output of those original offline commands
> would have been informative. You could also check dmesg and map the
> serial numbers to device assignments to figure out what device moved
> to da3.

I offline'd "da3" before I rebooted. After rebooting, I tried the
obvious and (I thought) correct "zpool replace da3 da10", only to get
the error I've been getting since. Again, had I known how to use the
guid for the device that used to be da3 but now isn't, that might've
worked. I can't know now.

Then, while trying to fix the problem, I likely made it worse by trying
to interact with da3, which in the pool's brain was a missing disk in
raidz1-0, but the kernel also knew /dev/da3 to be a working disk (one
that happened to be in raidz1-1). I feel that ZFS did something wrong
somewhere if it _ever_ tried to talk to /dev/da3 when I said "da3"
after I rebooted and it found that device to be part of raidz1-1, but.

> Sounds about right. In another message it seemed like the pool had
> started an autoreplace. So I assume you have zfsd enabled? That is
> what issues the replace command. Strange that it is not anywhere in
> the pool history. There should be syslog entries for any actions it
> took.

I don't think so. That message about some "already in replacing/spare
config" came up before anything else. At that point, I'd never had a
spare in this pool, and there was no replace shown in zpool status.

> In your case, it appears that you had two missing devices - the
> original "da3" that was physically removed, and the new da3 that you
> forced offline. You added da10 as a spare, when what you needed to do
> was a replace. Spare devices do not auto-replace without zfsd running
> and autoreplace set to on.

I did offline "da3" a couple of times, again thinking I was working
with what zpool showed as "da3". If it did anything with /dev/da3
there, then I think that may be a bug. Or, at least, something that
should be made clearer. It _didn't_ offline diskid/DISK-K1GMBN9D from
raidz1-1, which is what the kernel has at da3. So.

> This should all be reported in zpool status. In your original
> message, there is no sign of a replacement in progress or a spare
> device, assuming you didn't omit something. If the pool is only
> showing that a single device is missing, and that device is to be
> replaced by da10, zero out the first and last sectors (I think a zfs
> label is 128k?) to wipe out any labels and use the replace command,
> not spare, e.g. "zpool replace tank da3 da10", or use the missing
> guid as suggested elsewhere. This should work based on the
> information provided.
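(One note on "zero out the first and last sectors": ZFS actually keeps
four 256K vdev labels, two at the front of the device and two at the
end, so clearing has to touch both ends of the disk. A minimal sketch
of what I mean below when I say I cleared da10's label, assuming the
device names from this thread; zpool labelclear is the supported tool,
and the dd lines are the manual equivalent:)

% sudo zpool labelclear -f /dev/da10

(or manually, wiping 1MB at each end; oseek counts bs-sized blocks)
% diskinfo /dev/da10      # third field = media size in bytes
% sudo dd if=/dev/zero of=/dev/da10 bs=1m count=1
% sudo dd if=/dev/zero of=/dev/da10 bs=1m oseek=<size_in_MB - 1>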
I've never seen a replacement going on, and I have had the new disk
"da10" as a spare a couple of times while testing. But it wasn't left
there after I determined that that also didn't let me get it replaced
into the raidz.

And that attempt to replace is what I've tried many times, with
multiple IDs. I have cleared the label on da10 multiple times. That
replace doesn't work, giving this error message in all cases.

          - Chris

% glabel status
                                      Name  Status  Components
            diskid/DISK-BTWL503503TW480QGN     N/A  ada0
                                 gpt/l2arc     N/A  ada0p1
gptid/9d00849e-0b82-11ec-a143-84b2612f2c38     N/A  ada0p1
                      diskid/DISK-K1GMBN9D     N/A  da3
                      diskid/DISK-3WJDHJ2J     N/A  da6
                      diskid/DISK-3WK3G1KJ     N/A  da7
                      diskid/DISK-3WJ7ZMMJ     N/A  da8
                      diskid/DISK-K1GMEDMD     N/A  da4
                      diskid/DISK-K1GMAX1D     N/A  da5
                               ufs/drive12     N/A  da9
                      diskid/DISK-ZGG0A2PA     N/A  da10

% zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning
        in a degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the
        device repaired.
  scan: scrub repaired 0B in 17:14:03 with 0 errors on Fri Sep  6 09:08:34 2024
config:

        NAME                      STATE     READ WRITE CKSUM
        tank                      DEGRADED     0     0     0
          raidz1-0                DEGRADED     0     0     0
            da3                   FAULTED      0     0     0  external device fault
            da1                   ONLINE       0     0     0
            da2                   ONLINE       0     0     0
          raidz1-1                ONLINE       0     0     0
            diskid/DISK-K1GMBN9D  ONLINE       0     0     0
            diskid/DISK-K1GMEDMD  ONLINE       0     0     0
            diskid/DISK-K1GMAX1D  ONLINE       0     0     0
          raidz1-2                ONLINE       0     0     0
            diskid/DISK-3WJDHJ2J  ONLINE       0     0     0
            diskid/DISK-3WK3G1KJ  ONLINE       0     0     0
            diskid/DISK-3WJ7ZMMJ  ONLINE       0     0     0

errors: No known data errors

% sudo zpool replace tank da3 da10
Password:
cannot replace da3 with da10: already in replacing/spare config; wait
for completion or use 'zpool detach'

% zpool status -g tank
  pool: tank
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning
        in a degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the
        device repaired.
  scan: scrub repaired 0B in 17:14:03 with 0 errors on Fri Sep  6 09:08:34 2024
config:

        NAME                      STATE     READ WRITE CKSUM
        tank                      DEGRADED     0     0     0
          16506780107187041124    DEGRADED     0     0     0
            9127016430593660128   FAULTED      0     0     0  external device fault
            4094297345166589692   ONLINE       0     0     0
            17850258180603290288  ONLINE       0     0     0
          5104119975785735782     ONLINE       0     0     0
            6752552549817423876   ONLINE       0     0     0
            9072227575611698625   ONLINE       0     0     0
            13778609510621402511  ONLINE       0     0     0
          11410204456339324959    ONLINE       0     0     0
            1083322824660576293   ONLINE       0     0     0
            12505496659970146740  ONLINE       0     0     0
            11847701970749615606  ONLINE       0     0     0

errors: No known data errors

% sudo zpool replace tank 9127016430593660128 da10
cannot replace 9127016430593660128 with da10: already in
replacing/spare config; wait for completion or use 'zpool detach'

% sudo zpool replace tank 9127016430593660128 diskid/DISK-ZGG0A2PA
cannot replace 9127016430593660128 with diskid/DISK-ZGG0A2PA: already
in replacing/spare config; wait for completion or use 'zpool detach'
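P.S. Since the error keeps pointing at a replacing/spare config that
zpool status never shows, the next things I plan to try, sketched here
and not yet run: zdb -l to dump whatever vdev label is actually on the
new disk, and the detach the error message itself keeps suggesting,
aimed at the faulted guid from the status -g output above (assuming
detach accepts a guid here), then retrying the replace:

% sudo zdb -l /dev/da10       # show any stale label/config on the new disk
% sudo zpool detach tank 9127016430593660128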