Date: Sat, 14 Mar 2020 09:58:17 -0600
From: Alan Somers <asomers@freebsd.org>
To: Andriy Gapon <avg@freebsd.org>
Cc: Willem Jan Withagen <wjw@digiware.nl>, FreeBSD Filesystems <freebsd-fs@freebsd.org>
Subject: Re: ZFS pools in "trouble"
Message-ID: <CAOtMX2g+bChQ1nZwVdhXeBbGvJf1V6TEDmQUOf1p0Dm8-HX8Zw@mail.gmail.com>
In-Reply-To: <24916dd7-f22c-b55b-73ae-1a2bfe653f9c@FreeBSD.org>
References: <71e1f22a-1261-67d9-e41d-0f326bf81469@digiware.nl> <91e1cd09-b6b8-f107-537f-ae2755aba087@FreeBSD.org> <15bde4a5-0a2e-9984-dfd6-fce39f079f52@digiware.nl> <24916dd7-f22c-b55b-73ae-1a2bfe653f9c@FreeBSD.org>
On Sat, Mar 14, 2020 at 9:14 AM Andriy Gapon <avg@freebsd.org> wrote:
> On 14/03/2020 13:00, Willem Jan Withagen wrote:
> > On 27-2-2020 09:11, Andriy Gapon wrote:
> >> On 26/02/2020 19:09, Willem Jan Withagen wrote:
> >>> Hi,
> >>>
> >>> I'm using my pools in perhaps a rather awkward way, as underlying storage for my
> >>> ceph cluster:
> >>> 1 disk per pool, with log and cache on SSD
> >>>
> >>> For one reason or another one of the servers has crashed and does not really want
> >>> to read several of the pools:
> >>> ----
> >>>   pool: osd_2
> >>>  state: UNAVAIL
> >>> Assertion failed: (reason == ZPOOL_STATUS_OK), file
> >>> /usr/src/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c, line 5098.
> >>> Abort (core dumped)
> >>> ----
> >>>
> >>> The code there is like:
> >>> ----
> >>>         default:
> >>>                 /*
> >>>                  * The remaining errors can't actually be generated, yet.
> >>>                  */
> >>>                 assert(reason == ZPOOL_STATUS_OK);
> >>> ----
> >>> And this already on 3 disks.
> >>> Running:
> >>> FreeBSD 12.1-STABLE (GENERIC) #0 r355208M: Fri Nov 29 10:43:47 CET 2019
> >>>
> >>> Now this is a test cluster, so no harm there in matters of data loss.
> >>> And the ceph cluster probably can rebuild everything if I do not lose too many
> >>> disks.
> >>>
> >>> But the problem also lies in the fact that not all disks are recognized by the
> >>> kernel, and not all disks end up mounted. So I need to remove a pool first to get
> >>> more disks online.
> >>>
> >>> Is there anything I can do to get them back online?
> >>> Or is this a lost cause?
> >>
> >> Depends on what 'reason' is.
> >> I mean the value of the variable.
> >
> > I ran into the same problem, even though I deleted the zpool in error.
> >
> > So I augmented this code with a printf:
> >
> > Error: Reason not found: 5
>
> It seems that 5 is ZPOOL_STATUS_BAD_GUID_SUM and there is a discrepancy between
> what the code in status_callback() expects and what actually happens.
> Looks like check_status() can actually return ZPOOL_STATUS_BAD_GUID_SUM:
>         /*
>          * Check that the config is complete.
>          */
>         if (vs->vs_state == VDEV_STATE_CANT_OPEN &&
>             vs->vs_aux == VDEV_AUX_BAD_GUID_SUM)
>                 return (ZPOOL_STATUS_BAD_GUID_SUM);
>
> I think that VDEV_AUX_BAD_GUID_SUM typically means that a device is missing from
> the pool, e.g. a log device, or that there is some other discrepancy between the
> expected pool vdevs and the found pool vdevs.

This situation can also arise if you remove a device from the pool (either
physically, or because the device failed), then change the pool's
configuration, such as by replacing the missing vdev, and then reinsert the
original device and reboot.  ZFS may pick up the old device's label, and the
guid sums won't match.

Another way you can get this error is by creating a ZFS pool on a disk, then
repartitioning the disk and creating a new pool.  It's possible that "zpool
import" will see the old, invalid label before it sees the newer, valid label.
I've even run into a situation where "zpool import" worked fine but the
bootloader failed, because they interrogate labels in slightly different ways.

-Alan
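
For anyone who wants to see the raw status value without patching zpool(8)
itself, a minimal standalone sketch along the lines of the printf Willem added
might look like the following.  This is an illustration only: the program name
and build line are made up, and the exact signature of zpool_get_status()
varies between libzfs versions (older ones take only two arguments, without
the errata pointer), so check it against the libzfs.h actually installed on
the system.
----
/*
 * printstatus.c -- print the raw zpool_status_t value for a pool instead
 * of relying on zpool(8)'s status_callback(), which asserts on values it
 * does not expect.  Hypothetical example; build roughly like
 *     cc -o printstatus printstatus.c -lzfs -lnvpair
 * (exact include paths and libraries depend on the system).
 */
#include <stdio.h>
#include <stdlib.h>
#include <libzfs.h>

int
main(int argc, char **argv)
{
        libzfs_handle_t *hdl;
        zpool_handle_t *zhp;
        zpool_status_t reason;
        zpool_errata_t errata;
        char *msgid;

        if (argc != 2) {
                (void) fprintf(stderr, "usage: %s <poolname>\n", argv[0]);
                return (EXIT_FAILURE);
        }
        if ((hdl = libzfs_init()) == NULL) {
                (void) fprintf(stderr, "libzfs_init failed\n");
                return (EXIT_FAILURE);
        }
        /* zpool_open_canfail() also opens pools in a faulted/UNAVAIL state. */
        if ((zhp = zpool_open_canfail(hdl, argv[1])) == NULL) {
                (void) fprintf(stderr, "cannot open pool '%s'\n", argv[1]);
                libzfs_fini(hdl);
                return (EXIT_FAILURE);
        }
        /*
         * Older libzfs versions declare zpool_get_status() without the
         * errata argument; adjust to match the local libzfs.h.
         */
        reason = zpool_get_status(zhp, &msgid, &errata);
        (void) printf("pool '%s': zpool_status_t = %d\n",
            zpool_get_name(zhp), (int)reason);

        zpool_close(zhp);
        libzfs_fini(hdl);
        return (EXIT_SUCCESS);
}
----
Independent of that, "zdb -l <device>" prints the vdev labels that "zpool
import" would consider, which can help confirm whether a stale label from an
earlier pool is still present on the disk; "zpool labelclear" can remove such
a label, but only do that on a device whose current contents you are willing
to discard.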