Date:      Sat, 14 Mar 2020 17:14:10 +0200
From:      Andriy Gapon <avg@FreeBSD.org>
To:        Willem Jan Withagen <wjw@digiware.nl>, FreeBSD Filesystems <freebsd-fs@FreeBSD.org>
Subject:   Re: ZFS pools in "trouble"
Message-ID:  <24916dd7-f22c-b55b-73ae-1a2bfe653f9c@FreeBSD.org>
In-Reply-To: <15bde4a5-0a2e-9984-dfd6-fce39f079f52@digiware.nl>
References:  <71e1f22a-1261-67d9-e41d-0f326bf81469@digiware.nl> <91e1cd09-b6b8-f107-537f-ae2755aba087@FreeBSD.org> <15bde4a5-0a2e-9984-dfd6-fce39f079f52@digiware.nl>

On 14/03/2020 13:00, Willem Jan Withagen wrote:
> On 27-2-2020 09:11, Andriy Gapon wrote:
>> On 26/02/2020 19:09, Willem Jan Withagen wrote:
>>> Hi,
>>>
>>> I'm using my pools in perhaps a rather awkward way as underlying storage for my
>>> ceph cluster:
>>>      1 disk per pool, with log and cache on SSD
>>>
>>> For one reason or another, one of the servers has crashed and does not really
>>> want to read several of the pools:
>>> ----
>>>    pool: osd_2
>>>   state: UNAVAIL
>>> Assertion failed: (reason == ZPOOL_STATUS_OK), file
>>> /usr/src/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c, line 5098.
>>> Abort (core dumped)
>>> ----
>>>
>>> The code there is like:
>>> ----
>>>          default:
>>>                  /*
>>>                   * The remaining errors can't actually be generated, yet.
>>>                   */
>>>                  assert(reason == ZPOOL_STATUS_OK);
>>>
>>> ----
>>> And this has already happened on 3 disks.
>>> Running:
>>> FreeBSD 12.1-STABLE (GENERIC) #0 r355208M: Fri Nov 29 10:43:47 CET 2019
>>>
>>> Now this is a test cluster, so there is no harm in terms of data loss.
>>> And the ceph cluster can probably rebuild everything if I do not lose too many
>>> disks.
>>>
>>> But the problem is also that not all disks are recognized by the kernel, and
>>> not all disks end up mounted.  So I need to remove a pool first to get more
>>> disks online.
>>>
>>> Is there anything I can do to get them back online?
>>> Or is this a lost cause?
>> Depends on what 'reason' is.
>> I mean the value of the variable.
> 
> I ran into the same problem, even though I had deleted the zpool that was in error.
> 
> So I augmented this code with a printf:
> 
> Error: Reason not found: 5
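
(Presumably the instrumentation looks roughly like the sketch below; the exact format
string and its placement in the default case are an assumption based on the output
quoted above:)

        default:
                /*
                 * The remaining errors can't actually be generated, yet.
                 */
                printf("Error: Reason not found: %d\n", reason);
                assert(reason == ZPOOL_STATUS_OK);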

It seems that 5 is ZPOOL_STATUS_BAD_GUID_SUM and there is a discrepancy between
what the code in status_callback() expects and what actually happens.
Looks like check_status() can actually return ZPOOL_STATUS_BAD_GUID_SUM:
        /*
         * Check that the config is complete.
         */
        if (vs->vs_state == VDEV_STATE_CANT_OPEN &&
            vs->vs_aux == VDEV_AUX_BAD_GUID_SUM)
                return (ZPOOL_STATUS_BAD_GUID_SUM);

I think that VDEV_AUX_BAD_GUID_SUM typically means that a device is missing from
the pool.  E.g., a log device.  Or there is some other discrepancy between
expected pool vdevs and found pool vdevs.
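
If status_callback() is not supposed to trip the assert here, one option would be to
give ZPOOL_STATUS_BAD_GUID_SUM its own case in that switch.  A rough sketch only, not a
tested patch; the message wording and the 'zpool import -m' suggestion are mine:

        case ZPOOL_STATUS_BAD_GUID_SUM:
                /* Sum of the child vdev GUIDs does not match the pool config. */
                (void) printf(gettext("status: One or more devices are missing "
                    "from the system, so the sum of the\n\tchild vdev GUIDs does "
                    "not match the value stored in the pool config.\n"));
                (void) printf(gettext("action: Re-attach the missing device(s) "
                    "(e.g. a separate log device), or try\n\t'zpool import -m' "
                    "if only a log device is missing.\n"));
                break;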

-- 
Andriy Gapon


