From owner-freebsd-fs@freebsd.org Sat Mar 14 15:58:30 2020
From: Alan Somers <asomers@gmail.com>
Date: Sat, 14 Mar 2020 09:58:17 -0600
To: Andriy Gapon
Cc: Willem Jan Withagen, FreeBSD Filesystems
Subject: Re: ZFS pools in "trouble"

On Sat, Mar 14, 2020 at 9:14 AM Andriy Gapon wrote:

> On 14/03/2020 13:00, Willem Jan Withagen wrote:
> > On 27-2-2020 09:11, Andriy Gapon wrote:
> >> On 26/02/2020 19:09, Willem Jan Withagen wrote:
> >>> Hi,
> >>>
> >>> I'm using my pools in perhaps a rather awkward way, as underlying storage
> >>> for my ceph cluster: 1 disk per pool, with log and cache on SSD.
> >>>
> >>> For one reason or another one of the servers has crashed and does not
> >>> really want to read several of the pools:
> >>> ----
> >>>   pool: osd_2
> >>>  state: UNAVAIL
> >>> Assertion failed: (reason == ZPOOL_STATUS_OK), file
> >>> /usr/src/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c, line 5098.
> >>> Abort (core dumped)
> >>> ----
> >>>
> >>> The code there looks like:
> >>> ----
> >>>         default:
> >>>                 /*
> >>>                  * The remaining errors can't actually be generated, yet.
> >>>                  */
> >>>                 assert(reason == ZPOOL_STATUS_OK);
> >>> ----
> >>> And this already on 3 disks.
> >>> Running:
> >>> FreeBSD 12.1-STABLE (GENERIC) #0 r355208M: Fri Nov 29 10:43:47 CET 2019
> >>>
> >>> Now this is a test cluster, so no harm there in matters of data loss.
> >>> And the ceph cluster can probably rebuild everything if I do not lose
> >>> too many disks.
> >>>
> >>> But the problem also lies in the fact that not all disks are recognized
> >>> by the kernel, and not all disks end up mounted. So I need to remove a
> >>> pool first to get more disks online.
> >>>
> >>> Is there anything I can do to get them back online?
> >>> Or is this a lost cause?
> >>
> >> Depends on what 'reason' is.
> >> I mean the value of the variable.
> >
> > I ran into the same problem, even though I deleted the zpool in error.
> >
> > So I augmented this code with a printf:
> >
> > Error: Reason not found: 5
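
A minimal version of that kind of instrumentation (an illustration only, not
Willem's actual patch) is to print the unexpected value in the default case of
the switch in status_callback(), just before the assert fires, so the failure
is at least diagnosable:
----
        default:
                /*
                 * The remaining errors can't actually be generated, yet.
                 * Report the unexpected status value before aborting.
                 */
                (void) fprintf(stderr, "Error: Reason not found: %d\n",
                    (int)reason);
                assert(reason == ZPOOL_STATUS_OK);
----
With that in place the silent abort turns into the "Reason not found: 5"
message, and 5 can then be looked up in the zpool_status_t enum.
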
> It seems that 5 is ZPOOL_STATUS_BAD_GUID_SUM and there is a discrepancy
> between what the code in status_callback() expects and what actually happens.
> Looks like check_status() can actually return ZPOOL_STATUS_BAD_GUID_SUM:
>
>         /*
>          * Check that the config is complete.
>          */
>         if (vs->vs_state == VDEV_STATE_CANT_OPEN &&
>             vs->vs_aux == VDEV_AUX_BAD_GUID_SUM)
>                 return (ZPOOL_STATUS_BAD_GUID_SUM);
>
> I think that VDEV_AUX_BAD_GUID_SUM typically means that a device is missing
> from the pool, e.g. a log device. Or there is some other discrepancy between
> the expected pool vdevs and the found pool vdevs.

This situation can also arise if you remove a device from the pool (either
physically or because the device failed), then change the pool's
configuration, for example by replacing the missing vdev, and then reinsert
the original device and reboot. ZFS may pick up the old device's label, and
the guid sums won't match.

Another way you can get this error is by creating a ZFS pool on a disk, then
repartitioning the disk and creating a new pool. It's possible that "zpool
import" will see the old, invalid label before it sees the newer, valid label.
I've even run into a situation where "zpool import" worked fine but the
bootloader failed, because they interrogate labels in slightly different ways.

-Alan
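
To picture the guid-sum check behind this status: the pool's on-disk state
records, roughly, the sum of the guids of all vdevs that are supposed to be
part of the pool, and at import time ZFS re-sums the guids of the devices it
actually found. A missing log device, or a stale label contributing an old
guid, makes the two sums disagree. A small self-contained model of that
comparison (the names below are illustrative, not the actual OpenZFS symbols):
----
/*
 * Simplified model of the vdev guid-sum comparison.  Illustrative only;
 * not the real OpenZFS data structures or code.
 */
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct fake_vdev {
	const char *path;
	uint64_t guid;
	int present;		/* found at import time? */
};

/* Sum the guids of the vdevs that were actually discovered. */
static uint64_t
found_guid_sum(const struct fake_vdev *vd, size_t n)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++)
		if (vd[i].present)
			sum += vd[i].guid;
	return (sum);
}

int
main(void)
{
	/* One data disk plus a separate log device, as in the setup above. */
	struct fake_vdev pool[] = {
		{ "da2",     0x1111222233334444ULL, 1 },
		{ "log-ssd", 0x5555666677778888ULL, 0 },	/* not found */
	};
	/* Sum recorded while the configuration was last known complete. */
	uint64_t expected = 0x1111222233334444ULL + 0x5555666677778888ULL;
	uint64_t found = found_guid_sum(pool, 2);

	if (found != expected) {
		/* The condition zpool reports as a bad guid sum. */
		(void) printf("guid sum mismatch: expected %" PRIx64
		    ", found %" PRIx64 "\n", expected, found);
		return (1);
	}
	(void) printf("config complete\n");
	return (0);
}
----
In the setup described at the top of the thread (log and cache on an SSD), the
most plausible missing contribution is a separate log device that was not
present when the pools were probed, which matches Andriy's reading of
VDEV_AUX_BAD_GUID_SUM.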