Date:      Sun, 16 Aug 2020 21:39:56 +1000
From:      Peter Jeremy <peter@rulingia.com>
To:        freebsd-fs@freebsd.org
Subject:   Gotcha with root on ZFS
Message-ID:  <20200816113956.GA63113@server.rulingia.com>

TL;DR: If you change the physical start or end locations of a root pool,
ensure all the old pool labels are explicitly destroyed (eg with
"zpool labelclear").

The following post is fairly long and includes details not directly
relevant to my problem, but I couldn't find much detail on the low-level
FreeBSD root mount procedure, so I'm hoping this will also shed some light
on the general root mount process.

I've been having problems booting my laptop after updating my kernel:
specific commits that have nothing to do with booting, filesystems or
devices would prevent booting, and adding devices to the kernel config
(devices that don't physically exist in my laptop) would then allow it to
boot.  Adding INVARIANTS, WITNESS and DIAGNOSTIC didn't turn up any issues.

After sprinkling printf's through the ZFS code, I've identified the issue
and believe it's an oddity in the FreeBSD-specific ZFS code that will only
get triggered in rather unusual circumstances.

The filesystem-independent call tree for mounting the root filesystem is:
start_init() => vfs_mountroot() => parse_mount() => kernel_mount() =>
vfs_donmount() => vfs_domount().  vfs_domount() passes the root filesystem
name (eg zroot/ROOT/r359130) to zfs_mount(), which imports the pool and
mounts the root filesystem based solely on the filesystem name.  Whilst the
kernel environment includes GUID details in vfs.zfs.boot.primary_pool and
vfs.zfs.boot.primary_vdev, these are ignored.
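
As an aside, the loader-provided values are easy to inspect: "kenv
vfs.zfs.boot.primary_pool" from a shell, or programmatically via kenv(2).
A minimal userland sketch (assuming the box was booted via the ZFS loader
path, so the tunables actually exist):

#include <sys/kenv.h>
#include <kenv.h>
#include <stdio.h>

/* Print the pool/vdev GUIDs that the loader placed in the kernel
 * environment - the values the root-mount path currently ignores. */
static void
show(const char *name)
{
	char buf[256];

	if (kenv(KENV_GET, name, buf, sizeof(buf)) == -1)
		printf("%s: not set\n", name);
	else
		printf("%s = %s\n", name, buf);
}

int
main(void)
{
	show("vfs.zfs.boot.primary_pool");
	show("vfs.zfs.boot.primary_vdev");
	return (0);
}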

Within the ZFS code, the root filesystem's pool name (eg zroot) is first
translated to a pool_guid/vdev_guid pair and then that pool_guid/vdev_guid
pair is used to configure the root pool, from which the root filesystem
is mounted.

For the translation step, vdev_geom_read_pool_label() tastes every GEOM
provider, looking for ZFS labels with a pool name matching the wanted root
pool name.  The first time a matching pool name is found, the associated
pool GUID is saved and subsequent checks match both the pool name and the
pool GUID.  It returns an array of pool configuration data (one entry per
matching provider).
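
My reading of that logic, boiled down, is below.  This is an illustrative
sketch, not the actual vdev_geom.c code (the label array and field names
are made up), but it shows why the answer depends on which provider
happens to be tasted first:

#include <stdint.h>
#include <string.h>

struct tasted_label {
	const char *pool_name;	/* pool name found in the on-disk label */
	uint64_t    pool_guid;	/* pool GUID found in the same label */
};

/*
 * Walk the labels in taste order; the first one whose name matches
 * "pins" the pool GUID, and later labels only count if they carry that
 * same GUID.  A stale label tasted first therefore wins.
 */
static uint64_t
pick_root_pool_guid(const char *wanted, const struct tasted_label *l, int n)
{
	uint64_t pinned = 0;	/* 0: no match yet */

	for (int i = 0; i < n; i++) {
		if (strcmp(l[i].pool_name, wanted) != 0)
			continue;
		if (pinned == 0)
			pinned = l[i].pool_guid;	/* first match wins */
		else if (l[i].pool_guid != pinned)
			continue;	/* same name, different pool: ignored */
		/* (the real code appends this provider's config here) */
	}
	return (pinned);
}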

spa_generate_rootconf() scans the array of configs and selects the one with
the highest TXG.  That config is converted into the wanted vdev descriptor.
vdev_geom_open() uses the pool and vdev GUIDs out of that vdev descriptor to
locate and open the root pool (first trying the pathname by which the pool
was last mounted and then scanning all GEOM providers).
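
In other words, the selection step amounts to "take the newest label",
roughly as in the sketch below (the struct is a hypothetical stand-in for
the nvlist handling in spa_generate_rootconf()):

#include <stddef.h>
#include <stdint.h>

struct rootconf {
	uint64_t pool_guid;
	uint64_t vdev_guid;
	uint64_t txg;	/* transaction group the label was last written in */
};

/* Pick the config with the highest TXG out of the tasted candidates. */
static const struct rootconf *
newest_config(const struct rootconf *c, size_t n)
{
	const struct rootconf *best = NULL;

	for (size_t i = 0; i < n; i++)
		if (best == NULL || c[i].txg > best->txg)
			best = &c[i];
	return (best);
}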

In my case, I rebuilt my root pool a month or two ago to increase the
available swap space, and tweaked a few other partition parameters.  As a
result, my root pool no longer ends right at the end of the disk (with that
space currently free).  One consequence of that is that tasting ada0 as a
whole finds one (out of 4) labels belonging to my previous root pool
(which had the same name as my current root pool).  So, if ada0 is tasted
before ada0p3 during the translation step, the kernel will go looking for my
old root pool, which no longer exists, so vdev_geom_open() (and hence the
entire mountroot process) will fail.  And it turns out that the order of
GEOM providers isn't fixed but seems to change randomly depending on what
devices are present in the kernel (even if they aren't probed).  So, some
combinations of kernel devices result in ada0 coming before ada0p3 in the
GEOM provider list, and those kernels fail to boot.
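
For anyone wondering why a stale label is visible at all: ZFS keeps four
copies of the label on each vdev - two at the front and two in the last
512KiB - and the offsets of the trailing pair are computed from the size of
whatever is being tasted.  My old pool ended at the end of the disk, so a
whole-disk taste of ada0 looks exactly where that pool left (at least one
of) its trailing labels.  A sketch of the arithmetic, based on my
understanding of the on-disk format (the real computation is
vdev_label_offset(); the device sizes below are just examples):

#include <stdint.h>
#include <stdio.h>

#define	NLABELS		4
#define	LABEL_SIZE	(256 * 1024)		/* 256KiB per label copy */

/*
 * Offset of label 'l' within a provider of 'psize' bytes: labels 0 and 1
 * at the front, labels 2 and 3 in the last 512KiB.  (Size alignment is
 * omitted for clarity.)
 */
static uint64_t
label_offset(uint64_t psize, int l)
{
	if (l < NLABELS / 2)
		return ((uint64_t)l * LABEL_SIZE);
	return (psize - (uint64_t)(NLABELS - l) * LABEL_SIZE);
}

int
main(void)
{
	uint64_t disk = 500ULL << 30;		/* example: 500GiB ada0 */
	uint64_t part = disk - (16ULL << 30);	/* example: ada0p3 ends short of the disk */

	for (int l = 0; l < NLABELS; l++)
		printf("label %d: whole disk @ %ju, partition @ %ju\n", l,
		    (uintmax_t)label_offset(disk, l),
		    (uintmax_t)label_offset(part, l));
	return (0);
}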

I'm not sure if this is a scenario that the boot code should explicitly
handle - there are a variety of ways vdev_geom_read_pool_label() could
handle this situation:
* Check against vfs.zfs.boot.primary_pool (though this would impact the
  ability to use the manual mountroot process to recover).
* Where multiple pool GUIDs are found, look for a "best" pool by considering
  the number of labels found (vdev_geom_read_config() reports this) and
  maybe other sanity checks (a rough sketch of this follows the list).
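
To make the second option concrete, here is one possible notion of "best"
(hypothetical structure; prefer the candidate with more intact labels and
break ties on TXG):

#include <stdint.h>

struct candidate {
	uint64_t pool_guid;
	int	 nlabels;	/* labels found carrying this pool GUID (max 4) */
	uint64_t txg;		/* newest TXG seen among those labels */
};

/* Return non-zero if 'a' looks like a better root-pool candidate than 'b'. */
static int
better_candidate(const struct candidate *a, const struct candidate *b)
{
	if (a->nlabels != b->nlabels)
		return (a->nlabels > b->nlabels);
	return (a->txg > b->txg);
}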

At the very least, I believe this scenario should be documented (since it
took me quite a while to debug).

-- 
Peter Jeremy



