Date: Sun, 16 Aug 2020 21:39:56 +1000
From: Peter Jeremy <peter@rulingia.com>
To: freebsd-fs@freebsd.org
Subject: Gotcha with root on ZFS
Message-ID: <20200816113956.GA63113@server.rulingia.com>
TL;DR: If you change the physical start or end location of a root pool, ensure all the old pool labels are explicitly destroyed.

The following post is fairly long and includes details not directly relevant to my problem, but I couldn't find much detail on the low-level FreeBSD root mount procedure, so I'm hoping this will also shed some light on the general root mount process.

I've been having problems booting my laptop after updating my kernel: specific commits that have nothing to do with booting, filesystems or devices would prevent booting, and adding devices (that don't physically exist in my laptop) would then allow it to boot. Adding INVARIANTS, WITNESS and DIAGNOSTIC didn't turn up any issues. After sprinkling printf's through the ZFS code, I've identified the issue and believe it's an oddity in the FreeBSD-specific ZFS code that will only get triggered in rather unusual circumstances.

The filesystem-independent call tree for mounting the root filesystem is:

  start_init() => vfs_mountroot() => parse_mount() => kernel_mount()
    => vfs_donmount() => vfs_domount()

vfs_domount() passes the root filesystem name (eg zroot/ROOT/r359130) to zfs_mount(), which imports the pool and mounts the root filesystem based solely on the filesystem name. Whilst the kernel environment includes GUID details in vfs.zfs.boot.primary_pool and vfs.zfs.boot.primary_vdev, these are ignored.

Within the ZFS code, the root filesystem's pool name (eg zroot) is first translated to a pool_guid/vdev_guid pair, and then that pool_guid/vdev_guid pair is used to configure the root pool, from which the root filesystem is mounted. For the translation step, vdev_geom_read_pool_label() tastes every GEOM provider, looking for ZFS labels with a pool name matching the wanted root pool name.
The first time a matching pool name is found, the associated pool GUID is saved, and subsequent checks match both the pool name and the pool GUID. It returns an array of pool configuration data (one entry per matching provider). spa_generate_rootconf() scans the array of configs and selects the one with the highest TXG. That config is converted into the wanted vdev descriptor. vdev_geom_open() uses the pool and vdev GUIDs from that vdev descriptor to locate and open the root pool (first trying the pathname by which the pool was last mounted, then scanning all GEOM providers).

In my case, I rebuilt my root pool a month or two ago to increase the available swap space, and tweaked a few other partition parameters. As a result, my root pool no longer ends right at the end of the disk (that space is currently free). One consequence is that tasting ada0 as a whole finds one (of the 4) labels belonging to my previous root pool (which has the same name as my current root pool). So, if ada0 is tasted before ada0p3 during the translation step, the kernel will go looking for my old root pool, which no longer exists, and vdev_geom_open() (and hence the entire mountroot process) will fail. And it turns out that the order of GEOM providers isn't fixed but seems to change randomly depending on which devices are present in the kernel (even if they aren't probed). So, some combinations of kernel devices result in ada0 coming before ada0p3 in the GEOM provider list, and those fail to boot.

I'm not sure if this is a scenario that the boot code should explicitly handle - there are a variety of ways vdev_geom_read_pool_label() could handle this situation:
* Check against vfs.zfs.boot.primary_pool (though this would impact the ability to use the manual mountroot process to recover).
* Where multiple pool GUIDs are found, look for a "best" pool by considering the number of labels found (vdev_geom_read_config() reports this) and maybe other sanity checks.
At the very least, I believe this scenario should be documented (since it took me quite a while to debug).

-- 
Peter Jeremy