Date: Wed, 18 May 2022 15:03:17 -0400
From: Mark Johnston <markj@freebsd.org>
To: freebsd-hackers@freebsd.org
Subject: zfs support in makefs
Message-ID: <YoVC9VgV1nTptjzx@nuc>

Hi,

For the past little while I've been working on ZFS support in
makefs(8). At this point I'm able to create a bootable FreeBSD VM image,
using the standard FreeBSD ZFS layout, and run through the regression
test suite in bhyve. I've also been able to create and boot an EC2 AMI.

Some background is below for anyone interested, and I would greatly
appreciate feedback on the interface, described further below.

The initial diff is here: https://reviews.freebsd.org/D35248

Comments here or in the review are welcome.

=== Background ===

The goal is to enable creation of ZFS-based VM images, in particular by
release(7). Currently one can implement this by creating a pool on a
file-backed memory disk and populating it with "make installworld", but
this has a few drawbacks:

1. The resulting images are not reproducible. That is, if one creates
   two ZFS images with identical contents, the images themselves will
   not be byte-identical. For instance, each pool gets a randomly
   generated GUID, as does each vdev, and there are other sources of
   non-determinism besides.
2. Creating a zpool requires root privileges by default and can't be
   done at all in a jail.
3. Populating the image is a resource-intensive operation, the kernel
   will cache the output files until the pool is exported, etc.

For UFS images we use makefs to solve these problems, so I wanted to try
and take the same approach for ZFS. I assume that the appeal of using
ZFS as the root filesystem for VMs is obvious.

I initially implemented ZFS support in makefs using libzpool.so, which
is effectively a copy of the OpenZFS kernel code compiled for userspace.
It is mostly used for testing and debugging. This worked and was
relatively simple to implement, but it only solved problem 2. Bending
libzpool to satisfy my requirements seemed difficult, and the result
would require continuous maintenance as OpenZFS evolves and its internal
interfaces change.
I spent some time hacking libzpool to limit its memory and CPU usage and
gave up; while it was functional, the result was painfully slow.

I then looked at the bits used by the loader to load files off of a boot
volume, and implemented the creation of ZFS images from scratch, i.e.,
without reusing OpenZFS code. This required more effort but I believe
it'll be easier to maintain in the long run, and it solves all three
problems above.

The implementation is mostly derived from an old ZFS on-disk format
specification (http://www.giis.co.in/Zfs_ondiskformat.pdf), various blog
posts, and lots of time spent staring at zdb output. I reused some code
from the boot loader: the nvlist implementation, since the one in
sys/contrib doesn't have some required features, and zfsimpl.h, which
contains C structs describing various on-disk data structures.

ZFS in general is pretty complex so this effort required some
specialization to the problem at hand. In particular, makefs

- always creates a pool with a single disk vdev with all data written in
  a single transaction group; there's no snapshots, no RAID-Z/dRAID, no
  redundant block copies, no ZIL, no encryption, no gang blocks, no
  zvol, etc.
- does not implement compression,
- doesn't preserve holes in files,
- always creates pools at version 5000, i.e., all feature flags are off
  and have to be enabled separately,
- does not try to do any clever metaslab placement or sizing, on the
  basis that the pool will likely be expanded upon first boot anyway,
- doesn't use spill blocks and is not particularly clever when it comes
  to choosing block sizes, creating some avoidable internal
  fragmentation (though it doesn't seem too bad relative to OpenZFS
  without compression, maybe 10% overhead in some unscientific tests)

Some of these can be addressed (especially compression and sparse file
support), but I wanted to get some feedback before spending more time on
this.
Really this thing is just intended to do the minimum necessary to
provide ZFS-based VM images.

=== Interface ===

Creating a pool with a single dataset is easy:

$ makefs -t zfs -s 10g -o poolname=test ./zfs.img /path/to/input

Upon importing such a pool, you'll get a dataset named "test" mounted at
/test containing everything under /path/to/input.

It's possible to set properties on the root dataset:

$ makefs -t zfs -s 10g -o poolname=test -o fs=test:setuid=off:atime=on ./zfs.img /path/to/input

It's also possible to create additional datasets:

$ makefs -t zfs -s 10g -o poolname=test -o fs=test/ds1:mountpoint=/test/dir1 ./zfs.img /path/to/input

The parameter syntax is
"-o fs=<dataset name>[:<prop1>=<val1>[:<prop2>=<val2>[:...]]]". Only a
few properties are supported, at least for now.

Dataset mountpoints behave the same as they would if created with the
standard ZFS tools. So by default the root dataset's mountpoint is
/test, test/ds1's mountpoint is /test/ds1, etc. If a dataset overrides
its default mountpoint, its children inherit that mountpoint.

makefs builds the output filesystem using a single input directory tree.
Thus, makefs -t zfs requires that at least one of the datasets'
mountpoints map to /path/to/input; that is, there is a "root" mount
point.

The -o rootpath parameter defines this root mount point. By default it's
"/<poolname>". All datasets in the pool must have their mountpoints
under this path, and one dataset's mountpoint must be equal to this
path. To build bootable images, one sets -o rootpath=/.
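
As an illustration only, the "-o fs=" syntax and the mountpoint rules
above can be sketched in Python. This is not derived from the makefs
code in the review; the function names (parse_fs_option,
resolve_mountpoints, check_rootpath) are hypothetical, and the sketch
assumes that a dataset with mountpoint=none passes "none" on to its
children and is exempt from the rootpath check.

```python
def parse_fs_option(value):
    """Split '<dataset>[:<prop>=<val>...]' into (name, props)."""
    name, *pairs = value.split(":")
    props = {}
    for pair in pairs:
        key, sep, val = pair.partition("=")
        if not sep:
            raise ValueError(f"malformed property {pair!r}")
        props[key] = val
    return name, props


def resolve_mountpoints(poolname, fs_options):
    """Map each dataset name to its effective mountpoint (None = 'none')."""
    datasets = {poolname: {}}  # the root dataset always exists
    for opt in fs_options:
        name, props = parse_fs_option(opt)
        datasets.setdefault(name, {}).update(props)

    def mountpoint(name):
        props = datasets.get(name, {})
        if "mountpoint" in props:
            mp = props["mountpoint"]
            return None if mp == "none" else mp
        if "/" not in name:
            return "/" + name  # default for the root dataset
        parent, _, leaf = name.rpartition("/")
        base = mountpoint(parent)  # children inherit from the parent
        # Assumption: children of a 'none' dataset are also unmounted.
        return None if base is None else base.rstrip("/") + "/" + leaf

    return {name: mountpoint(name) for name in datasets}


def check_rootpath(mountpoints, rootpath):
    """Enforce the two rootpath rules on the resolved mountpoints."""
    # Assumption: datasets without a mountpoint are not checked.
    mounted = [mp for mp in mountpoints.values() if mp is not None]
    if rootpath not in mounted:
        raise ValueError(f"no dataset is mounted at {rootpath}")
    prefix = rootpath.rstrip("/") + "/"
    for mp in mounted:
        if mp != rootpath and not mp.startswith(prefix):
            raise ValueError(f"{mp} is not under {rootpath}")


if __name__ == "__main__":
    mps = resolve_mountpoints(
        "test", ["test/ds1:mountpoint=/test/dir1", "test/ds1/child"])
    check_rootpath(mps, "/test")  # rootpath defaults to /<poolname>
    for name, mp in mps.items():
        print(name, "->", mp)
```

Here test/ds1/child has no mountpoint of its own, so it resolves to
/test/dir1/child, following the override on test/ds1.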

Putting it all together, one can build an image using the standard
layout with an invocation like this:

makefs -t zfs -o poolname=zroot -s 20g -o rootpath=/ -o bootfs=zroot/ROOT/default \
    -o fs=zroot:canmount=off:mountpoint=none \
    -o fs=zroot/ROOT:mountpoint=none \
    -o fs=zroot/ROOT/default:mountpoint=/ \
    -o fs=zroot/tmp:mountpoint=/tmp:exec=on:setuid=off \
    -o fs=zroot/usr:mountpoint=/usr:canmount=off \
    -o fs=zroot/usr/home \
    -o fs=zroot/usr/ports:setuid=off \
    -o fs=zroot/usr/src \
    -o fs=zroot/usr/obj \
    -o fs=zroot/var:mountpoint=/var:canmount=off \
    -o fs=zroot/var/audit:setuid=off:exec=off \
    -o fs=zroot/var/crash:setuid=off:exec=off \
    -o fs=zroot/var/log:setuid=off:exec=off \
    -o fs=zroot/var/mail:atime=on \
    -o fs=zroot/var/tmp:setuid=off \
    ${HOME}/tmp/zfs.img ${HOME}/tmp/world

I'll admit this is somewhat clunky, but it doesn't seem worse than what
we have to do otherwise; see poudriere-image for example:
https://github.com/freebsd/poudriere/blob/master/src/share/poudriere/image_zfs.sh#L79

What do folks think of this interface? Is there anything missing, or
anything that doesn't make sense?