Date: Wed, 18 May 2022 15:03:17 -0400
From: Mark Johnston
To: freebsd-hackers@freebsd.org
Subject: zfs support in makefs
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
Hi,

For the past little while I've been working on ZFS support in makefs(8).
At this point I'm able to create a bootable FreeBSD VM image, using the
standard FreeBSD ZFS layout, and run through the regression test suite
in bhyve. I've also been able to create and boot an EC2 AMI.

Some background is below for anyone interested, and I would greatly
appreciate feedback on the interface, described further below.

The initial diff is here: https://reviews.freebsd.org/D35248
Comments here or in the review are welcome.

=== Background ===

The goal is to enable creation of ZFS-based VM images, in particular by
release(7). Currently one can implement this by creating a pool on a
file-backed memory disk and populating it with "make installworld", but
this has a few drawbacks:

1. The resulting images are not reproducible. That is, if one creates
   two ZFS images with identical contents, the images themselves will
   not be byte-identical. For instance, each pool gets a randomly
   generated GUID, as does each vdev, and there are other sources of
   non-determinism besides.
2. Creating a zpool requires root privileges by default and can't be
   done at all in a jail.
3. Populating the image is a resource-intensive operation: the kernel
   will cache the output files until the pool is exported, etc.

For UFS images we use makefs to solve these problems, so I wanted to try
and take the same approach for ZFS. I assume that the appeal of using
ZFS as the root filesystem for VMs is obvious.
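As an aside on drawback 1, byte-for-byte reproducibility is easy to
check once two images exist. The following is only a sketch:
same_image is a hypothetical helper name, and the two file arguments
are assumed to be images built separately from the same input tree.
It relies on nothing beyond cmp(1).

```shell
#!/bin/sh
# Sketch: report whether two image files are byte-identical.
# same_image is a hypothetical helper; cmp -s compares silently and
# signals the result through its exit status.
same_image() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}
```

For example, "same_image build1/zfs.img build2/zfs.img" (hypothetical
paths) would be expected to print "different" for two pools created
through the memory-disk route, since the GUIDs alone differ.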
I initially implemented ZFS support in makefs using libzpool.so, which
is effectively a copy of the OpenZFS kernel code compiled for userspace.
It is mostly used for testing and debugging. This worked and was
relatively simple to implement, but it only solved problem 2. Bending
libzpool to satisfy my requirements seemed difficult, and the result
would require continuous maintenance as OpenZFS evolves and its internal
interfaces change. I spent some time hacking libzpool to limit its
memory and CPU usage and gave up; while it was functional, the result
was painfully slow.

I then looked at the bits used by the loader to load files off of a boot
volume, and implemented the creation of ZFS images from scratch, i.e.,
without reusing OpenZFS code. This required more effort but I believe
it'll be easier to maintain in the long run, and it solves all three
problems above. The implementation is mostly derived from an old ZFS
on-disk format specification
(http://www.giis.co.in/Zfs_ondiskformat.pdf), various blog posts, and
lots of time spent staring at zdb output. I reused some code from the
boot loader: the nvlist implementation, since the one in sys/contrib
doesn't have some required features, and zfsimpl.h, which contains C
structs describing various on-disk data structures.

ZFS in general is pretty complex so this effort required some
specialization to the problem at hand. In particular, makefs

- always creates a pool with a single disk vdev with all data written in
  a single transaction group; there are no snapshots, no RAID-Z/dRAID,
  no redundant block copies, no ZIL, no encryption, no gang blocks, no
  zvols, etc.,
- does not implement compression,
- doesn't preserve holes in files,
- always creates pools at version 5000, i.e., all feature flags are off
  and have to be enabled separately,
- does not try to do any clever metaslab placement or sizing, on the
  basis that the pool will likely be expanded upon first boot anyway,
- doesn't use spill blocks and is not particularly clever when it comes
  to choosing block sizes, creating some avoidable internal
  fragmentation (though it doesn't seem too bad relative to OpenZFS
  without compression, maybe 10% overhead in some unscientific tests).

Some of these can be addressed (especially compression and sparse file
support), but I wanted to get some feedback before spending more time on
this. Really this thing is just intended to do the minimum necessary to
provide ZFS-based VM images.

=== Interface ===

Creating a pool with a single dataset is easy:

    $ makefs -t zfs -s 10g -o poolname=test ./zfs.img /path/to/input

Upon importing such a pool, you'll get a dataset named "test" mounted at
/test containing everything under /path/to/input.

It's possible to set properties on the root dataset:

    $ makefs -t zfs -s 10g -o poolname=test -o fs=test:setuid=off:atime=on ./zfs.img /path/to/input

It's also possible to create additional datasets:

    $ makefs -t zfs -s 10g -o poolname=test -o fs=test/ds1:mountpoint=/test/dir1 ./zfs.img /path/to/input

The parameter syntax is
"-o fs=<dataset>[:<prop>=<value>[:<prop>=<value>[:...]]]". Only a few
properties are supported, at least for now.

Dataset mountpoints behave the same as they would if created with the
standard ZFS tools. So by default the root dataset's mountpoint is
/test, test/ds1's mountpoint is /test/ds1, etc. If a dataset overrides
its default mountpoint, its children inherit that mountpoint.

makefs builds the output filesystem using a single input directory tree.
Thus, makefs -t zfs requires that at least one of the datasets'
mountpoints map to /path/to/input; that is, there is a "root" mount
point.
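The mountpoint inheritance rule described above can be sketched in a
few lines of shell. This is not makefs code, just an illustration of
the rule as stated: child_mountpoint is a hypothetical helper that
takes the parent's effective mountpoint and the child's dataset name,
and it ignores special values such as mountpoint=none.

```shell
#!/bin/sh
# Sketch of default mountpoint inheritance; not taken from makefs.
# child_mountpoint is a hypothetical helper:
#   $1 = effective mountpoint of the parent dataset
#   $2 = full name of the child dataset (e.g. test/ds1)
child_mountpoint() {
    parent_mp=$1
    name=${2##*/}                  # last component of the dataset name
    case $parent_mp in
        /) echo "/$name" ;;        # avoid a double slash under the root
        *) echo "$parent_mp/$name" ;;
    esac
}

child_mountpoint /test test/ds1        # prints /test/ds1
child_mountpoint /usr zroot/usr/home   # prints /usr/home
```

The second call shows the override case: once zroot/usr sets
mountpoint=/usr, its child is placed under /usr rather than under a
path derived from the pool name.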
The -o rootpath parameter defines this root mount point. By default it's
"/<poolname>". All datasets in the pool must have their mountpoints
under this path, and one dataset's mountpoint must be equal to this
path. To build bootable images, one sets -o rootpath=/.

Putting it all together, one can build an image using the standard
layout with an invocation like this:

    makefs -t zfs -o poolname=zroot -s 20g -o rootpath=/ -o bootfs=zroot/ROOT/default \
        -o fs=zroot:canmount=off:mountpoint=none \
        -o fs=zroot/ROOT:mountpoint=none \
        -o fs=zroot/ROOT/default:mountpoint=/ \
        -o fs=zroot/tmp:mountpoint=/tmp:exec=on:setuid=off \
        -o fs=zroot/usr:mountpoint=/usr:canmount=off \
        -o fs=zroot/usr/home \
        -o fs=zroot/usr/ports:setuid=off \
        -o fs=zroot/usr/src \
        -o fs=zroot/usr/obj \
        -o fs=zroot/var:mountpoint=/var:canmount=off \
        -o fs=zroot/var/audit:setuid=off:exec=off \
        -o fs=zroot/var/crash:setuid=off:exec=off \
        -o fs=zroot/var/log:setuid=off:exec=off \
        -o fs=zroot/var/mail:atime=on \
        -o fs=zroot/var/tmp:setuid=off \
        ${HOME}/tmp/zfs.img ${HOME}/tmp/world

I'll admit this is somewhat clunky, but it doesn't seem worse than what
we have to do otherwise; see poudriere-image for example:
https://github.com/freebsd/poudriere/blob/master/src/share/poudriere/image_zfs.sh#L79

What do folks think of this interface? Is there anything missing, or
anything that doesn't make sense?