Date:      Fri, 20 Nov 2009 16:46:54 -0800
From:      Matt Reimer <mattjreimer@gmail.com>
To:        fs@freebsd.org
Subject:   Current gptzfsboot limitations
Message-ID:  <f383264b0911201646s702c8aa4u5e50a71f93a9e4eb@mail.gmail.com>

--00504502b672cac4d00478d6ed9a
Content-Type: text/plain; charset=ISO-8859-1

I've been analyzing gptzfsboot to see what its limitations are. I
think it should now work fine for a healthy pool with any number of
disks, with any type of vdev, whether single disk, stripe, mirror,
raidz or raidz2.

But there are currently several limitations (likely in loader.zfs
too), mostly due to the limited amount of memory available (< 640KB)
and the simple memory allocators used (a simple malloc() and
zfs_alloc_temp()).

1. gptzfsboot might fail to read compressed files on raidz/raidz2
pools. The reason is that the temporary buffer used for I/O
(zfs_temp_buf in zfsimpl.c) is 128KB by default, but a 128KB
compressed block will require a 128KB buffer to be allocated before
the I/O is done, leaving nothing for the raidz code further on. The
fix would be to make the temporary buffer larger, but for some
reason it's not as simple as just changing the TEMP_SIZE define
(possibly a stack overflow results; more debugging needed).
Workaround: don't enable compression on your root filesystem (aka
bootfs).

2. gptzfsboot might fail to reconstruct a file that is read from a
degraded raidz/raidz2 pool, or if the file is corrupt somehow (i.e.
the pool is healthy but the checksums don't match). The reason again
is that the temporary buffer gets exhausted. I think this will only
happen in the case where more than one physical block is corrupt or
unreadable. The fix has several aspects: 1) make the temporary buffer
much larger, perhaps larger than 640KB; 2) change
zfssubr.c:vdev_raidz_read() to reuse the temp buffers it allocates
when possible; and 3) either restructure
zfssubr.c:vdev_raidz_reconstruct_pq() to only allocate its temporary
buffers once per I/O, or use a malloc that has free() implemented.
Workaround: repair your pool some other way (e.g. after booting via
pxeboot) so that at most one disk is bad.

3. gptzfsboot might fail to boot from a degraded pool that has one or
more drives marked offline, removed, or faulted. The reason is that
vdev_probe() assumes that all vdevs are healthy, regardless of their
true state. gptzfsboot will then read from an offline/removed/faulted
vdev as if it were healthy, which likely produces failed checksums
and sends vdev_raidz_read() down its recovery code path, possibly
exhausting zfs_temp_buf as in #2 above.
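
To make the exhaustion in #1 and #2 concrete, here is a minimal sketch
of a bump allocator over a fixed 128KB arena. It is not the code in
zfsimpl.c: temp_alloc(), temp_mark() and temp_release() are
hypothetical names, and returning NULL on exhaustion is an assumption
made for illustration. It shows how one 128KB request leaves nothing
for later callers, and how releasing back to a saved mark would let
buffers be reused.

```c
#include <assert.h>
#include <stddef.h>

#define TEMP_SIZE (128 * 1024)	/* assumed to mirror the 128KB default */

static char temp_buf[TEMP_SIZE];	/* the whole temporary arena */
static size_t temp_used;		/* bytes handed out so far */

/* Bump allocation: hand out the next 'size' bytes, or NULL if full. */
static void *
temp_alloc(size_t size)
{
	void *p;

	if (size > TEMP_SIZE - temp_used)
		return (NULL);
	p = temp_buf + temp_used;
	temp_used += size;
	return (p);
}

/* Remember the current high-water mark. */
static size_t
temp_mark(void)
{
	return (temp_used);
}

/* Release everything allocated since 'mark' was taken. */
static void
temp_release(size_t mark)
{
	assert(mark <= temp_used);
	temp_used = mark;
}
```

With this sketch, a caller that takes a mark, allocates a full 128KB
block, and then asks for even 512 more bytes gets NULL; after
temp_release() on the saved mark the same 512-byte request succeeds.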

A partial patch for #3 is attached, but it is inadequate because it
reads a vdev's status only from the vdev_label of the first device
(in BIOS order). If the first device itself is marked offline,
gptzfsboot won't see this, because only the other devices'
vdev_labels record that the first device is offline. (Once a device
is offlined no further writes are made to it, so its own vdev_label
is never updated to reflect that it's offline.) To complete the patch
it would be necessary to set each leaf vdev's status from the newest
vdev_label rather than from the first vdev_label seen.
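
The "newest label wins" rule could look roughly like the sketch below.
It assumes that "newest" can be decided by comparing a transaction
group number (txg) read from each label; struct leaf, enum leaf_state
and leaf_apply_label() are hypothetical names for illustration, not
anything in the attached patch or in zfsimpl.c.

```c
#include <assert.h>
#include <stdint.h>

enum leaf_state { LEAF_HEALTHY, LEAF_OFFLINE, LEAF_REMOVED, LEAF_FAULTED };

struct leaf {
	uint64_t guid;		/* identifies the leaf vdev */
	uint64_t seen_txg;	/* txg of the label that last set 'state' */
	int have_state;		/* 0 until some label has been applied */
	enum leaf_state state;
};

/*
 * Apply the state that one label reports for this leaf, but only if
 * that label is newer than the one previously applied.
 * Returns 1 if the state was updated, 0 if the label was ignored.
 */
static int
leaf_apply_label(struct leaf *l, uint64_t label_txg, enum leaf_state reported)
{
	if (l->have_state && label_txg <= l->seen_txg)
		return (0);
	l->state = reported;
	l->seen_txg = label_txg;
	l->have_state = 1;
	return (1);
}
```

The order in which labels are probed then no longer matters: a stale
label on the offlined device that still says "healthy" is overridden
by a newer label from another device, and is ignored if it is seen
afterwards.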

I think I've also hit a stack overflow a couple of times while debugging.

I don't know enough about the gptzfsboot/loader.zfs environment to
know whether the heap size could be easily enlarged, or whether there
is room for a real malloc() with free(). loader(8) seems to use the
malloc() in libstand. Can anyone shed some light on the memory
limitations and possible solutions?

I won't be able to spend much more time on this, but I wanted to pass
on what I've learned in case someone else has the time and boot fu to
take it to the next step.

Matt

--00504502b672cac4d00478d6ed9a
Content-Type: application/octet-stream; name="zfsboot-status.patch"
Content-Disposition: attachment; filename="zfsboot-status.patch"
Content-Transfer-Encoding: base64
X-Attachment-Id: f_g29nt7rn0

LS0tIHpmcy96ZnNpbXBsLmMub3JpZwkyMDA5LTEwLTI0IDE4OjEwOjI5LjAwMDAwMDAwMCAtMDcw
MAorKysgemZzL3pmc2ltcGwuYwkyMDA5LTExLTIwIDE2OjQ0OjQ5LjAwMDAwMDAwMCAtMDgwMApA
QCAtMzk2LDYgKzM5Niw3IEBACiAJdmRldi0+dl9yZWFkID0gcmVhZDsKIAl2ZGV2LT52X3BoeXNf
cmVhZCA9IDA7CiAJdmRldi0+dl9yZWFkX3ByaXYgPSAwOworCXZkZXYtPnZfaW5pdGVkID0gMDsK
IAlTVEFJTFFfSU5TRVJUX1RBSUwoJnpmc192ZGV2cywgdmRldiwgdl9hbGxsaW5rKTsKIAogCXJl
dHVybiAodmRldik7CkBAIC00MTEsNiArNDEyLDcgQEAKIAl2ZGV2X3QgKnZkZXYsICpraWQ7CiAJ
Y29uc3QgdW5zaWduZWQgY2hhciAqa2lkczsKIAlpbnQgbmtpZHMsIGk7CisJdWludDY0X3QgaXNf
b2ZmbGluZSwgaXNfZmF1bHRlZCwgaXNfZGVncmFkZWQsIGlzX3JlbW92ZWQ7CiAKIAlpZiAobnZs
aXN0X2ZpbmQobnZsaXN0LCBaUE9PTF9DT05GSUdfR1VJRCwKIAkJCURBVEFfVFlQRV9VSU5UNjQs
IDAsICZndWlkKQpAQCAtNDc4LDYgKzQ4MCwyNiBAQAogCQkJdmRldi0+dl9uYW1lID0gc3RyZHVw
KHR5cGUpOwogCQl9CiAJfQorCisJaWYgKCFudmxpc3RfZmluZChudmxpc3QsIFpQT09MX0NPTkZJ
R19PRkZMSU5FLAorCQkJIERBVEFfVFlQRV9VSU5UNjQsIDAsICZpc19vZmZsaW5lKSAmJgorCQkJ
IGlzX29mZmxpbmUpIHsKKwkJdmRldi0+dl9zdGF0ZSA9IFZERVZfU1RBVEVfT0ZGTElORTsKKwl9
IGVsc2UgaWYgKCFudmxpc3RfZmluZChudmxpc3QsIFpQT09MX0NPTkZJR19SRU1PVkVELAorCQkJ
CURBVEFfVFlQRV9VSU5UNjQsIDAsICZpc19yZW1vdmVkKSAmJgorCQkJCWlzX3JlbW92ZWQpIHsK
KwkJdmRldi0+dl9zdGF0ZSA9IFZERVZfU1RBVEVfUkVNT1ZFRDsKKwl9IGVsc2UgaWYgKCFudmxp
c3RfZmluZChudmxpc3QsIFpQT09MX0NPTkZJR19GQVVMVEVELAorCQkJCURBVEFfVFlQRV9VSU5U
NjQsIDAsICZpc19mYXVsdGVkKSAmJgorCQkJCWlzX2ZhdWx0ZWQpIHsKKwkJdmRldi0+dl9zdGF0
ZSA9IFZERVZfU1RBVEVfRkFVTFRFRDsKKwl9IGVsc2UgaWYgKCFudmxpc3RfZmluZChudmxpc3Qs
IFpQT09MX0NPTkZJR19ERUdSQURFRCwKKwkJCQlEQVRBX1RZUEVfVUlOVDY0LCAwLCAmaXNfZGVn
cmFkZWQpICYmCisJCQkJaXNfZGVncmFkZWQpIHsKKwkJdmRldi0+dl9zdGF0ZSA9IFZERVZfU1RB
VEVfREVHUkFERUQ7CisJfSBlbHNlCisJCXZkZXYtPnZfc3RhdGUgPSBWREVWX1NUQVRFX0hFQUxU
SFk7CisKIAlyYyA9IG52bGlzdF9maW5kKG52bGlzdCwgWlBPT0xfQ09ORklHX0NISUxEUkVOLAog
CQkJIERBVEFfVFlQRV9OVkxJU1RfQVJSQVksICZua2lkcywgJmtpZHMpOwogCS8qCkBAIC01OTEs
NyArNjEzLDkgQEAKIAkJIlVOS05PV04iLAogCQkiQ0xPU0VEIiwKIAkJIk9GRkxJTkUiLAorCQki
UkVNT1ZFRCIsCiAJCSJDQU5UX09QRU4iLAorCQkiRkFVTFRFRCIsCiAJCSJERUdSQURFRCIsCiAJ
CSJPTkxJTkUiCiAJfTsKQEAgLTgwNiw3ICs4MzAsNyBAQAogCQlyZXR1cm4gKEVJTyk7CiAJfQog
CXZkZXYgPSB2ZGV2X2ZpbmQoZ3VpZCk7Ci0JaWYgKHZkZXYgJiYgdmRldi0+dl9zdGF0ZSA9PSBW
REVWX1NUQVRFX0hFQUxUSFkpIHsKKwlpZiAodmRldiAmJiB2ZGV2LT52X2luaXRlZCkgewogCQly
ZXR1cm4gKEVJTyk7CiAJfQogCkBAIC04MzYsNyArODYwLDcgQEAKIAlpZiAodmRldikgewogCQl2
ZGV2LT52X3BoeXNfcmVhZCA9IHJlYWQ7CiAJCXZkZXYtPnZfcmVhZF9wcml2ID0gcmVhZF9wcml2
OwotCQl2ZGV2LT52X3N0YXRlID0gVkRFVl9TVEFURV9IRUFMVEhZOworCQl2ZGV2LT52X2luaXRl
ZCA9IDE7CiAJfSBlbHNlIHsKIAkJcHJpbnRmKCJaRlM6IGluY29uc2lzdGVudCBudmxpc3QgY29u
dGVudHNcbiIpOwogCQlyZXR1cm4gKEVJTyk7Ci0tLSB6ZnNpbXBsLmgub3JpZwkyMDA5LTA1LTE2
IDAzOjQ4OjIwLjAwMDAwMDAwMCAtMDcwMAorKysgemZzaW1wbC5oCTIwMDktMTEtMTMgMTc6MzI6
MDYuMDAwMDAwMDAwIC0wODAwCkBAIC01MjgsNyArNTI4LDYgQEAKICNkZWZpbmUJWlBPT0xfQ09O
RklHX0RUTAkJIkRUTCIKICNkZWZpbmUJWlBPT0xfQ09ORklHX1NUQVRTCQkic3RhdHMiCiAjZGVm
aW5lCVpQT09MX0NPTkZJR19XSE9MRV9ESVNLCQkid2hvbGVfZGlzayIKLSNkZWZpbmUJWlBPT0xf
Q09ORklHX09GRkxJTkUJCSJvZmZsaW5lIgogI2RlZmluZQlaUE9PTF9DT05GSUdfRVJSQ09VTlQJ
CSJlcnJvcl9jb3VudCIKICNkZWZpbmUJWlBPT0xfQ09ORklHX05PVF9QUkVTRU5UCSJub3RfcHJl
c2VudCIKICNkZWZpbmUJWlBPT0xfQ09ORklHX1NQQVJFUwkJInNwYXJlcyIKQEAgLTUzOCw2ICs1
MzcsMTYgQEAKICNkZWZpbmUJWlBPT0xfQ09ORklHX0hPU1ROQU1FCQkiaG9zdG5hbWUiCiAjZGVm
aW5lCVpQT09MX0NPTkZJR19USU1FU1RBTVAJCSJ0aW1lc3RhbXAiIC8qIG5vdCBzdG9yZWQgb24g
ZGlzayAqLwogCisvKgorICogVGhlIHBlcnNpc3RlbnQgdmRldiBzdGF0ZSBpcyBzdG9yZWQgYXMg
c2VwYXJhdGUgdmFsdWVzIHJhdGhlciB0aGFuIGEgc2luZ2xlCisgKiAndmRldl9zdGF0ZScgZW50
cnkuICBUaGlzIGlzIGJlY2F1c2UgYSBkZXZpY2UgY2FuIGJlIGluIG11bHRpcGxlIHN0YXRlcywg
c3VjaAorICogYXMgb2ZmbGluZSBhbmQgZGVncmFkZWQuCisgKi8KKyNkZWZpbmUgWlBPT0xfQ09O
RklHX09GRkxJTkUgICAgICAgICAgICAib2ZmbGluZSIKKyNkZWZpbmUgWlBPT0xfQ09ORklHX0ZB
VUxURUQgICAgICAgICAgICAiZmF1bHRlZCIKKyNkZWZpbmUgWlBPT0xfQ09ORklHX0RFR1JBREVE
ICAgICAgICAgICAiZGVncmFkZWQiCisjZGVmaW5lIFpQT09MX0NPTkZJR19SRU1PVkVEICAgICAg
ICAgICAgInJlbW92ZWQiCisKICNkZWZpbmUJVkRFVl9UWVBFX1JPT1QJCQkicm9vdCIKICNkZWZp
bmUJVkRFVl9UWVBFX01JUlJPUgkJIm1pcnJvciIKICNkZWZpbmUJVkRFVl9UWVBFX1JFUExBQ0lO
RwkJInJlcGxhY2luZyIKQEAgLTU3MCw3ICs1NzksOSBAQAogCVZERVZfU1RBVEVfVU5LTk9XTiA9
IDAsCS8qIFVuaW5pdGlhbGl6ZWQgdmRldgkJCSovCiAJVkRFVl9TVEFURV9DTE9TRUQsCS8qIE5v
dCBjdXJyZW50bHkgb3BlbgkJCSovCiAJVkRFVl9TVEFURV9PRkZMSU5FLAkvKiBOb3QgYWxsb3dl
ZCB0byBvcGVuCQkJKi8KKyAgICAgICAgVkRFVl9TVEFURV9SRU1PVkVELAkvKiBFeHBsaWNpdGx5
IHJlbW92ZWQgZnJvbSBzeXN0ZW0JKi8KIAlWREVWX1NUQVRFX0NBTlRfT1BFTiwJLyogVHJpZWQg
dG8gb3BlbiwgYnV0IGZhaWxlZAkJKi8KKyAgICAgICAgVkRFVl9TVEFURV9GQVVMVEVELAkvKiBF
eHRlcm5hbCByZXF1ZXN0IHRvIGZhdWx0IGRldmljZQkqLwogCVZERVZfU1RBVEVfREVHUkFERUQs
CS8qIFJlcGxpY2F0ZWQgdmRldiB3aXRoIHVuaGVhbHRoeSBraWRzCSovCiAJVkRFVl9TVEFURV9I
RUFMVEhZCS8qIFByZXN1bWVkIGdvb2QJCQkqLwogfSB2ZGV2X3N0YXRlX3Q7CkBAIC0xMTU4LDYg
KzExNjksNyBAQAogCXZkZXZfcGh5c19yZWFkX3QgKnZfcGh5c19yZWFkOwkvKiByZWFkIGZyb20g
cmF3IGxlYWYgdmRldiAqLwogCXZkZXZfcmVhZF90CSp2X3JlYWQ7CS8qIHJlYWQgZnJvbSB2ZGV2
ICovCiAJdm9pZAkJKnZfcmVhZF9wcml2OwkvKiBwcml2YXRlIGRhdGEgZm9yIHJlYWQgZnVuY3Rp
b24gKi8KKwlpbnQJCXZfaW5pdGVkOwogfSB2ZGV2X3Q7CiAKIC8qCg==
--00504502b672cac4d00478d6ed9a--