From owner-freebsd-fs@FreeBSD.ORG Sat Nov 21 00:46:55 2009
Date: Fri, 20 Nov 2009 16:46:54 -0800
From: Matt Reimer
To: fs@freebsd.org
Subject: Current gptzfsboot limitations
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=00504502b672cac4d00478d6ed9a
List-Id: Filesystems

--00504502b672cac4d00478d6ed9a
Content-Type: text/plain; charset=ISO-8859-1

I've been analyzing gptzfsboot to see what its limitations are.
I think it should now work fine for a healthy pool with any number of disks and any type of vdev, whether single disk, stripe, mirror, raidz or raidz2.

But there are currently several limitations (likely in loader.zfs too), mostly due to the limited amount of memory available (< 640KB) and the simple memory allocators used (a simple malloc() and zfs_alloc_temp()).

1. gptzfsboot might fail to read compressed files on raidz/raidz2 pools. The reason is that the temporary buffer used for I/O (zfs_temp_buf in zfsimpl.c) is 128KB by default, but a 128KB compressed block requires a 128KB buffer to be allocated before the I/O is done, leaving nothing for the raidz code further on. The fix would be to make the temporary buffer larger, but for some reason it's not as simple as just changing the TEMP_SIZE define (possibly a stack overflow results; more debugging needed).

Workaround: don't enable compression on your root filesystem (aka bootfs).

2. gptzfsboot might fail to reconstruct a file that is read from a degraded raidz/raidz2 pool, or if the file is corrupt somehow (i.e. the pool is healthy but the checksums don't match). The reason again is that the temporary buffer gets exhausted. I think this will only happen when more than one physical block is corrupt or unreadable. The fix has several aspects: 1) make the temporary buffer much larger, perhaps larger than 640KB; 2) change zfssubr.c:vdev_raidz_read() to reuse the temp buffers it allocates when possible; and 3) either restructure zfssubr.c:vdev_raidz_reconstruct_pq() to allocate its temporary buffers only once per I/O, or use a malloc that has free() implemented.

Workaround: repair your pool somehow (e.g. after booting via pxeboot) so that at most one disk is bad.

3. gptzfsboot might fail to boot from a degraded pool that has one or more drives marked offline, removed, or faulted. The reason is that vdev_probe() assumes that all vdevs are healthy, regardless of their true state. gptzfsboot will then read from an offline/removed/faulted vdev as if it were healthy. That likely produces checksum failures, which send vdev_raidz_read() down its recovery code path, possibly exhausting zfs_temp_buf as in #2 above.

A partial patch for #3 is attached, but it is inadequate because it reads a vdev's status only from the vdev_label of the first device (in BIOS order). If the first device is the one marked offline, gptzfsboot won't see this, because only the other devices' vdev_labels indicate that the first device is offline. (Once a device is offlined, no further writes are made to it, so its own vdev_label is never updated to reflect that it's offline.) To complete the patch it would be necessary to set each leaf vdev's status from the newest vdev_label rather than from the first vdev_label seen.

I think I've also hit a stack overflow a couple of times while debugging.

I don't know enough about the gptzfsboot/loader.zfs environment to know whether the heap size could be easily enlarged, or whether there is room for a real malloc() with free(). loader(8) seems to use the malloc() in libstand. Can anyone shed some light on the memory limitations and possible solutions?

I won't be able to spend much more time on this, but I wanted to pass on what I've learned in case someone else has the time and boot fu to take it to the next step.
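
To make the missing piece of #3 concrete, here is a minimal sketch of "set each leaf vdev's status from the newest vdev_label". It is not code from zfsimpl.c: struct label_view, struct leaf, MAX_LEAVES, newest_label() and apply_newest_state() are hypothetical names made up for illustration, and it assumes that "newest" can be judged by the highest txg recorded in a label. The enum follows the order used in the attached patch.

```c
#include <stddef.h>
#include <stdint.h>

/* vdev states in the order used by the attached patch. */
typedef enum {
	VDEV_STATE_UNKNOWN = 0,
	VDEV_STATE_CLOSED,
	VDEV_STATE_OFFLINE,
	VDEV_STATE_REMOVED,
	VDEV_STATE_CANT_OPEN,
	VDEV_STATE_FAULTED,
	VDEV_STATE_DEGRADED,
	VDEV_STATE_HEALTHY
} vdev_state_t;

#define MAX_LEAVES 8	/* hypothetical bound, for the sketch only */

/* Hypothetical flattened view of one vdev_label read from one disk. */
struct label_view {
	uint64_t	txg;			/* assumption: higher txg == newer */
	int		nleaves;		/* entries used in guid[]/state[] */
	uint64_t	guid[MAX_LEAVES];	/* leaf vdev guids in this label */
	vdev_state_t	state[MAX_LEAVES];	/* state this label records per leaf */
};

/* Hypothetical stand-in for the leaf vdevs the boot code tracks. */
struct leaf {
	uint64_t	guid;
	vdev_state_t	state;
};

/* Return the label with the highest txg, or NULL if there are none. */
static const struct label_view *
newest_label(const struct label_view *labels, int nlabels)
{
	const struct label_view *best = NULL;

	for (int i = 0; i < nlabels; i++)
		if (best == NULL || labels[i].txg > best->txg)
			best = &labels[i];
	return (best);
}

/*
 * Copy each leaf's state from the newest label instead of from the
 * first label seen. Leaves the newest label does not mention keep
 * their current state. Returns 0 on success, -1 if no label exists.
 */
static int
apply_newest_state(struct leaf *leaves, int nleaves,
    const struct label_view *labels, int nlabels)
{
	const struct label_view *nl = newest_label(labels, nlabels);

	if (nl == NULL)
		return (-1);
	for (int i = 0; i < nleaves; i++)
		for (int j = 0; j < nl->nleaves; j++)
			if (nl->guid[j] == leaves[i].guid)
				leaves[i].state = nl->state[j];
	return (0);
}
```

With two labels, a stale one at txg 100 from the offlined first disk (everything healthy) and a newer one at txg 250 from the second disk (first disk offline), apply_newest_state() marks the first leaf VDEV_STATE_OFFLINE. Whether the boot code can afford to read every label before deciding, given the memory limits above, is a separate question.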
Matt --00504502b672cac4d00478d6ed9a Content-Type: application/octet-stream; name="zfsboot-status.patch" Content-Disposition: attachment; filename="zfsboot-status.patch" Content-Transfer-Encoding: base64 X-Attachment-Id: f_g29nt7rn0 LS0tIHpmcy96ZnNpbXBsLmMub3JpZwkyMDA5LTEwLTI0IDE4OjEwOjI5LjAwMDAwMDAwMCAtMDcw MAorKysgemZzL3pmc2ltcGwuYwkyMDA5LTExLTIwIDE2OjQ0OjQ5LjAwMDAwMDAwMCAtMDgwMApA QCAtMzk2LDYgKzM5Niw3IEBACiAJdmRldi0+dl9yZWFkID0gcmVhZDsKIAl2ZGV2LT52X3BoeXNf cmVhZCA9IDA7CiAJdmRldi0+dl9yZWFkX3ByaXYgPSAwOworCXZkZXYtPnZfaW5pdGVkID0gMDsK IAlTVEFJTFFfSU5TRVJUX1RBSUwoJnpmc192ZGV2cywgdmRldiwgdl9hbGxsaW5rKTsKIAogCXJl dHVybiAodmRldik7CkBAIC00MTEsNiArNDEyLDcgQEAKIAl2ZGV2X3QgKnZkZXYsICpraWQ7CiAJ Y29uc3QgdW5zaWduZWQgY2hhciAqa2lkczsKIAlpbnQgbmtpZHMsIGk7CisJdWludDY0X3QgaXNf b2ZmbGluZSwgaXNfZmF1bHRlZCwgaXNfZGVncmFkZWQsIGlzX3JlbW92ZWQ7CiAKIAlpZiAobnZs aXN0X2ZpbmQobnZsaXN0LCBaUE9PTF9DT05GSUdfR1VJRCwKIAkJCURBVEFfVFlQRV9VSU5UNjQs IDAsICZndWlkKQpAQCAtNDc4LDYgKzQ4MCwyNiBAQAogCQkJdmRldi0+dl9uYW1lID0gc3RyZHVw KHR5cGUpOwogCQl9CiAJfQorCisJaWYgKCFudmxpc3RfZmluZChudmxpc3QsIFpQT09MX0NPTkZJ R19PRkZMSU5FLAorCQkJIERBVEFfVFlQRV9VSU5UNjQsIDAsICZpc19vZmZsaW5lKSAmJgorCQkJ IGlzX29mZmxpbmUpIHsKKwkJdmRldi0+dl9zdGF0ZSA9IFZERVZfU1RBVEVfT0ZGTElORTsKKwl9 IGVsc2UgaWYgKCFudmxpc3RfZmluZChudmxpc3QsIFpQT09MX0NPTkZJR19SRU1PVkVELAorCQkJ CURBVEFfVFlQRV9VSU5UNjQsIDAsICZpc19yZW1vdmVkKSAmJgorCQkJCWlzX3JlbW92ZWQpIHsK KwkJdmRldi0+dl9zdGF0ZSA9IFZERVZfU1RBVEVfUkVNT1ZFRDsKKwl9IGVsc2UgaWYgKCFudmxp c3RfZmluZChudmxpc3QsIFpQT09MX0NPTkZJR19GQVVMVEVELAorCQkJCURBVEFfVFlQRV9VSU5U NjQsIDAsICZpc19mYXVsdGVkKSAmJgorCQkJCWlzX2ZhdWx0ZWQpIHsKKwkJdmRldi0+dl9zdGF0 ZSA9IFZERVZfU1RBVEVfRkFVTFRFRDsKKwl9IGVsc2UgaWYgKCFudmxpc3RfZmluZChudmxpc3Qs IFpQT09MX0NPTkZJR19ERUdSQURFRCwKKwkJCQlEQVRBX1RZUEVfVUlOVDY0LCAwLCAmaXNfZGVn cmFkZWQpICYmCisJCQkJaXNfZGVncmFkZWQpIHsKKwkJdmRldi0+dl9zdGF0ZSA9IFZERVZfU1RB VEVfREVHUkFERUQ7CisJfSBlbHNlCisJCXZkZXYtPnZfc3RhdGUgPSBWREVWX1NUQVRFX0hFQUxU 
SFk7CisKIAlyYyA9IG52bGlzdF9maW5kKG52bGlzdCwgWlBPT0xfQ09ORklHX0NISUxEUkVOLAog CQkJIERBVEFfVFlQRV9OVkxJU1RfQVJSQVksICZua2lkcywgJmtpZHMpOwogCS8qCkBAIC01OTEs NyArNjEzLDkgQEAKIAkJIlVOS05PV04iLAogCQkiQ0xPU0VEIiwKIAkJIk9GRkxJTkUiLAorCQki UkVNT1ZFRCIsCiAJCSJDQU5UX09QRU4iLAorCQkiRkFVTFRFRCIsCiAJCSJERUdSQURFRCIsCiAJ CSJPTkxJTkUiCiAJfTsKQEAgLTgwNiw3ICs4MzAsNyBAQAogCQlyZXR1cm4gKEVJTyk7CiAJfQog CXZkZXYgPSB2ZGV2X2ZpbmQoZ3VpZCk7Ci0JaWYgKHZkZXYgJiYgdmRldi0+dl9zdGF0ZSA9PSBW REVWX1NUQVRFX0hFQUxUSFkpIHsKKwlpZiAodmRldiAmJiB2ZGV2LT52X2luaXRlZCkgewogCQly ZXR1cm4gKEVJTyk7CiAJfQogCkBAIC04MzYsNyArODYwLDcgQEAKIAlpZiAodmRldikgewogCQl2 ZGV2LT52X3BoeXNfcmVhZCA9IHJlYWQ7CiAJCXZkZXYtPnZfcmVhZF9wcml2ID0gcmVhZF9wcml2 OwotCQl2ZGV2LT52X3N0YXRlID0gVkRFVl9TVEFURV9IRUFMVEhZOworCQl2ZGV2LT52X2luaXRl ZCA9IDE7CiAJfSBlbHNlIHsKIAkJcHJpbnRmKCJaRlM6IGluY29uc2lzdGVudCBudmxpc3QgY29u dGVudHNcbiIpOwogCQlyZXR1cm4gKEVJTyk7Ci0tLSB6ZnNpbXBsLmgub3JpZwkyMDA5LTA1LTE2 IDAzOjQ4OjIwLjAwMDAwMDAwMCAtMDcwMAorKysgemZzaW1wbC5oCTIwMDktMTEtMTMgMTc6MzI6 MDYuMDAwMDAwMDAwIC0wODAwCkBAIC01MjgsNyArNTI4LDYgQEAKICNkZWZpbmUJWlBPT0xfQ09O RklHX0RUTAkJIkRUTCIKICNkZWZpbmUJWlBPT0xfQ09ORklHX1NUQVRTCQkic3RhdHMiCiAjZGVm aW5lCVpQT09MX0NPTkZJR19XSE9MRV9ESVNLCQkid2hvbGVfZGlzayIKLSNkZWZpbmUJWlBPT0xf Q09ORklHX09GRkxJTkUJCSJvZmZsaW5lIgogI2RlZmluZQlaUE9PTF9DT05GSUdfRVJSQ09VTlQJ CSJlcnJvcl9jb3VudCIKICNkZWZpbmUJWlBPT0xfQ09ORklHX05PVF9QUkVTRU5UCSJub3RfcHJl c2VudCIKICNkZWZpbmUJWlBPT0xfQ09ORklHX1NQQVJFUwkJInNwYXJlcyIKQEAgLTUzOCw2ICs1 MzcsMTYgQEAKICNkZWZpbmUJWlBPT0xfQ09ORklHX0hPU1ROQU1FCQkiaG9zdG5hbWUiCiAjZGVm aW5lCVpQT09MX0NPTkZJR19USU1FU1RBTVAJCSJ0aW1lc3RhbXAiIC8qIG5vdCBzdG9yZWQgb24g ZGlzayAqLwogCisvKgorICogVGhlIHBlcnNpc3RlbnQgdmRldiBzdGF0ZSBpcyBzdG9yZWQgYXMg c2VwYXJhdGUgdmFsdWVzIHJhdGhlciB0aGFuIGEgc2luZ2xlCisgKiAndmRldl9zdGF0ZScgZW50 cnkuICBUaGlzIGlzIGJlY2F1c2UgYSBkZXZpY2UgY2FuIGJlIGluIG11bHRpcGxlIHN0YXRlcywg c3VjaAorICogYXMgb2ZmbGluZSBhbmQgZGVncmFkZWQuCisgKi8KKyNkZWZpbmUgWlBPT0xfQ09O 
RklHX09GRkxJTkUgICAgICAgICAgICAib2ZmbGluZSIKKyNkZWZpbmUgWlBPT0xfQ09ORklHX0ZB VUxURUQgICAgICAgICAgICAiZmF1bHRlZCIKKyNkZWZpbmUgWlBPT0xfQ09ORklHX0RFR1JBREVE ICAgICAgICAgICAiZGVncmFkZWQiCisjZGVmaW5lIFpQT09MX0NPTkZJR19SRU1PVkVEICAgICAg ICAgICAgInJlbW92ZWQiCisKICNkZWZpbmUJVkRFVl9UWVBFX1JPT1QJCQkicm9vdCIKICNkZWZp bmUJVkRFVl9UWVBFX01JUlJPUgkJIm1pcnJvciIKICNkZWZpbmUJVkRFVl9UWVBFX1JFUExBQ0lO RwkJInJlcGxhY2luZyIKQEAgLTU3MCw3ICs1NzksOSBAQAogCVZERVZfU1RBVEVfVU5LTk9XTiA9 IDAsCS8qIFVuaW5pdGlhbGl6ZWQgdmRldgkJCSovCiAJVkRFVl9TVEFURV9DTE9TRUQsCS8qIE5v dCBjdXJyZW50bHkgb3BlbgkJCSovCiAJVkRFVl9TVEFURV9PRkZMSU5FLAkvKiBOb3QgYWxsb3dl ZCB0byBvcGVuCQkJKi8KKyAgICAgICAgVkRFVl9TVEFURV9SRU1PVkVELAkvKiBFeHBsaWNpdGx5 IHJlbW92ZWQgZnJvbSBzeXN0ZW0JKi8KIAlWREVWX1NUQVRFX0NBTlRfT1BFTiwJLyogVHJpZWQg dG8gb3BlbiwgYnV0IGZhaWxlZAkJKi8KKyAgICAgICAgVkRFVl9TVEFURV9GQVVMVEVELAkvKiBF eHRlcm5hbCByZXF1ZXN0IHRvIGZhdWx0IGRldmljZQkqLwogCVZERVZfU1RBVEVfREVHUkFERUQs CS8qIFJlcGxpY2F0ZWQgdmRldiB3aXRoIHVuaGVhbHRoeSBraWRzCSovCiAJVkRFVl9TVEFURV9I RUFMVEhZCS8qIFByZXN1bWVkIGdvb2QJCQkqLwogfSB2ZGV2X3N0YXRlX3Q7CkBAIC0xMTU4LDYg KzExNjksNyBAQAogCXZkZXZfcGh5c19yZWFkX3QgKnZfcGh5c19yZWFkOwkvKiByZWFkIGZyb20g cmF3IGxlYWYgdmRldiAqLwogCXZkZXZfcmVhZF90CSp2X3JlYWQ7CS8qIHJlYWQgZnJvbSB2ZGV2 ICovCiAJdm9pZAkJKnZfcmVhZF9wcml2OwkvKiBwcml2YXRlIGRhdGEgZm9yIHJlYWQgZnVuY3Rp b24gKi8KKwlpbnQJCXZfaW5pdGVkOwogfSB2ZGV2X3Q7CiAKIC8qCg== --00504502b672cac4d00478d6ed9a--