From: John Baldwin <jhb@freebsd.org>
To: freebsd-fs@freebsd.org
Date: Mon, 23 Nov 2009 10:18:40 -0500
Message-Id: <200911231018.40815.jhb@freebsd.org>
Subject: Re: Current gptzfsboot limitations

On Friday 20 November 2009 7:46:54 pm Matt Reimer wrote:
> I've been analyzing gptzfsboot to see what its limitations are. I
> think it should now work fine for a healthy pool with any number of
> disks, with any type of vdev, whether single disk, stripe, mirror,
> raidz or raidz2.
>
> But there are currently several limitations (likely in loader.zfs
> too), mostly due to the limited amount of memory available (< 640KB)
> and the simple memory allocators used (a simple malloc() and
> zfs_alloc_temp()).
>
> 1. gptzfsboot might fail to read compressed files on raidz/raidz2
> pools. The reason is that the temporary buffer used for I/O
> (zfs_temp_buf in zfsimpl.c) is 128KB by default, but a 128KB
> compressed block will require a 128KB buffer to be allocated before
> the I/O is done, leaving nothing for the raidz code further on. The
> fix would be to make the temporary buffer larger, but for some
> reason it's not as simple as just changing the TEMP_SIZE define
> (possibly a stack overflow results; more debugging needed).
> Workaround: don't enable compression on your root filesystem (aka
> bootfs).
>
> 2. gptzfsboot might fail to reconstruct a file that is read from a
> degraded raidz/raidz2 pool, or if the file is corrupt somehow (i.e.
> the pool is healthy but the checksums don't match). The reason again
> is that the temporary buffer gets exhausted. I think this will only
> happen in the case where more than one physical block is corrupt or
> unreadable.
> The fix has several aspects: 1) make the temporary buffer
> much larger, perhaps larger than 640KB; 2) change
> zfssubr.c:vdev_raidz_read() to reuse the temp buffers it allocates
> when possible; and 3) either restructure
> zfssubr.c:vdev_raidz_reconstruct_pq() to only allocate its temporary
> buffers once per I/O, or use a malloc that has free() implemented.
> Workaround: repair your pool somehow (e.g. pxeboot) so one or no disks
> are bad.
>
> 3. gptzfsboot might fail to boot from a degraded pool that has one or
> more drives marked offline, removed, or faulted. The reason is that
> vdev_probe() assumes that all vdevs are healthy, regardless of their
> true state. gptzfsboot will then read from an offline/removed/faulted
> vdev as if it were healthy, likely resulting in failed checksums,
> which causes the recovery code path in vdev_raidz_read() to run,
> possibly leading to zfs_temp_buf exhaustion as in #2 above.
>
> A partial patch for #3 is attached, but it is inadequate because it
> only reads a vdev's status from the first device's (in BIOS order)
> vdev_label. The result is that if the first device is marked
> offline, gptzfsboot won't see this, because only the other devices'
> vdev_labels will indicate that the first device is offline. (Since
> no further writes are made to a device after it is offlined, its
> vdev_label is not updated to reflect that it's offline.)
> To complete the patch it would be necessary to set each leaf vdev's
> status from the newest vdev_label rather than from the first
> vdev_label seen.
>
> I think I've also hit a stack overflow a couple of times while debugging.
>
> I don't know enough about the gptzfsboot/loader.zfs environment to
> know whether the heap size could be easily enlarged, or whether there
> is room for a real malloc() with free(). loader(8) seems to use the
> malloc() in libstand. Can anyone shed some light on the memory
> limitations and possible solutions?
>
> I won't be able to spend much more time on this, but I wanted to pass
> on what I've learned in case someone else has the time and boot fu to
> take it the next step.

One issue is that disk transfers need to happen in the lower 1MB due to
BIOS limitations.  The loader uses a bounce buffer (in biosdisk.c in
libi386) to make this work.  The loader uses memory > 1MB for malloc();
you could probably change zfsboot to do that as well if it doesn't
already.  Just note that drvread() has to bounce buffer requests in that
case.  The text + data + bss + stack is all in the lower 640k and there's
not much you can do about that.  The stack grows down from 640k, and the
boot program text + data starts at 64k with the bss following.

Hmm, drvread() might already be bounce buffering, since boot2 has to do
so when it copies the loader up to memory > 1MB.  You might need to use
memory > 2MB for zfsboot's malloc() so that the loader can still be
copied up to the 1MB mark.  It looks like you could patch malloc() in
zfsboot.c to use 4*1024*1024 as heap_next and maybe 64*1024*1024 as
heap_end (this assumes all machines that boot ZFS have at least 64MB of
RAM, which is probably safe).

-- 
John Baldwin
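
A minimal standalone model of the allocation pattern behind limitations
#1 and #2, assuming a bump-style zfs_alloc_temp() over a fixed
zfs_temp_buf as described in the message; this is an illustrative sketch,
not the actual zfsimpl.c source:

/*
 * Illustrative model of the temporary-buffer allocator: allocations only
 * advance a pointer through a fixed buffer and are never freed
 * individually, so a single 128KB request for a compressed block leaves
 * nothing for the raidz read/reconstruct path that runs afterwards.
 */
#include <stddef.h>
#include <stdio.h>

#define TEMP_SIZE	(128 * 1024)	/* the 128KB default cited above */

static char  zfs_temp_buf[TEMP_SIZE];
static char *zfs_temp_ptr = zfs_temp_buf;
static char *zfs_temp_end = zfs_temp_buf + TEMP_SIZE;

static void *
zfs_alloc_temp(size_t sz)
{
	char *p;

	if (sz > (size_t)(zfs_temp_end - zfs_temp_ptr)) {
		/* This is the exhaustion described in #1 and #2. */
		printf("ZFS: out of temporary buffer space\n");
		return (NULL);
	}
	p = zfs_temp_ptr;
	zfs_temp_ptr += sz;
	return (p);
}

int
main(void)
{
	/* A 128KB compressed block consumes the whole buffer ... */
	void *cbuf = zfs_alloc_temp(128 * 1024);
	/* ... so a later request from the raidz code fails. */
	void *rbuf = zfs_alloc_temp(4 * 1024);

	printf("compressed block buffer %p, raidz buffer %p\n", cbuf, rbuf);
	return (0);
}

Since the whole boot program lives below 640KB, simply raising TEMP_SIZE
competes with the stack and heap for the same space, which would be
consistent with the stack-overflow symptom mentioned above.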
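
A hedged sketch of completing the partial patch for limitation #3: take a
leaf vdev's state from the newest vdev_label (highest txg) instead of
from the first label seen in BIOS order.  The struct and field names here
(label_info, li_txg, li_state) are hypothetical stand-ins; the real fix
would pull these values out of the label nvlists that vdev_probe()
already reads.

#include <stddef.h>
#include <stdint.h>

struct label_info {
	uint64_t li_txg;	/* txg at which this copy of the label was written */
	int	 li_state;	/* vdev state recorded in that label */
};

/*
 * Return the state recorded by the most recently written label, so that a
 * device whose own stale labels still claim it is healthy gets reported
 * as offline/removed/faulted when its peers' newer labels say so.
 */
int
vdev_state_from_newest_label(const struct label_info *labels, size_t nlabels,
    int default_state)
{
	uint64_t best_txg = 0;
	int state = default_state;
	size_t i;

	for (i = 0; i < nlabels; i++) {
		if (labels[i].li_txg >= best_txg) {
			best_txg = labels[i].li_txg;
			state = labels[i].li_state;
		}
	}
	return (state);
}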
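
A sketch of the heap placement suggested in the reply, plus the bounce
buffering drvread() needs once malloc() hands out memory the BIOS cannot
reach.  heap_next, heap_end and drvread() are names used in the message
above; bios_read(), bounce_buf and the simplified drvread() signature are
placeholders for illustration, not the real zfsboot code.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Heap above 2MB so the loader can still be copied up to the 1MB region. */
static uintptr_t heap_next = 4 * 1024 * 1024;	/* 4MB: first heap address */
static uintptr_t heap_end  = 64 * 1024 * 1024;	/* assumes >= 64MB of RAM */

static void *
malloc(size_t sz)
{
	uintptr_t p;

	sz = (sz + 15) & ~(size_t)15;		/* keep 16-byte alignment */
	if (sz > heap_end - heap_next)
		return (NULL);			/* heap exhausted */
	p = heap_next;
	heap_next += sz;
	return ((void *)p);
}

/*
 * BIOS disk transfers must land below 1MB, so reads into a high heap
 * buffer go through a static bounce buffer (which, like the rest of the
 * boot program's data, sits in the low 640KB) and are copied up.
 */
#define BOUNCE_SIZE	(64 * 1024)
static char bounce_buf[BOUNCE_SIZE];

/* Placeholder for the real INT 13h call the boot code makes. */
int bios_read(unsigned drive, uint64_t lba, unsigned nblk, void *lowbuf);

static int
drvread(unsigned drive, void *buf, uint64_t lba, unsigned nblk)
{
	if ((size_t)nblk * 512 > BOUNCE_SIZE)
		return (-1);		/* real code would loop in chunks */
	if (bios_read(drive, lba, nblk, bounce_buf) != 0)
		return (-1);
	memcpy(buf, bounce_buf, (size_t)nblk * 512);
	return (0);
}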