Date:        Mon, 23 Nov 2009 10:18:40 -0500
From:        John Baldwin <jhb@freebsd.org>
To:          freebsd-fs@freebsd.org
Subject:     Re: Current gptzfsboot limitations
Message-ID:  <200911231018.40815.jhb@freebsd.org>
In-Reply-To: <f383264b0911201646s702c8aa4u5e50a71f93a9e4eb@mail.gmail.com>
References:  <f383264b0911201646s702c8aa4u5e50a71f93a9e4eb@mail.gmail.com>
On Friday 20 November 2009 7:46:54 pm Matt Reimer wrote:
> I've been analyzing gptzfsboot to see what its limitations are. I
> think it should now work fine for a healthy pool with any number of
> disks, with any type of vdev, whether single disk, stripe, mirror,
> raidz or raidz2.
>
> But there are currently several limitations (likely in loader.zfs
> too), mostly due to the limited amount of memory available (< 640KB)
> and the simple memory allocators used (a simple malloc() and
> zfs_alloc_temp()).
>
> 1. gptzfsboot might fail to read compressed files on raidz/raidz2
> pools. The reason is that the temporary buffer used for I/O
> (zfs_temp_buf in zfsimpl.c) is 128KB by default, but a 128KB
> compressed block will require a 128KB buffer to be allocated before
> the I/O is done, leaving nothing for the raidz code further on. The
> fix would be to make the temporary buffer larger, but for some
> reason it's not as simple as just changing the TEMP_SIZE define
> (possibly a stack overflow results; more debugging needed).
> Workaround: don't enable compression on your root filesystem (aka
> bootfs).
>
> 2. gptzfsboot might fail to reconstruct a file that is read from a
> degraded raidz/raidz2 pool, or if the file is corrupt somehow (i.e.
> the pool is healthy but the checksums don't match). The reason again
> is that the temporary buffer gets exhausted. I think this will only
> happen in the case where more than one physical block is corrupt or
> unreadable. The fix has several aspects: 1) make the temporary buffer
> much larger, perhaps larger than 640KB; 2) change
> zfssubr.c:vdev_raidz_read() to reuse the temp buffers it allocates
> when possible; and 3) either restructure
> zfssubr.c:vdev_raidz_reconstruct_pq() to only allocate its temporary
> buffers once per I/O, or use a malloc that has free() implemented.
> Workaround: repair your pool somehow (e.g. pxeboot) so that one or no
> disks are bad.
>
> 3. gptzfsboot might fail to boot from a degraded pool that has one or
> more drives marked offline, removed, or faulted. The reason is that
> vdev_probe() assumes that all vdevs are healthy, regardless of their
> true state. gptzfsboot will then read from an offline/removed/faulted
> vdev as if it were healthy, likely resulting in failed checksums and
> in the recovery code path being run in vdev_raidz_read(), possibly
> leading to zfs_temp_buf exhaustion as in #2 above.
>
> A partial patch for #3 is attached, but it is inadequate because it
> only reads a vdev's status from the first device's (in BIOS order)
> vdev_label. As a result, if the first device is marked offline,
> gptzfsboot won't see this, because only the other devices'
> vdev_labels will indicate that the first device is offline. (Since
> no further writes are made to a device after it is offlined, its
> vdev_label is not updated to reflect that it's offline.) To complete
> the patch it would be necessary to set each leaf vdev's status from
> the newest vdev_label rather than from the first vdev_label seen.
>
> I think I've also hit a stack overflow a couple of times while debugging.
>
> I don't know enough about the gptzfsboot/loader.zfs environment to
> know whether the heap size could be easily enlarged, or whether there
> is room for a real malloc() with free(). loader(8) seems to use the
> malloc() in libstand. Can anyone shed some light on the memory
> limitations and possible solutions?
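On the #3 side, the "newest label wins" rule Matt describes might look
roughly like the sketch below.  The types and names here are simplified
stand-ins, not the real zfsimpl.c structures, so treat it as an outline of
the idea rather than a patch:

/*
 * Sketch of taking a leaf vdev's state from the vdev_label with the
 * highest txg instead of from the first label seen.  Structures and
 * names are simplified stand-ins for illustration only.
 */
#include <stdint.h>
#include <stdio.h>

#define VDEV_STATE_OFFLINE	0
#define VDEV_STATE_HEALTHY	1

struct label_info {
	uint64_t txg;		/* pool txg recorded in this vdev_label */
	int	 state;		/* this leaf's state as recorded there */
};

struct leaf_vdev {
	uint64_t best_txg;	/* newest label seen so far for this leaf */
	int	 state;		/* state taken from that newest label */
};

/*
 * Called once per vdev_label found while probing the disks.  Only the
 * state from the label with the highest txg is kept, so a stale label on
 * the offlined device itself (which still claims "healthy") is overridden
 * by the newer labels written to the surviving devices.
 */
static void
vdev_update_state(struct leaf_vdev *vd, const struct label_info *label)
{

	if (label->txg >= vd->best_txg) {
		vd->best_txg = label->txg;
		vd->state = label->state;
	}
}

int
main(void)
{
	struct leaf_vdev disk0 = { 0, VDEV_STATE_HEALTHY };
	struct label_info stale = { 100, VDEV_STATE_HEALTHY };
	struct label_info newer = { 120, VDEV_STATE_OFFLINE };

	vdev_update_state(&disk0, &stale);	/* disk0's own stale label */
	vdev_update_state(&disk0, &newer);	/* newer label from a survivor */
	printf("disk0 state: %s\n",
	    disk0.state == VDEV_STATE_OFFLINE ? "offline" : "healthy");
	return (0);
}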
> I won't be able to spend much more time on this, but I wanted to pass
> on what I've learned in case someone else has the time and boot fu to
> take it the next step.

One issue is that disk transfers need to happen in the lower 1MB due to BIOS
limitations.  The loader uses a bounce buffer (in biosdisk.c in libi386) to
make this work ok.  The loader uses memory > 1MB for malloc().  You could
probably change zfsboot to do that as well if it doesn't already.  Just note
that drvread() has to bounce buffer requests in that case.

The text + data + bss + stack is all in the lower 640k and there's not much
you can do about that.  The stack grows down from 640k, and the boot program
text + data starts at 64k with the bss following.

Hmm, drvread() might already be bounce buffering, since boot2 has to do so
because it copies the loader up to memory > 1MB as well.  You might need to
use memory > 2MB for zfsboot's malloc() so that the loader can still be
copied up to the memory at 1MB.  It looks like you could patch malloc() in
zfsboot.c to use 4*1024*1024 as heap_next and maybe 64*1024*1024 as heap_end
(this assumes all machines that boot ZFS have at least 64MB of RAM, which is
probably safe).

-- 
John Baldwin
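For what it's worth, a bump allocator with the heap bounds suggested above
might look roughly like this.  It is only a sketch: the function is named
bump_malloc() so it can be compiled and run stand-alone, the heap is
simulated with a static array, and in the real zfsboot.c the bounds would
simply be the physical addresses 4*1024*1024 and 64*1024*1024 (with any
buffer handed to the BIOS still bounced below 1MB):

#include <stddef.h>
#include <stdio.h>

static char *heap_next;		/* next free byte in the heap */
static char *heap_end;		/* one past the last usable byte */

static void
heap_init(void)
{
	/*
	 * In zfsboot.c this would be roughly:
	 *	heap_next = (char *)(4*1024*1024);
	 *	heap_end  = (char *)(64*1024*1024);
	 * Here a static array stands in so the sketch runs in userland.
	 */
	static char sim_heap[1024 * 1024];

	heap_next = sim_heap;
	heap_end = sim_heap + sizeof(sim_heap);
}

/* Grow-only allocator in the style of the boot code: there is no free(). */
static void *
bump_malloc(size_t n)
{
	char *p;

	n = (n + 15) & ~(size_t)15;	/* keep allocations 16-byte aligned */
	if ((size_t)(heap_end - heap_next) < n)
		return (NULL);		/* heap exhausted */
	p = heap_next;
	heap_next += n;
	return (p);
}

int
main(void)
{
	heap_init();
	void *buf = bump_malloc(128 * 1024);	/* e.g. one 128KB temp buffer */
	printf("allocated %p, %zu bytes left\n", buf,
	    (size_t)(heap_end - heap_next));
	return (0);
}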