From: John Baldwin <jhb@freebsd.org>
To: freebsd-fs@freebsd.org
Date: Mon, 23 Nov 2009 10:18:40 -0500
Message-Id: <200911231018.40815.jhb@freebsd.org>
Subject: Re: Current gptzfsboot limitations

On Friday 20 November 2009 7:46:54 pm Matt Reimer wrote:
> I've been analyzing gptzfsboot to see what its limitations are. I
> think it should now work fine for a healthy pool with any number of
> disks, with any type of vdev, whether single disk, stripe, mirror,
> raidz or raidz2.
>
> But there are currently several limitations (likely in loader.zfs
> too), mostly due to the limited amount of memory available (< 640KB)
> and the simple memory allocators used (a simple malloc() and
> zfs_alloc_temp()).
>
> 1. gptzfsboot might fail to read compressed files on raidz/raidz2
> pools. The reason is that the temporary buffer used for I/O
> (zfs_temp_buf in zfsimpl.c) is 128KB by default, but a 128KB
> compressed block will require a 128KB buffer to be allocated before
> the I/O is done, leaving nothing for the raidz code further on. The
> fix would be to make the temporary buffer larger, but for some
> reason it's not as simple as just changing the TEMP_SIZE define
> (possibly a stack overflow results; more debugging needed).
> Workaround: don't enable compression on your root filesystem (aka
> bootfs).
>
> 2. gptzfsboot might fail to reconstruct a file that is read from a
> degraded raidz/raidz2 pool, or if the file is corrupt somehow (i.e.
> the pool is healthy but the checksums don't match). The reason again
> is that the temporary buffer gets exhausted. I think this will only
> happen in the case where more than one physical block is corrupt or
> unreadable.
> The fix has several aspects: 1) make the temporary buffer
> much larger, perhaps larger than 640KB; 2) change
> zfssubr.c:vdev_raidz_read() to reuse the temp buffers it allocates
> when possible; and 3) either restructure
> zfssubr.c:vdev_raidz_reconstruct_pq() to only allocate its temporary
> buffers once per I/O, or use a malloc that has free() implemented.
> Workaround: repair your pool somehow (e.g. pxeboot) so one or no disks
> are bad.
>
> 3. gptzfsboot might fail to boot from a degraded pool that has one or
> more drives marked offline, removed, or faulted. The reason is that
> vdev_probe() assumes that all vdevs are healthy, regardless of their
> true state. gptzfsboot will then read from an offline/removed/faulted
> vdev as if it were healthy, likely resulting in failed checksums,
> which causes the recovery code path in vdev_raidz_read() to run,
> possibly leading to zfs_temp_buf exhaustion as in #2 above.
>
> A partial patch for #3 is attached, but it is inadequate because it
> only reads a vdev's status from the first device's (in BIOS order)
> vdev_label. The result is that if the first device is marked
> offline, gptzfsboot won't see this, because only the other devices'
> vdev_labels will indicate that the first device is offline. (Since
> no further writes are made to a device after it is offlined, its
> vdev_label is not updated to reflect that it's offline.)
> To complete the patch it would be necessary to set each leaf vdev's
> status from the newest vdev_label rather than from the first
> vdev_label seen.
>
> I think I've also hit a stack overflow a couple of times while debugging.
>
> I don't know enough about the gptzfsboot/loader.zfs environment to
> know whether the heap size could be easily enlarged, or whether there
> is room for a real malloc() with free(). loader(8) seems to use the
> malloc() in libstand. Can anyone shed some light on the memory
> limitations and possible solutions?
>
> I won't be able to spend much more time on this, but I wanted to pass
> on what I've learned in case someone else has the time and boot fu to
> take it the next step.

One issue is that disk transfers need to happen in the lower 1MB due to
BIOS limitations.  The loader uses a bounce buffer (in biosdisk.c in
libi386) to make this work.  The loader uses memory > 1MB for malloc();
you could probably change zfsboot to do that as well if it doesn't
already.  Just note that drvread() has to bounce buffer requests in that
case.  The text + data + bss + stack is all in the lower 640k and there's
not much you can do about that.  The stack grows down from 640k, and the
boot program text + data starts at 64k with the bss following.

Hmm, drvread() might already be bounce buffering, since boot2 has to do
so when it copies the loader up to memory > 1MB.  You might need to use
memory > 2MB for zfsboot's malloc() so that the loader can still be
copied up to the 1MB mark.  It looks like you could patch malloc() in
zfsboot.c to use 4*1024*1024 as heap_next and maybe 64*1024*1024 as
heap_end (this assumes all machines that boot ZFS have at least 64MB of
RAM, which is probably safe).

-- 
John Baldwin
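
A minimal standalone model of the allocation pattern behind limitations
#1 and #2, assuming a bump-style zfs_alloc_temp() over a fixed
zfs_temp_buf as described in the message; this is an illustrative sketch,
not the actual zfsimpl.c source:

/*
 * Illustrative model of the temporary-buffer allocator: allocations only
 * advance a pointer through a fixed buffer and are never freed
 * individually, so a single 128KB request for a compressed block leaves
 * nothing for the raidz read/reconstruct path that runs afterwards.
 */
#include <stddef.h>
#include <stdio.h>

#define TEMP_SIZE	(128 * 1024)	/* the 128KB default cited above */

static char  zfs_temp_buf[TEMP_SIZE];
static char *zfs_temp_ptr = zfs_temp_buf;
static char *zfs_temp_end = zfs_temp_buf + TEMP_SIZE;

static void *
zfs_alloc_temp(size_t sz)
{
	char *p;

	if (sz > (size_t)(zfs_temp_end - zfs_temp_ptr)) {
		/* This is the exhaustion described in #1 and #2. */
		printf("ZFS: out of temporary buffer space\n");
		return (NULL);
	}
	p = zfs_temp_ptr;
	zfs_temp_ptr += sz;
	return (p);
}

int
main(void)
{
	/* A 128KB compressed block consumes the whole buffer ... */
	void *cbuf = zfs_alloc_temp(128 * 1024);
	/* ... so a later request from the raidz code fails. */
	void *rbuf = zfs_alloc_temp(4 * 1024);

	printf("compressed block buffer %p, raidz buffer %p\n", cbuf, rbuf);
	return (0);
}

Since the whole boot program lives below 640KB, simply raising TEMP_SIZE
competes with the stack and heap for the same space, which would be
consistent with the stack-overflow symptom mentioned above.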
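
A hedged sketch of completing the partial patch for limitation #3: take a
leaf vdev's state from the newest vdev_label (highest txg) instead of
from the first label seen in BIOS order.  The struct and field names here
(label_info, li_txg, li_state) are hypothetical stand-ins; the real fix
would pull these values out of the label nvlists that vdev_probe()
already reads.

#include <stddef.h>
#include <stdint.h>

struct label_info {
	uint64_t li_txg;	/* txg at which this copy of the label was written */
	int	 li_state;	/* vdev state recorded in that label */
};

/*
 * Return the state recorded by the most recently written label, so that a
 * device whose own stale labels still claim it is healthy gets reported
 * as offline/removed/faulted when its peers' newer labels say so.
 */
int
vdev_state_from_newest_label(const struct label_info *labels, size_t nlabels,
    int default_state)
{
	uint64_t best_txg = 0;
	int state = default_state;
	size_t i;

	for (i = 0; i < nlabels; i++) {
		if (labels[i].li_txg >= best_txg) {
			best_txg = labels[i].li_txg;
			state = labels[i].li_state;
		}
	}
	return (state);
}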
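
A sketch of the heap placement suggested in the reply, plus the bounce
buffering drvread() needs once malloc() hands out memory the BIOS cannot
reach.  heap_next, heap_end and drvread() are names used in the message
above; bios_read(), bounce_buf and the simplified drvread() signature are
placeholders for illustration, not the real zfsboot code.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Heap above 2MB so the loader can still be copied up to the 1MB region. */
static uintptr_t heap_next = 4 * 1024 * 1024;	/* 4MB: first heap address */
static uintptr_t heap_end  = 64 * 1024 * 1024;	/* assumes >= 64MB of RAM */

static void *
malloc(size_t sz)
{
	uintptr_t p;

	sz = (sz + 15) & ~(size_t)15;		/* keep 16-byte alignment */
	if (sz > heap_end - heap_next)
		return (NULL);			/* heap exhausted */
	p = heap_next;
	heap_next += sz;
	return ((void *)p);
}

/*
 * BIOS disk transfers must land below 1MB, so reads into a high heap
 * buffer go through a static bounce buffer (which, like the rest of the
 * boot program's data, sits in the low 640KB) and are copied up.
 */
#define BOUNCE_SIZE	(64 * 1024)
static char bounce_buf[BOUNCE_SIZE];

/* Placeholder for the real INT 13h call the boot code makes. */
int bios_read(unsigned drive, uint64_t lba, unsigned nblk, void *lowbuf);

static int
drvread(unsigned drive, void *buf, uint64_t lba, unsigned nblk)
{
	if ((size_t)nblk * 512 > BOUNCE_SIZE)
		return (-1);		/* real code would loop in chunks */
	if (bios_read(drive, lba, nblk, bounce_buf) != 0)
		return (-1);
	memcpy(buf, bounce_buf, (size_t)nblk * 512);
	return (0);
}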