Date:      Fri, 20 Nov 2009 16:46:54 -0800
From:      Matt Reimer <mattjreimer@gmail.com>
To:        fs@freebsd.org
Subject:   Current gptzfsboot limitations
Message-ID:  <f383264b0911201646s702c8aa4u5e50a71f93a9e4eb@mail.gmail.com>


[-- Attachment #1 --]
I've been analyzing gptzfsboot to see what its limitations are. I
think it should now work fine for a healthy pool with any number of
disks, with any type of vdev, whether single disk, stripe, mirror,
raidz or raidz2.

But there are currently several limitations (likely in loader.zfs
too), mostly due to the limited amount of memory available (< 640KB)
and the simple memory allocators used (a minimal malloc() with no
working free(), and zfs_alloc_temp()).
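
For context, zfs_alloc_temp() is basically a bump allocator carving
allocations out of a fixed buffer, roughly like the sketch below (an
approximation from memory, not the exact source):

#include <stddef.h>	/* size_t */

#define TEMP_SIZE	(128 * 1024)	/* the 128KB default */

static char	zfs_temp_buf[TEMP_SIZE];
static char	*zfs_temp_ptr = zfs_temp_buf;

static void *
zfs_alloc_temp(size_t sz)
{
	char *p;

	/* No free(): space is only ever handed out, never returned. */
	if (zfs_temp_ptr + sz > zfs_temp_buf + TEMP_SIZE)
		return (NULL);	/* buffer exhausted */
	p = zfs_temp_ptr;
	zfs_temp_ptr += sz;
	return (p);
}

Since nothing ever comes back, any code path that allocates
aggressively starves everything that runs after it.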

1. gptzfsboot might fail to read compressed files on raidz/raidz2
pools. The reason is that the temporary buffer used for I/O
(zfs_temp_buf in zfsimpl.c) is 128KB by default, but a 128KB
compressed block will require a 128KB buffer to be allocated before
the I/O is done, leaving nothing for the raidz code further on. The
fix would be to make the temporary buffer larger (a sketch of the
naive change follows this item), but for some reason it's not as
simple as just changing the TEMP_SIZE define (possibly a stack
overflow results; more debugging is needed).
Workaround: don't enable compression on your root filesystem (aka
bootfs).
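
Here's the naive version of that fix, just bumping the define (with
512KB picked out of the air and modulo the exact form of the define;
as noted above it doesn't seem to be enough on its own):

-#define TEMP_SIZE	(128 * 1024)
+#define TEMP_SIZE	(512 * 1024)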

2. gptzfsboot might fail to reconstruct a file that is read from a
degraded raidz/raidz2 pool, or if the file is corrupt somehow (i.e.
the pool is healthy but the checksums don't match). The reason again
is that the temporary buffer gets exhausted. I think this will only
happen in the case where more than one physical block is corrupt or
unreadable. The fix has several aspects: 1) make the temporary buffer
much larger, perhaps larger than 640KB; 2) change
zfssubr.c:vdev_raidz_read() to reuse the temp buffers it allocates
when possible; and 3) either restructure
zfssubr.c:vdev_raidz_reconstruct_pq() to only allocate its temporary
buffers once per I/O, or use a malloc that has free() implemented.
Workaround: repair your pool somehow (e.g. by booting via pxeboot) so
that at most one disk is bad.
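
For aspect 2 of the fix above, I imagine something along these lines
(sketch only; it assumes a fixed upper bound on the number of columns
and a single worst-case buffer size, both of which would need
checking):

#define	RAIDZ_MAX_COLS	16	/* assumed bound, illustration only */

static void	*raidz_col_buf[RAIDZ_MAX_COLS];

/*
 * Hand back the same temp buffer for a given column on every pass
 * through the recovery path, instead of bump-allocating a fresh
 * buffer each time.
 */
static void *
raidz_temp_buf(int col, size_t sz)
{
	if (raidz_col_buf[col] == NULL)
		raidz_col_buf[col] = zfs_alloc_temp(sz);
	return (raidz_col_buf[col]);
}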

3. gptzfsboot might fail to boot from a degraded pool that has one or
more drives marked offline, removed, or faulted. The reason is that
vdev_probe() assumes that all vdevs are healthy, regardless of their
true state. gptzfsboot will then read from an offline/removed/faulted
vdev as if it were healthy, likely producing checksum failures; those
in turn trigger the recovery code path in vdev_raidz_read(), possibly
leading to zfs_temp_buf exhaustion as in #2 above.
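
A fix needs two parts: decode each child's state from its vdev_label
(which the attached patch attempts), and then have the read paths
skip children in a bad state, e.g. with a predicate like this
(sketch; the function name is made up):

/* Don't issue I/O to a child whose label marks it unusable. */
static int
vdev_readable(const vdev_t *vdev)
{
	return (vdev->v_state != VDEV_STATE_OFFLINE &&
	    vdev->v_state != VDEV_STATE_REMOVED &&
	    vdev->v_state != VDEV_STATE_FAULTED);
}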

A partial patch for #3 is attached, but it is inadequate: it only
reads a vdev's status from the first device's (in BIOS order)
vdev_label. So if the first device is the one marked offline,
gptzfsboot won't notice, because only the other devices' vdev_labels
record that the first device is offline. (Once a device is offlined,
no further writes are made to it, so its own vdev_label is never
updated to say it's offline.) To complete the patch it would be
necessary to set each leaf vdev's status from the newest vdev_label
rather than from the first vdev_label seen.
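
If the label nvlist carries the pool txg (ZPOOL_CONFIG_POOL_TXG) the
way the kernel's labels do, the missing piece might look roughly like
this in vdev_init() (sketch only; v_label_txg would be a new vdev_t
field initialized to 0, and 'state' is the value decoded as in the
patch):

uint64_t txg;

if (!nvlist_find(nvlist, ZPOOL_CONFIG_POOL_TXG,
		 DATA_TYPE_UINT64, 0, &txg) &&
		 txg >= vdev->v_label_txg) {
	vdev->v_state = state;		/* decoded as in the patch */
	vdev->v_label_txg = txg;	/* remember which label won */
}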

I think I've also hit a stack overflow a couple of times while debugging.

I don't know enough about the gptzfsboot/loader.zfs environment to
know whether the heap size could be easily enlarged, or whether there
is room for a real malloc() with free(). loader(8) seems to use the
malloc() in libstand. Can anyone shed some light on the memory
limitations and possible solutions?
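
(For what it's worth, libstand's malloc()/free() just need a heap
region handed to setheap(), declared in stand.h; roughly:

extern void	setheap(void *base, void *top);	/* stand.h */

static char	heap[64 * 1024];	/* whatever space can be spared */

static void
init_heap(void)
{
	setheap(heap, heap + sizeof(heap));
	/* From here on, malloc() and free() both actually work. */
}

Whether gptzfsboot can spare that kind of space below 640KB is
exactly the question.)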

I won't be able to spend much more time on this, but I wanted to pass
on what I've learned in case someone else has the time and boot fu to
take it the next step.

Matt

[-- Attachment #2 --]
--- zfs/zfsimpl.c.orig	2009-10-24 18:10:29.000000000 -0700
+++ zfs/zfsimpl.c	2009-11-20 16:44:49.000000000 -0800
@@ -396,6 +396,7 @@
 	vdev->v_read = read;
 	vdev->v_phys_read = 0;
 	vdev->v_read_priv = 0;
+	vdev->v_inited = 0;
 	STAILQ_INSERT_TAIL(&zfs_vdevs, vdev, v_alllink);
 
 	return (vdev);
@@ -411,6 +412,7 @@
 	vdev_t *vdev, *kid;
 	const unsigned char *kids;
 	int nkids, i;
+	uint64_t is_offline, is_faulted, is_degraded, is_removed;
 
 	if (nvlist_find(nvlist, ZPOOL_CONFIG_GUID,
 			DATA_TYPE_UINT64, 0, &guid)
@@ -478,6 +480,26 @@
 			vdev->v_name = strdup(type);
 		}
 	}
+
+	if (!nvlist_find(nvlist, ZPOOL_CONFIG_OFFLINE,
+			 DATA_TYPE_UINT64, 0, &is_offline) &&
+			 is_offline) {
+		vdev->v_state = VDEV_STATE_OFFLINE;
+	} else if (!nvlist_find(nvlist, ZPOOL_CONFIG_REMOVED,
+				DATA_TYPE_UINT64, 0, &is_removed) &&
+				is_removed) {
+		vdev->v_state = VDEV_STATE_REMOVED;
+	} else if (!nvlist_find(nvlist, ZPOOL_CONFIG_FAULTED,
+				DATA_TYPE_UINT64, 0, &is_faulted) &&
+				is_faulted) {
+		vdev->v_state = VDEV_STATE_FAULTED;
+	} else if (!nvlist_find(nvlist, ZPOOL_CONFIG_DEGRADED,
+				DATA_TYPE_UINT64, 0, &is_degraded) &&
+				is_degraded) {
+		vdev->v_state = VDEV_STATE_DEGRADED;
+	} else
+		vdev->v_state = VDEV_STATE_HEALTHY;
+
 	rc = nvlist_find(nvlist, ZPOOL_CONFIG_CHILDREN,
 			 DATA_TYPE_NVLIST_ARRAY, &nkids, &kids);
 	/*
@@ -591,7 +613,9 @@
 		"UNKNOWN",
 		"CLOSED",
 		"OFFLINE",
+		"REMOVED",
 		"CANT_OPEN",
+		"FAULTED",
 		"DEGRADED",
 		"ONLINE"
 	};
@@ -806,7 +830,7 @@
 		return (EIO);
 	}
 	vdev = vdev_find(guid);
-	if (vdev && vdev->v_state == VDEV_STATE_HEALTHY) {
+	if (vdev && vdev->v_inited) {
 		return (EIO);
 	}
 
@@ -836,7 +860,7 @@
 	if (vdev) {
 		vdev->v_phys_read = read;
 		vdev->v_read_priv = read_priv;
-		vdev->v_state = VDEV_STATE_HEALTHY;
+		vdev->v_inited = 1;
 	} else {
 		printf("ZFS: inconsistent nvlist contents\n");
 		return (EIO);
--- zfsimpl.h.orig	2009-05-16 03:48:20.000000000 -0700
+++ zfsimpl.h	2009-11-13 17:32:06.000000000 -0800
@@ -528,7 +528,6 @@
 #define	ZPOOL_CONFIG_DTL		"DTL"
 #define	ZPOOL_CONFIG_STATS		"stats"
 #define	ZPOOL_CONFIG_WHOLE_DISK		"whole_disk"
-#define	ZPOOL_CONFIG_OFFLINE		"offline"
 #define	ZPOOL_CONFIG_ERRCOUNT		"error_count"
 #define	ZPOOL_CONFIG_NOT_PRESENT	"not_present"
 #define	ZPOOL_CONFIG_SPARES		"spares"
@@ -538,6 +537,16 @@
 #define	ZPOOL_CONFIG_HOSTNAME		"hostname"
 #define	ZPOOL_CONFIG_TIMESTAMP		"timestamp" /* not stored on disk */
 
+/*
+ * The persistent vdev state is stored as separate values rather than a single
+ * 'vdev_state' entry.  This is because a device can be in multiple states, such
+ * as offline and degraded.
+ */
+#define ZPOOL_CONFIG_OFFLINE            "offline"
+#define ZPOOL_CONFIG_FAULTED            "faulted"
+#define ZPOOL_CONFIG_DEGRADED           "degraded"
+#define ZPOOL_CONFIG_REMOVED            "removed"
+
 #define	VDEV_TYPE_ROOT			"root"
 #define	VDEV_TYPE_MIRROR		"mirror"
 #define	VDEV_TYPE_REPLACING		"replacing"
@@ -570,7 +579,9 @@
 	VDEV_STATE_UNKNOWN = 0,	/* Uninitialized vdev			*/
 	VDEV_STATE_CLOSED,	/* Not currently open			*/
 	VDEV_STATE_OFFLINE,	/* Not allowed to open			*/
+	VDEV_STATE_REMOVED,	/* Explicitly removed from system	*/
 	VDEV_STATE_CANT_OPEN,	/* Tried to open, but failed		*/
+	VDEV_STATE_FAULTED,	/* External request to fault device	*/
 	VDEV_STATE_DEGRADED,	/* Replicated vdev with unhealthy kids	*/
 	VDEV_STATE_HEALTHY	/* Presumed good			*/
 } vdev_state_t;
@@ -1158,6 +1169,7 @@
 	vdev_phys_read_t *v_phys_read;	/* read from raw leaf vdev */
 	vdev_read_t	*v_read;	/* read from vdev */
 	void		*v_read_priv;	/* private data for read function */
+	int		v_inited;
 } vdev_t;
 
 /*