Date: Sun, 7 Nov 2004 11:40:59 +0100 (CET)
From: Lukas Ertl <le@FreeBSD.org>
To: freebsd@newmillennium.net.au
Cc: freebsd-current@FreeBSD.org
Subject: RE: Gvinum RAID5 performance
Message-ID: <20041107113342.K570@korben.prv.univie.ac.at>
In-Reply-To: <00a701c4c466$01acd9f0$0201000a@riker>
References: <00a701c4c466$01acd9f0$0201000a@riker>

On Sun, 7 Nov 2004 freebsd@newmillennium.net.au wrote:
> In geom_vinum_plex.c, line 575
>
> /*
>  * RAID5 sub-requests need to come in correct order, otherwise
>  * we trip over the parity, as it might be overwritten by
>  * another sub-request.
>  */
> if (pbp->bio_driver1 != NULL &&
>     gv_stripe_active(p, pbp)) {
> 	/* Park the bio on the waiting queue. */
> 	pbp->bio_cflags |= GV_BIO_ONHOLD;
> 	bq = g_malloc(sizeof(*bq), M_WAITOK | M_ZERO);
> 	bq->bp = pbp;
> 	mtx_lock(&p->bqueue_mtx);
> 	TAILQ_INSERT_TAIL(&p->wqueue, bq, queue);
> 	mtx_unlock(&p->bqueue_mtx);
> }
>
> It seems we are holding back all requests to a currently active stripe,
> even if it is just a read and would never write anything back.
No, only writes are held back. pbp->bio_driver1 is NULL when it's a
normal read.
> 1. To calculate parity, we could simply read the old data (that was
> about to be overwritten), and the old parity, and recalculate the parity
> based on that information, rather than reading in all the stripes (based
> on the assumption that the original parity was correct). This would
> still take approximately the same amount of time, but would leave the
> other disks in the stripe available for other I/O.
That's how it's already done: the old parity and old data are read, then
the new parity and new data are written.
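
For readers following along, the read-modify-write parity update described
here comes down to one XOR identity: new parity = old parity XOR old data
XOR new data. Below is a minimal sketch of that arithmetic only. The
function name raid5_rmw_parity and its signature are made up for
illustration and are not taken from geom_vinum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not gvinum code): update a RAID5 parity block in
 * place for a small write, given the old and new contents of the one
 * data block being overwritten.
 *
 *   new_parity = old_parity XOR old_data XOR new_data
 *
 * The other data blocks of the stripe are never read.
 */
void
raid5_rmw_parity(uint8_t *parity, const uint8_t *old_data,
    const uint8_t *new_data, size_t len)
{
	size_t i;

	assert(parity != NULL && old_data != NULL && new_data != NULL);
	for (i = 0; i < len; i++)
		parity[i] ^= old_data[i] ^ new_data[i];
}
```

XORing the old data out of the parity and the new data in gives the same
result as recomputing the parity over the whole stripe, provided the old
parity was correct to begin with.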
> 2. If there are two or more writes pending for the same stripe (that is,
> up to the point that the data|parity has been written), they should be
> condensed into a single operation so that there is a single write to the
> parity, rather than one write for each operation. This way, we should be
> able to get close to (N -1) * disk throughput for large sequential
> writes, without compromising the integrity of the parity on disk.
>
> 3. When calculating parity as per (2), we should operate on whole blocks
> (as defined by the underlying device). This provides the benefit of
> being able to write a complete block to the subdisk, so the underlying
> mechanism does not have to do a read/update/write operation to write a
> partial block.
These are interesting ideas, and I'm going to think about them.
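
As a rough illustration of suggestion (2): if pending writes cover every
data block of a stripe, the parity can be computed from the new data alone
and written once, with no reads of old data or old parity. The sketch
below shows only that parity calculation. The name
raid5_full_stripe_parity and its interface are hypothetical, and nothing
in this thread says gvinum works this way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not gvinum code): compute the parity block for a
 * full-stripe write as the XOR of all ndata new data blocks, each len
 * bytes long. One parity write then covers every coalesced data write.
 */
void
raid5_full_stripe_parity(uint8_t *parity, const uint8_t *const *data,
    size_t ndata, size_t len)
{
	size_t d, i;

	assert(parity != NULL && data != NULL && ndata > 0);
	memset(parity, 0, len);
	for (d = 0; d < ndata; d++)
		for (i = 0; i < len; i++)
			parity[i] ^= data[d][i];
}
```

With N disks in the plex, ndata would be N - 1 here, which is where the
"(N - 1) * disk throughput" estimate for large sequential writes comes
from.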
thanks,
le
--
Lukas Ertl http://homepage.univie.ac.at/l.ertl/
le@FreeBSD.org http://people.freebsd.org/~le/
