From: "Alastair D'Silva" <alastair@newmillennium.net.au>
To: "'Greg 'groggy' Lehey'", "'Lukas Ertl'"
Cc: freebsd-current@FreeBSD.org
Date: Sun, 7 Nov 2004 12:06:26 +1100
Subject: RE: Gvinum RAID5 performance
In-Reply-To: <20041106232320.GI24507@wantadilla.lemis.com>

> -----Original Message-----
> From: Greg 'groggy' Lehey [mailto:grog@FreeBSD.org]
> Sent: Sunday, 7 November 2004 10:23 AM
> To: Lukas Ertl
> Cc: freebsd@newmillennium.net.au; freebsd-current@FreeBSD.org
> Subject: Re: Gvinum RAID5 performance
>
> 1. Too small a stripe size.  If you (our anonymous user, who was
>    using a single dd process) have to perform multiple transfers for
>    a single request, the results will be slower.

I'm using the recommended 279 kB from the man page.

> 2. There may be some overhead in GEOM that slows things down.  If
>    this is the case, something should be done about it.

(Disclaimer: I have only looked at the code, not put in any debugging to
verify the situation. Also, my understanding is that the term "stripe"
refers to the data in a plex which, when read sequentially, results in
all disks being accessed exactly once, i.e. "A(n) B(n) C(n) P(n)",
rather than blocks from a single subdisk, i.e. "A(n)", where (n)
represents a group of contiguous blocks. Please correct me if I am
wrong.)

I can see a potential place for a slowdown here . . .

In geom_vinum_plex.c, line 575:

        /*
         * RAID5 sub-requests need to come in correct order, otherwise
         * we trip over the parity, as it might be overwritten by
         * another sub-request.
         */
        if (pbp->bio_driver1 != NULL && gv_stripe_active(p, pbp)) {
                /* Park the bio on the waiting queue. */
                pbp->bio_cflags |= GV_BIO_ONHOLD;
                bq = g_malloc(sizeof(*bq), M_WAITOK | M_ZERO);
                bq->bp = pbp;
                mtx_lock(&p->bqueue_mtx);
                TAILQ_INSERT_TAIL(&p->wqueue, bq, queue);
                mtx_unlock(&p->bqueue_mtx);
        }

It seems we are holding back all requests to a currently active stripe,
even if a request is just a read and would never write anything back.
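As a rough, untested illustration of what I mean, the check could take
the request type into account before parking the bio. Note that
gv_stripe_active_write() below is an imaginary helper that would report
whether any in-flight request on the stripe is a write; nothing like it
exists in the code today:

        /*
         * Hypothetical sketch only: park the bio only if the collision
         * could actually disturb the parity, i.e. if this request is a
         * write, or if one of the active requests on the stripe is a
         * write.  Concurrent reads would pass straight through.
         */
        if (pbp->bio_driver1 != NULL && gv_stripe_active(p, pbp) &&
            (pbp->bio_cmd == BIO_WRITE || gv_stripe_active_write(p, pbp))) {
                /* Park the bio on the waiting queue, as before. */
                ...
        }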
I think the following conditions should apply:

- If the current transactions on the stripe are reads, and we want to
  issue another read, let it through.

- If the current transactions on the stripe are reads, and we want to
  issue a write, queue it.

- If the current transactions on the stripe are writes, and we want to
  issue another write, queue it (but see below).

- If the current transactions on the stripe are writes, and we want to
  issue a read, queue it if it overlaps the data being written, or if
  the plex is degraded and the request requires the parity to be read;
  otherwise, let it through.

We could also optimize writing a bit by doing the following:

1. To calculate parity, we could simply read the old data (that is
   about to be overwritten) and the old parity, and recalculate the
   parity from those, rather than reading in the rest of the stripe
   (on the assumption that the original parity was correct). This would
   still take approximately the same amount of time, but would leave
   the other disks in the stripe available for other I/O. (A rough
   sketch is in the P.S. below.)

2. If there are two or more writes pending for the same stripe (that
   is, queued before the data and parity have been written), they
   should be condensed into a single operation, so that there is one
   write to the parity rather than one for each request. This way, we
   should be able to get close to (N - 1) times single-disk throughput
   for large sequential writes, without compromising the integrity of
   the parity on disk.

3. When calculating parity as per (2), we should operate on whole
   blocks (as defined by the underlying device). This has the benefit
   that a complete block can be written to the subdisk, so the
   underlying mechanism does not have to do a read/update/write cycle
   to write a partial block.

Comments?

--
Alastair D'Silva              mob: 0413 485 733
Networking Consultant         fax: 0413 181 661
New Millennium Networking     web: http://www.newmillennium.net.au
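P.S. As a rough, untested sketch of the parity recalculation in (1):
for a small write, the new parity can be computed from the old parity,
the old data and the new data alone, so only the target subdisk and the
parity subdisk need to be touched. The function below is purely
illustrative; the name and prototype are made up and it is not gvinum
code:

        #include <stddef.h>

        /*
         * Illustration only: RAID5 read-modify-write parity update.
         * For every byte, new parity = old parity ^ old data ^ new
         * data, so the other subdisks in the stripe stay idle.
         */
        static void
        raid5_update_parity(unsigned char *parity,
            const unsigned char *old_data, const unsigned char *new_data,
            size_t len)
        {
                size_t i;

                for (i = 0; i < len; i++)
                        parity[i] ^= old_data[i] ^ new_data[i];
        }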