From: "Alastair D'Silva" <alastair@newmillennium.net.au>
To: "'Greg 'groggy' Lehey'", "'Lukas Ertl'"
Cc: freebsd-current@FreeBSD.org
Date: Sun, 7 Nov 2004 15:28:23 +1100
Subject: RE: Gvinum RAID5 performance

> If this is correct, this is a very different strategy from Vinum.  If
> we're talking about the corresponding code, Vinum locks the stripe
> and serializes access to that stripe only.  For normal sized volumes,
> this means almost no clashes.

My point is that access to the stripe should only be serialized under
certain conditions.  In my case, the ability to stream large files
to/from the RAID5 volume is hampered by this situation.

This is my understanding of what is happening now:

1. The userland app requests a read of a big chunk of data.
2. Somewhere in the OS, the request is broken into smaller chunks and
   issued to GVinum.
3. GVinum queues all requests issued to it.
4. GVinum's worker walks the queue and starts processing the first
   request (blocking subsequent requests, since they're likely to be
   in the same stripe).
5. The first request is retired, and only then is the next request
   processed.

If we follow the logic I outlined in the previous mail, we should
instead have something like this:

1. The userland app requests a read of a big chunk of data.
2. Somewhere in the OS, the request is broken into smaller chunks and
   issued to GVinum.
3. GVinum queues all requests issued to it.
4. GVinum's worker walks the queue and starts processing the first
   request.
5. GVinum checks the next request, sees that it's a read and that only
   other reads are pending for that stripe, and issues it as well
   (repeat for the rest of the queue).
6. The read requests are retired (in no particular order, but the code
   that split the original request into smaller ones should handle
   that).

In the first scenario, a large sequential read is first broken into
smaller chunks, and each chunk is processed sequentially.  In the
second scenario, the smaller chunks are processed in parallel, so all
the drives in the array are worked simultaneously.
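In rough code terms, the worker pass could look something like the
sketch below.  This is only an illustration of the idea, not the
gvinum source; the types and helpers (stripe_of(), stripe_busy(),
write_pending(), lock_stripe(), issue()) are invented for the example,
and the bookkeeping that retires requests is omitted.

    /*
     * Sketch only -- not gvinum code.  All helpers are hypothetical
     * and exist just to show the scheduling rule.
     */
    #include <sys/types.h>
    #include <sys/queue.h>

    struct request {
        TAILQ_ENTRY(request) q_link;
        int   iswrite;          /* 0 = read, 1 = write */
        off_t offset;           /* byte offset into the plex */
    };
    TAILQ_HEAD(reqqueue, request);

    extern int  stripe_of(off_t offset);    /* offset -> stripe number */
    extern int  stripe_busy(int stripe);    /* any request in flight?  */
    extern int  write_pending(int stripe);  /* write queued/in flight? */
    extern void lock_stripe(int stripe);    /* exclusive, for writes   */
    extern void issue(struct request *rq);  /* hand off to the subdisks;
                                               the request stays queued
                                               until it is retired     */

    /*
     * Walk the whole queue instead of stopping at the first busy
     * stripe.  Reads never touch the parity, so any number of them
     * can be outstanding on one stripe at once; only a write (which
     * must update the parity) still takes the stripe exclusively.
     */
    static void
    gv_worker_pass(struct reqqueue *queue)
    {
        struct request *rq;
        int stripe;

        TAILQ_FOREACH(rq, queue, q_link) {
            stripe = stripe_of(rq->offset);
            if (rq->iswrite) {
                if (stripe_busy(stripe))
                    continue;           /* writes stay serialized */
                lock_stripe(stripe);
            } else if (write_pending(stripe)) {
                continue;               /* a write owns this stripe */
            }
            issue(rq);  /* requests to different disks run in parallel */
        }
    }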
> > 3. When calculating parity as per (2), we should operate on whole
> > blocks (as defined by the underlying device).  This provides the
> > benefit of being able to write a complete block to the subdisk, so
> > the underlying mechanism does not have to do a read/update/write
> > operation to write a partial block.
>
> I'm not sure what you're saying here.  If it's a repeat of my last
> sentence, yes, but only sometimes.  With a stripe size in the order
> of 300 kB, you're talking 1 or 2 MB per "block" (i.e. stripe across
> all disks).  That kind of write doesn't happen very often.  At the
> other end, all disks support a "block" (or "sector") of 512 B, and
> that's the granularity of the system.

I'm referring here to the underlying blocks of the block device
itself.  My understanding of block devices is that they cannot operate
on a part of a block - they must read/write the whole block in one
operation.

If we were to write only the data to be updated, the low-level driver
(or maybe the drive itself, depending on the hardware and
implementation) must first read the block into a buffer, update the
relevant part of the buffer, then write the result back out.  Since we
had to read the whole block to begin with, we could use this
information to construct the whole block to be written, so the
underlying driver would not have to do the read operation.

-- 
Alastair D'Silva           mob: 0413 485 733
Networking Consultant      fax: 0413 181 661
New Millennium Networking  web: http://www.newmillennium.net.au
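PS: To illustrate the read/update/write cycle I mean, here is a rough
userland sketch (not driver code; the 512 byte block size and the
function names are made up for the example):

    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLKSIZE 512     /* stand-in for the device block size */

    /*
     * What the layer below has to do when it is handed less than a
     * whole block: read the old block, patch it, write it back.
     */
    static void
    partial_block_write(int fd, off_t blkno, const char *data,
        size_t off, size_t len)
    {
        char buf[BLKSIZE];

        pread(fd, buf, BLKSIZE, blkno * BLKSIZE);   /* read   */
        memcpy(buf + off, data, len);               /* update */
        pwrite(fd, buf, BLKSIZE, blkno * BLKSIZE);  /* write  */
    }

    /*
     * The RAID5 code has already read the old block to recompute the
     * parity, so it can merge the new data itself and hand down a
     * complete block; the read above then disappears.
     */
    static void
    full_block_write(int fd, off_t blkno, const char *oldblk,
        const char *data, size_t off, size_t len)
    {
        char buf[BLKSIZE];

        memcpy(buf, oldblk, BLKSIZE);               /* already in memory */
        memcpy(buf + off, data, len);
        pwrite(fd, buf, BLKSIZE, blkno * BLKSIZE);  /* one I/O, no read  */
    }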