From: "Alastair D'Silva" <alastair@newmillennium.net.au>
To: "'Greg 'groggy' Lehey'", "'Lukas Ertl'"
Cc: freebsd-current@FreeBSD.org
Date: Sun, 7 Nov 2004 15:28:23 +1100
Subject: RE: Gvinum RAID5 performance

> If this is correct, this is a very different strategy from Vinum.  If
> we're talking about the corresponding code, Vinum locks the stripe
> and serializes access to that stripe only.  For normal sized volumes,
> this means almost no clashes.

My point is that access to the stripe should only be serialized under
certain conditions.  In my case, the ability to stream large files
to/from the RAID5 volume is hampered by this situation.

This is my understanding of what is happening now:

1. The userland app requests a read of a big chunk of data.
2. Somewhere in the OS, the request is broken into smaller chunks and
   issued to GVinum.
3. GVinum queues all requests issued to it.
4. GVinum's worker walks the queue and starts processing the first
   request (blocking subsequent requests, since they're likely to be
   in the same stripe).
5. The first request is retired, and only then is the next request
   processed.

If we follow the logic I outlined in the previous mail, we should
instead have something like this:

1. The userland app requests a read of a big chunk of data.
2. Somewhere in the OS, the request is broken into smaller chunks and
   issued to GVinum.
3. GVinum queues all requests issued to it.
4. GVinum's worker walks the queue and starts processing the first
   request.
5. GVinum checks the next request, sees that it's a read and that only
   other reads are pending for that stripe, and issues it as well
   (repeat for the rest of the queue).
6. The read requests are retired (in no particular order, but the code
   that split the original request into smaller ones should handle
   that).

In the first scenario, a large sequential read is first broken into
smaller chunks, and each chunk is processed sequentially.  In the
second scenario, the smaller chunks are processed in parallel, so all
the drives in the array are worked simultaneously.
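In rough code terms, the worker pass could look something like the
sketch below.  This is only an illustration of the idea, not the
gvinum source; the types and helpers (stripe_of(), stripe_busy(),
write_pending(), lock_stripe(), issue()) are invented for the example,
and the bookkeeping that retires requests is omitted.

    /*
     * Sketch only -- not gvinum code.  All helpers are hypothetical
     * and exist just to show the scheduling rule.
     */
    #include <sys/types.h>
    #include <sys/queue.h>

    struct request {
        TAILQ_ENTRY(request) q_link;
        int   iswrite;          /* 0 = read, 1 = write */
        off_t offset;           /* byte offset into the plex */
    };
    TAILQ_HEAD(reqqueue, request);

    extern int  stripe_of(off_t offset);    /* offset -> stripe number */
    extern int  stripe_busy(int stripe);    /* any request in flight?  */
    extern int  write_pending(int stripe);  /* write queued/in flight? */
    extern void lock_stripe(int stripe);    /* exclusive, for writes   */
    extern void issue(struct request *rq);  /* hand off to the subdisks;
                                               the request stays queued
                                               until it is retired     */

    /*
     * Walk the whole queue instead of stopping at the first busy
     * stripe.  Reads never touch the parity, so any number of them
     * can be outstanding on one stripe at once; only a write (which
     * must update the parity) still takes the stripe exclusively.
     */
    static void
    gv_worker_pass(struct reqqueue *queue)
    {
        struct request *rq;
        int stripe;

        TAILQ_FOREACH(rq, queue, q_link) {
            stripe = stripe_of(rq->offset);
            if (rq->iswrite) {
                if (stripe_busy(stripe))
                    continue;           /* writes stay serialized */
                lock_stripe(stripe);
            } else if (write_pending(stripe)) {
                continue;               /* a write owns this stripe */
            }
            issue(rq);  /* requests to different disks run in parallel */
        }
    }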
> > 3. When calculating parity as per (2), we should operate on whole
> > blocks (as defined by the underlying device).  This provides the
> > benefit of being able to write a complete block to the subdisk, so
> > the underlying mechanism does not have to do a read/update/write
> > operation to write a partial block.
>
> I'm not sure what you're saying here.  If it's a repeat of my last
> sentence, yes, but only sometimes.  With a stripe size in the order
> of 300 kB, you're talking 1 or 2 MB per "block" (i.e. stripe across
> all disks).  That kind of write doesn't happen very often.  At the
> other end, all disks support a "block" (or "sector") of 512 B, and
> that's the granularity of the system.

I'm referring here to the underlying blocks of the block device
itself.  My understanding of block devices is that they cannot operate
on a part of a block - they must read/write the whole block in one
operation.

If we were to write only the data to be updated, the low-level driver
(or maybe the drive itself, depending on the hardware and
implementation) must first read the block into a buffer, update the
relevant part of the buffer, then write the result back out.  Since we
had to read the whole block to begin with, we could use this
information to construct the whole block to be written, so the
underlying driver would not have to do the read operation.

-- 
Alastair D'Silva           mob: 0413 485 733
Networking Consultant      fax: 0413 181 661
New Millennium Networking  web: http://www.newmillennium.net.au
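PS: To illustrate the read/update/write cycle I mean, here is a rough
userland sketch (not driver code; the 512 byte block size and the
function names are made up for the example):

    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLKSIZE 512     /* stand-in for the device block size */

    /*
     * What the layer below has to do when it is handed less than a
     * whole block: read the old block, patch it, write it back.
     */
    static void
    partial_block_write(int fd, off_t blkno, const char *data,
        size_t off, size_t len)
    {
        char buf[BLKSIZE];

        pread(fd, buf, BLKSIZE, blkno * BLKSIZE);   /* read   */
        memcpy(buf + off, data, len);               /* update */
        pwrite(fd, buf, BLKSIZE, blkno * BLKSIZE);  /* write  */
    }

    /*
     * The RAID5 code has already read the old block to recompute the
     * parity, so it can merge the new data itself and hand down a
     * complete block; the read above then disappears.
     */
    static void
    full_block_write(int fd, off_t blkno, const char *oldblk,
        const char *data, size_t off, size_t len)
    {
        char buf[BLKSIZE];

        memcpy(buf, oldblk, BLKSIZE);               /* already in memory */
        memcpy(buf + off, data, len);
        pwrite(fd, buf, BLKSIZE, blkno * BLKSIZE);  /* one I/O, no read  */
    }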