From: "Alastair D'Silva" <alastair@newmillennium.net.au>
To: "'Greg 'groggy' Lehey'", "'Lukas Ertl'"
Cc: freebsd-current@FreeBSD.org
Date: Sun, 7 Nov 2004 12:06:26 +1100
Subject: RE: Gvinum RAID5 performance
In-Reply-To: <20041106232320.GI24507@wantadilla.lemis.com>

> -----Original Message-----
> From: Greg 'groggy' Lehey [mailto:grog@FreeBSD.org]
> Sent: Sunday, 7 November 2004 10:23 AM
> To: Lukas Ertl
> Cc: freebsd@newmillennium.net.au; freebsd-current@FreeBSD.org
> Subject: Re: Gvinum RAID5 performance
>
> 1. Too small a stripe size.  If you (our anonymous user, who was
>    using a single dd process) have to perform multiple transfers for
>    a single request, the results will be slower.

I'm using the recommended 279 kB from the man page.

> 2. There may be some overhead in GEOM that slows things down.  If
>    this is the case, something should be done about it.

(Disclaimer: I have only looked at the code, not put in any debugging to
verify the situation. Also, my understanding is that the term "stripe"
refers to the data in a plex which, when read sequentially, results in
all disks being accessed exactly once, i.e. "A(n) B(n) C(n) P(n)",
rather than blocks from a single subdisk, i.e. "A(n)", where (n)
represents a group of contiguous blocks. Please correct me if I am
wrong.)

I can see a potential place for a slowdown here . . .

In geom_vinum_plex.c, line 575:

        /*
         * RAID5 sub-requests need to come in correct order, otherwise
         * we trip over the parity, as it might be overwritten by
         * another sub-request.
         */
        if (pbp->bio_driver1 != NULL && gv_stripe_active(p, pbp)) {
                /* Park the bio on the waiting queue. */
                pbp->bio_cflags |= GV_BIO_ONHOLD;
                bq = g_malloc(sizeof(*bq), M_WAITOK | M_ZERO);
                bq->bp = pbp;
                mtx_lock(&p->bqueue_mtx);
                TAILQ_INSERT_TAIL(&p->wqueue, bq, queue);
                mtx_unlock(&p->bqueue_mtx);
        }

It seems we are holding back all requests to a currently active stripe,
even if a request is just a read and would never write anything back.
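As a rough, untested illustration of what I mean, the check could take
the request type into account before parking the bio. Note that
gv_stripe_active_write() below is an imaginary helper that would report
whether any in-flight request on the stripe is a write; nothing like it
exists in the code today:

        /*
         * Hypothetical sketch only: park the bio only if the collision
         * could actually disturb the parity, i.e. if this request is a
         * write, or if one of the active requests on the stripe is a
         * write.  Concurrent reads would pass straight through.
         */
        if (pbp->bio_driver1 != NULL && gv_stripe_active(p, pbp) &&
            (pbp->bio_cmd == BIO_WRITE || gv_stripe_active_write(p, pbp))) {
                /* Park the bio on the waiting queue, as before. */
                ...
        }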
I think the following conditions should apply:

- If the current transactions on the stripe are reads, and we want to
  issue another read, let it through.

- If the current transactions on the stripe are reads, and we want to
  issue a write, queue it.

- If the current transactions on the stripe are writes, and we want to
  issue another write, queue it (but see below).

- If the current transactions on the stripe are writes, and we want to
  issue a read, queue it if it overlaps the data being written, or if
  the plex is degraded and the request requires the parity to be read;
  otherwise, let it through.

We could also optimize writing a bit by doing the following:

1. To calculate parity, we could simply read the old data (that is
   about to be overwritten) and the old parity, and recalculate the
   parity from those, rather than reading in the rest of the stripe
   (on the assumption that the original parity was correct). This would
   still take approximately the same amount of time, but would leave
   the other disks in the stripe available for other I/O. (A rough
   sketch is in the P.S. below.)

2. If there are two or more writes pending for the same stripe (that
   is, queued before the data and parity have been written), they
   should be condensed into a single operation, so that there is one
   write to the parity rather than one for each request. This way, we
   should be able to get close to (N - 1) times single-disk throughput
   for large sequential writes, without compromising the integrity of
   the parity on disk.

3. When calculating parity as per (2), we should operate on whole
   blocks (as defined by the underlying device). This has the benefit
   that a complete block can be written to the subdisk, so the
   underlying mechanism does not have to do a read/update/write cycle
   to write a partial block.

Comments?

--
Alastair D'Silva              mob: 0413 485 733
Networking Consultant         fax: 0413 181 661
New Millennium Networking     web: http://www.newmillennium.net.au
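P.S. As a rough, untested sketch of the parity recalculation in (1):
for a small write, the new parity can be computed from the old parity,
the old data and the new data alone, so only the target subdisk and the
parity subdisk need to be touched. The function below is purely
illustrative; the name and prototype are made up and it is not gvinum
code:

        #include <stddef.h>

        /*
         * Illustration only: RAID5 read-modify-write parity update.
         * For every byte, new parity = old parity ^ old data ^ new
         * data, so the other subdisks in the stripe stay idle.
         */
        static void
        raid5_update_parity(unsigned char *parity,
            const unsigned char *old_data, const unsigned char *new_data,
            size_t len)
        {
                size_t i;

                for (i = 0; i < len; i++)
                        parity[i] ^= old_data[i] ^ new_data[i];
        }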