From owner-freebsd-current Tue Dec 11 21:52:26 2001
Date: Wed, 12 Dec 2001 16:22:05 +1030
From: Greg Lehey
To: Bernd Walter
Cc: Matthew Dillon, Wilko Bulte, Mike Smith, Terry Lambert, Joerg Wunsch,
    freebsd-current@FreeBSD.ORG
Subject: Vinum write performance (was: RAID performance (was: cvs commit:
    src/sys/kern subr_diskmbr.c))
Message-ID: <20011212162205.I82733@monorchid.lemis.com>
In-Reply-To: <20011211031120.G11774@cicely8.cicely.de>
Organization: The FreeBSD Project
Phone: +61-8-8388-8286
Fax: +61-8-8388-8725
Mobile: +61-418-838-708
WWW-Home-Page: http://www.FreeBSD.org/

On Tuesday, 11 December 2001 at 3:11:21 +0100, Bernd Walter wrote:
> On Tue, Dec 11, 2001 at 11:06:33AM +1030, Greg Lehey wrote:
>> On Monday, 10 December 2001 at 10:30:04 -0800, Matthew Dillon wrote:
>>>
>>>>> performance without it - for reading OR writing.  It doesn't matter
>>>>> so much for RAID{1,10}, but it matters a whole lot for something like
>>>>> RAID-5, where the difference between a spindle-synced read or write
>>>>> and a non-spindle-synced read or write can be upwards of 35%.
>>>>
>>>> If you have RAID-5 with I/O sizes that result in full-stripe operations.
>>>
>>> Well, 'more than one disk' operations anyway, for random I/O.  Caching
>>> takes care of sequential I/O reasonably well, but random I/O goes down
>>> the drain for writes if you aren't spindle synced, no matter what
>>> the stripe size,
>>
>> Can you explain this?  I don't see it.  In FreeBSD, just about all I/O
>> goes to buffer cache.
>
> After waiting for the drives and not for vinum parity blocks.
>
>>> and will go down the drain for reads if you cross a stripe -
>>> something that is quite common, I think.
>>
>> I think this is what Mike was referring to when talking about parity
>> calculation.  In any case, going across a stripe boundary is not a
>> good idea, though of course it can't be avoided.  That's one of the
>> arguments for large stripes.
>
> Striped:
> If you have 512 byte stripes and 2 disks, an access of 64 kB is put
> into two 32 kB transactions, one onto each disk.

Only if your software optimizes the transfers.  There are reasons why
it should not.  Without optimization, you get 128 individual
transfers.

> The wait time for the complete transaction is the worst of both,
> which is more than the average of a single disk.

Agreed.

> With spindle synchronisation the access times for both disks are
> believed to be identical, and you get the same as with a single disk.

Correct.
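To make the arithmetic in that exchange concrete, here is a minimal
sketch, in C, of how a 64 kB request maps onto a two-disk stripe set
with 512-byte stripes.  This is not Vinum source; the layout (chunk i
goes to disk i % 2 at offset (i / 2) * 512) and all names are
illustrative assumptions.  Issued naively, the request becomes 128
separate transfers; because each disk's chunks are contiguous on that
disk, they can be coalesced into one 32 kB transfer per disk.

/*
 * Sketch only: map a striped request onto per-disk transfers and
 * count how many are needed with and without coalescing.
 */
#include <stdio.h>

#define STRIPE_SIZE     512             /* bytes per stripe unit (assumed) */
#define NDISKS          2
#define REQUEST_SIZE    (64 * 1024)

int
main(void)
{
        int nchunks = REQUEST_SIZE / STRIPE_SIZE;       /* 128 */
        int transfers[NDISKS] = { 0 };
        long next_offset[NDISKS];
        int disk, i;

        for (disk = 0; disk < NDISKS; disk++)
                next_offset[disk] = -1;                 /* no transfer open yet */

        for (i = 0; i < nchunks; i++) {
                long offset;

                disk = i % NDISKS;                      /* which spindle */
                offset = (long)(i / NDISKS) * STRIPE_SIZE;  /* where on it */
                if (offset != next_offset[disk])
                        transfers[disk]++;              /* not contiguous: new transfer */
                next_offset[disk] = offset + STRIPE_SIZE;
        }

        printf("unoptimized: %d transfers of %d bytes\n", nchunks, STRIPE_SIZE);
        for (disk = 0; disk < NDISKS; disk++)
                printf("coalesced:   disk %d: %d transfer(s), %d bytes total\n",
                    disk, transfers[disk], nchunks / NDISKS * STRIPE_SIZE);
        return (0);
}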
> Linear speed could be about twice the speed of a single drive.  But
> this is more theoretical today than real.  The average transaction
> size per disk decreases with a growing number of spindles, and you
> get more transaction overhead.  Also, the voice coil technology used
> in drives for many years now adds a random amount of time to the
> access time, which invalidates some of the spindle sync potential.
> Plus it may break some benefits of precaching mechanisms in drives.
> I'm almost sure there is no real performance gain with modern drives.

The real problem with this scenario is that you're missing a couple
of points:

1.  Typically it's not the latency that matters.  If you have to wait
    a few ms longer, that's not important.  What's interesting is the
    case of a heavily loaded system, where the throughput is much more
    important than the latency.

2.  Throughput is the data transferred per unit time.  There's active
    transfer time, nowadays in the order of 500 µs, and positioning
    time, in the order of 6 ms.  Clearly, the fewer positioning
    operations, the better.  This means that you should want to put
    most transfers on a single spindle, not a single stripe.  To do
    this, you need big stripes.

> raid5:
> For a write you have two read transactions and two writes.

This is the way Vinum does it (a sketch of this read-modify-write
appears at the end of this message).  There are other possibilities:

1.  Always do full-stripe writes.  Then you don't need to read the old
    contents.

2.  Cache the parity blocks.  This is an optimization which I think
    would be very valuable, but which Vinum doesn't currently perform.

> There are easier ways to raise performance.
> Ever wondered why people claim Vinum's RAID-5 writes are slow?
> The answer is astonishingly simple:
> Vinum does stripe-based locking, while UFS tries to lay out data
> mostly in ascending sectors.
> What happens here is that the first write has to wait for two reads
> and two writes.
> If we have an ascending write, it has to wait for the first write to
> finish, because the stripe is still locked.
> The first is unlocked only after both physical writes are on disk.
> Now we start our two reads, which are (thanks to the drive's
> precache) most likely in the drive's cache - then we write.
>
> The problem here is that the physical writes get serialized, and the
> drive has to wait a complete rotation between each.

Not if the data is in the drive cache.

> If we had fine-grained locking which locked only the accessed sectors
> in the parity, we would be able to have more than a single ascending
> write transaction onto a single drive.

Hmm.  This is something I hadn't thought about.  Note that sequential
writes to a RAID-5 volume don't go to sequential addresses on the
spindles; they work up to the end of the stripe on one spindle, then
start on the next spindle at the start of the stripe.

You can do that as long as the address ranges in the parity block
don't overlap, but the larger the stripe, the greater the likelihood
of an overlap.  This might also explain the following observed
behaviour:

1.  RAID-5 writes slow down when the stripe size gets above 256 kB or
    so.  I don't know if this happens on all disks, but I've seen it
    often enough.

2.  rawio write performance is better than ufs write performance.
    rawio does "truly" random transfers, where ufs is a mixture.

Do you feel like changing the locking code?  It shouldn't be that much
work, and I'd be interested to see how much performance difference it
makes.

Note that there's another possible optimization here: delay the writes
by a certain period of time and coalesce them if possible.  I haven't
finished thinking about the implications.
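To ground the "two read transactions and two writes" above, here is a
minimal, self-contained sketch, in C, of the read-modify-write a
RAID-5 small write implies: read the old data and the old parity,
compute new parity = old parity XOR old data XOR new data, then write
the new data and the new parity.  This is not Vinum source; the
"disks" are in-memory arrays, and the block size and disk count are
arbitrary assumptions chosen so the example runs and checks itself.
A full-stripe write avoids the two reads because the parity can be
computed from the new data alone, and a parity cache would avoid the
parity read; the stripe-based locking Bernd describes serializes
concurrent writes into the same stripe around exactly this sequence.

#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NDISKS          3               /* two data disks plus one parity (assumed) */
#define BLOCK_SIZE      512             /* bytes per block (assumed) */

static uint8_t disk[NDISKS][BLOCK_SIZE];        /* simulated spindles */

static void
raid5_small_write(int data_disk, int parity_disk, const uint8_t *new_data)
{
        uint8_t old_data[BLOCK_SIZE], parity[BLOCK_SIZE];
        int i;

        memcpy(old_data, disk[data_disk], BLOCK_SIZE);  /* read 1: old data */
        memcpy(parity, disk[parity_disk], BLOCK_SIZE);  /* read 2: old parity */

        for (i = 0; i < BLOCK_SIZE; i++)                /* new parity */
                parity[i] ^= old_data[i] ^ new_data[i];

        memcpy(disk[data_disk], new_data, BLOCK_SIZE);  /* write 1: new data */
        memcpy(disk[parity_disk], parity, BLOCK_SIZE);  /* write 2: new parity */
}

int
main(void)
{
        uint8_t block[BLOCK_SIZE];
        int i;

        memset(block, 0xaa, BLOCK_SIZE);
        raid5_small_write(0, 2, block);         /* small write to data disk 0 */
        memset(block, 0x55, BLOCK_SIZE);
        raid5_small_write(1, 2, block);         /* small write to data disk 1 */

        /* The parity must equal the XOR of the data blocks. */
        for (i = 0; i < BLOCK_SIZE; i++)
                assert(disk[2][i] == (disk[0][i] ^ disk[1][i]));
        return (0);
}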
Greg
--
See complete headers for address and phone numbers

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message