From owner-freebsd-current@FreeBSD.ORG Thu Dec 3 12:44:49 2009
From: Attilio Rao
To: Alexander Motin
Cc: FreeBSD-Current, Ivan Voras
Date: Thu, 3 Dec 2009 13:44:47 +0100
Message-ID: <3bbf2fe10912030444w82707e3l30e2245c2ba64daa@mail.gmail.com>
In-Reply-To: <4B170FCB.3030102@FreeBSD.org>
Subject: Re: NCQ vs UFS/ZFS benchmark [Was: Re: FreeBSD 8.0 Performance (at Phoronix)]

2009/12/3 Alexander Motin:
> Ivan Voras wrote:
>> If you have a drive to play with, could you also check UFS vs ZFS on
>> both ATA & AHCI? To try and see if the IO scheduling of ZFS plays nicely.
>>
>> For benchmarks I suggest blogbench and bonnie++ (in ports) and, if you
>> want to bother, randomio, http://arctic.org/~dean/randomio .
>
> I have looked at randomio and found that it is also tuned to test a
> physical drive, and it does almost the same as raidtest. The main
> difference is that raidtest uses pre-generated test patterns, so its
> results are much more repeatable. What bonnie++ does is another
> question; I prefer to trust results which I can explain.
>
> So I have spent several hours to quickly compare UFS and ZFS in several
> scenarios, using the ata(4) and ahci(4) drivers. It is not a strict
> research, but I have checked every digit at least twice, and some
> unexpected or deviating ones even more.
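[For readers who have not used raidtest, the idea of a pre-generated test
pattern is roughly the following C sketch. It is illustrative only, not
raidtest's actual code; the file path and constants are invented. Building
the request list once with a fixed seed means every run replays exactly the
same offsets, sizes and directions, which is what makes the results
repeatable.]

/*
 * Illustrative sketch only, not raidtest's code: build a fixed request
 * list once, then replay the same pattern on every run.
 */
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct req {
	off_t	off;
	size_t	len;
	int	is_write;
};

int
main(void)
{
	const off_t file_sz = 20LL * 1024 * 1024 * 1024; /* 20GB test file */
	const int nreq = 10000;
	struct req *p = malloc(nreq * sizeof(*p));
	char *buf = calloc(1, 128 * 1024);		 /* max request size */
	int fd, i;

	if (p == NULL || buf == NULL)
		return (1);
	srandom(12345);			/* fixed seed => repeatable pattern */
	for (i = 0; i < nreq; i++) {
		p[i].len = 512 + random() % (128 * 1024 - 512);
		p[i].off = (off_t)(random() %
		    ((file_sz - 128 * 1024) / 512)) * 512;
		p[i].is_write = random() & 1;
	}

	fd = open("/mnt/test/bigfile", O_RDWR);		/* hypothetical path */
	if (fd < 0) {
		perror("open");
		return (1);
	}
	for (i = 0; i < nreq; i++) {
		if (p[i].is_write)
			pwrite(fd, buf, p[i].len, p[i].off);
		else
			pread(fd, buf, p[i].len, p[i].off);
	}
	close(fd);
	return (0);
}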
> I have pre-written a 20GB file on the empty file systems and used
> raidtest to generate a random mix of 10000 read/write requests of
> random size (512B - 128KB) to those files. Every single run took about
> a minute; the total transfer size per run was about 600MB. I have used
> the same request pattern in all tests.
>
> Test 1: raidtest with O_DIRECT flag (default) on UFS file system:
> ata(4), 1 process               tps: 70
> ata(4), 32 processes            tps: 71
> ahci(4), 1 process              tps: 72
> ahci(4), 32 processes           tps: 81
>
> gstat showed that most of the time only one request at a time was
> running on the disk. It looks like read or read-modify-write operations
> (due to the many short writes in the test pattern) are heavily
> serialized in UFS, even when several processes are working with the
> same file. This almost eliminated the effect of NCQ in this test.
>
> Test 2: Same as before, but without the O_DIRECT flag:
> ata(4), 1 process, first        tps: 78
> ata(4), 1 process, second       tps: 469
> ata(4), 32 processes, first     tps: 83
> ata(4), 32 processes, second    tps: 475
> ahci(4), 1 process, first       tps: 79
> ahci(4), 1 process, second      tps: 476
> ahci(4), 32 processes, first    tps: 93
> ahci(4), 32 processes, second   tps: 488
>
> Without the O_DIRECT flag UFS was able to fit all accessed information
> into the buffer cache on the second run. The second run serves all
> reads from the buffer cache and writes are not serialized, but the NCQ
> effect is minimal in this situation. The first run is still mostly
> serialized.
>
> Test 3: Same as 2, but with ZFS (i386, without tuning):
> ata(4), 1 process, first        tps: 75
> ata(4), 1 process, second       tps: 73
> ata(4), 32 processes, first     tps: 98
> ata(4), 32 processes, second    tps: 97
> ahci(4), 1 process, first       tps: 77
> ahci(4), 1 process, second      tps: 80
> ahci(4), 32 processes, first    tps: 139
> ahci(4), 32 processes, second   tps: 142
>
> The data doesn't fit into the cache. Multiple parallel requests give
> some effect even with the legacy driver, but with NCQ enabled they give
> much more, almost doubling performance!
>
> Test 4: Same as 3, but with kmem_size=1900M and arc_max=1700M:
> ata(4), 1 process, first        tps: 90
> ata(4), 1 process, second       tps: ~160-300
> ata(4), 32 processes, first     tps: 112
> ata(4), 32 processes, second    tps: ~190-322
> ahci(4), 1 process, first       tps: 90
> ahci(4), 1 process, second      tps: ~140-300
> ahci(4), 32 processes, first    tps: 180
> ahci(4), 32 processes, second   tps: ~280-550
>
> The data is slightly cached on the first run and heavily cached on the
> second. But even such an amount of memory (the maximum I can dedicate
> on my i386) is not enough to cache all the data. The second run gives a
> different device access pattern each time and very random results.
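[The only difference between tests 1 and 2 above is the O_DIRECT flag. A
minimal sketch of that difference follows; the path is invented. With
O_DIRECT every read and write bypasses the buffer cache and hits the disk;
without it a second run over the same data can be served almost entirely
from cache, which is why the second-run numbers jump so much in test 2.]

#include <fcntl.h>

int
open_test_file(int use_odirect)
{
	int flags = O_RDWR;

	if (use_odirect)
		flags |= O_DIRECT;	/* test 1: uncached, disk every time */
	return (open("/mnt/test/bigfile", flags));	/* test 2: buffered */
}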
> Test 5: Same as 3, but with 2 disks:
> ata(4), 1 process, first        tps: 80
> ata(4), 1 process, second       tps: 79
> ata(4), 32 processes, first     tps: 186
> ata(4), 32 processes, second    tps: 181
> ahci(4), 1 process, first       tps: 79
> ahci(4), 1 process, second      tps: 110
> ahci(4), 32 processes, first    tps: 287
> ahci(4), 32 processes, second   tps: 290
>
> The data doesn't fit into the cache. The second disk gives almost no
> improvement for serialized requests. Multiple parallel requests double
> the speed even with the legacy driver, because requests are spread
> between the drives. Adding NCQ support raises the speed significantly
> further.
>
> As a conclusion:
> - In this particular test ZFS scaled well with parallel requests,
> effectively using multiple disks. NCQ showed great benefits. But i386
> constraints significantly limit ZFS's caching abilities.
> - UFS behaved very poorly in this test. Even with a parallel workload
> it often serializes device accesses. Maybe the results would be
> different with a separate file for each process, or with some other
> options, but I think the pattern I have used is also possible in some
> applications. The only benefit UFS showed here is more effective memory
> management on i386, leading to higher cache effectiveness.

I think the problem is that we serialize on the vnode lock for
VOP_READ/VOP_WRITE on the same file. Probably the byte-range locking
that kib is implementing in the VFS (and that ZFS already implements
independently) will lead to better results (and it is not surprising
that ZFS is doing very well on this workload).

I suggest running the test with different files; you may not see this
bottleneck effect on scalability then.

Attilio

-- 
Peace can only be achieved by understanding - A. Einstein
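[As an aside, a minimal sketch of the variation suggested above: one file
per worker process instead of a single shared file, so that parallel
VOP_READ/VOP_WRITE calls do not all contend on the same vnode lock. The
paths, worker count and the elided replay loop are placeholders for
illustration, not an actual test program.]

#include <sys/wait.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define	NWORKERS	32

int
main(void)
{
	char path[64];
	int fd, i;

	for (i = 0; i < NWORKERS; i++) {
		if (fork() == 0) {
			/* private file per worker instead of a shared one */
			snprintf(path, sizeof(path), "/mnt/test/file.%d", i);
			fd = open(path, O_RDWR | O_CREAT, 0644);
			if (fd >= 0) {
				/* ... replay the pre-generated pattern ... */
				close(fd);
			}
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return (0);
}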