From: Florian Smeets <flo@FreeBSD.org>
Date: Sun, 26 Feb 2012 15:22:04 +0100
To: Attilio Rao
Cc: Konstantin Belousov, arch@FreeBSD.org, Pawel Jakub Dawidek
Subject: Re: Prefaulting for i/o buffers
Message-ID: <4F4A400C.1030606@FreeBSD.org>
List-Id: Discussion related to FreeBSD architecture

On 26.02.12 15:16, Attilio Rao wrote:
> On 26 February 2012 14:13, Konstantin Belousov wrote:
>> On Sun, Feb 26, 2012 at 03:02:54PM +0100,
>> Attilio Rao wrote:
>>> On 25 February 2012 22:03, Konstantin Belousov wrote:
>>>> On Sat, Feb 25, 2012 at 06:45:00PM +0100, Attilio Rao wrote:
>>>>> On 25 February 2012 16:13, Pawel Jakub Dawidek wrote:
>>>>>> On Sat, Feb 25, 2012 at 01:01:32PM +0000, Attilio Rao wrote:
>>>>>>> On 3 February 2012 19:37, Konstantin Belousov wrote:
>>>>>>>> The FreeBSD I/O infrastructure has a well-known deadlock caused
>>>>>>>> by vnode lock order reversal when buffers supplied to the read(2) or
>>>>>>>> write(2) syscalls are backed by an mmapped file.
>>>>>>>>
>>>>>>>> I previously published patches to convert the i/o path to use VMIO,
>>>>>>>> based on the Jeff Roberson proposal, see
>>>>>>>> http://wiki.freebsd.org/VM6. As a side effect, VM6 fixed the
>>>>>>>> deadlock. Since that work is very intrusive and did not get any
>>>>>>>> follow-up, it stalled.
>>>>>>>>
>>>>>>>> Below is a very lightweight patch whose only goal is to fix the
>>>>>>>> deadlock in the least intrusive way. This is possible now that
>>>>>>>> FreeBSD has the vm_fault_quick_hold_pages(9) and
>>>>>>>> vm_fault_disable_pagefaults(9) KPIs.
>>>>>>>> http://people.freebsd.org/~kib/misc/vm1.3.patch
>>>>>>>
>>>>>>> Hi,
>>>>>>> I was reviewing:
>>>>>>> http://people.freebsd.org/~kib/misc/vm1.11.patch
>>>>>>>
>>>>>>> and I think it is great. It is simple enough and I don't have further
>>>>>>> comments on it.
>>>> Thank you.
>>>>
>>>> This spoiled an announcement I intended to send this weekend :)
>>>>
>>>>>>>
>>>>>>> However, as a side note, I was wondering whether we could one day get
>>>>>>> to the point of integrating rangelocks into the vnode lockmgr directly.
>>>>>>> It would be a huge patch, likely rewriting the locking of several
>>>>>>> members of vnodes, but I think it would be worth it in terms of
>>>>>>> cleanliness of the interface and less overhead. Also, it would be
>>>>>>> interesting to consider merging the rangelock implementation with
>>>>>>> ZFS' one, at some point.
>>>>>>
>>>>>> My personal opinion about rangelocks and many other VFS features we
>>>>>> currently have is that they are a good idea in theory, but in practice
>>>>>> they tend to overcomplicate VFS.
>>>>>>
>>>>>> I'm of the opinion that we should move as much stuff as we can to
>>>>>> individual file systems. We try to implement everything in VFS itself
>>>>>> in the hope that this will simplify the file systems we have. It then
>>>>>> turns out only one file system is really using this stuff (most of the
>>>>>> time it is UFS) and this is a PITA for all the other file systems as
>>>>>> well as for maintaining VFS. VFS became so complicated over the years
>>>>>> that there are maybe a few people who can understand it, and every
>>>>>> single change to VFS is a huge risk of potentially breaking some
>>>>>> unrelated parts.
>>>>>
>>>>> I think this is questionable for the following reasons:
>>>>> - If the problem is filesystem writers having trouble understanding
>>>>> the necessary locking, we should really provide cleaner and more
>>>>> complete documentation. One could say the same about our VM subsystem,
>>>>> but at least in that case there are plenty of comments that help in
>>>>> understanding how to deal with vm_object and vm_page locking during
>>>>> their lifetimes.
>>>>> - Our primitives may be more complicated than the
>>>>> 'all-in-the-filesystem' ones, but at least they offer a complete and
>>>>> centralized view of the resources we have allocated in the whole
>>>>> system, and they allow building better policies about how to manage
>>>>> them. One problem I see here is that those policies are not fully
>>>>> implemented, tuned, or have simply become outdated, removing one of the
>>>>> highest benefits that we get by making vnodes so generic.
>>>>>
>>>>> About the thing I mentioned myself:
>>>>> - As long as the same path now has both range-locking and vnode
>>>>> locking, I don't see keeping them separated forever as a good idea.
>>>>> Merging them seems to me an important evolution, not only helping to
>>>>> shrink the number of primitives themselves but also introducing less
>>>>> overhead and likely revamped scalability for vnodes (but I think this
>>>>> needs a deep investigation).
>>>> The proper direction to move in there is to designate the vnode lock
>>>> for protection of the vnode structure, and have the range lock protect
>>>> i/o atomicity. This is somewhat done in the proposed patch (since the
>>>> vnode lock now does not protect the whole i/o operation, but only the
>>>> chunked i/o transactions inside the operation).
>>>>
>>>> Jeff's idea of using the page cache as the source of i/o data
>>>> (implemented in the VM6 patchset) pushes the idea much further. E.g.,
>>>> a write typically does not obtain the write vnode lock (but sometimes
>>>> it has to, to extend the vnode).
>>>>
>>>> Probably, I will revive VM6 after this change has landed.
>>>
>>> About that I guess we should be careful.
>>> The first thing would be having a very scalable VM subsystem, and
>>> recent benchmarks have shown that this is not yet the case (Florian,
>>> CC'ed, can share some pmc/LOCK_PROFILE analysis on pgsql that, even
>>> with the vmcontention patch, shows a lot of contention on the vm_object,
>>> pmap and vm_page_queue locks. We have plans for each of them; we can
>>> discuss them in a separate thread if you prefer). This is just to say
>>> that we may need more work in underlying areas to bring VM6 to the
>>> point where it will really make a difference.
>>
>> The benchmarks that were done at that time demonstrated that VM6 does
>> not cause regressions for e.g. buildworld time, and shows marginal
>> improvements, around 10%, for some postgresql loads.
>>
>> The main benefit of VM6 on UFS is that writers no longer block readers
>> for separate i/o ranges. Also, due to the vm_page flags locking
>> improvements, I suspect the VM6 backpressure code might be simplified
>> and give an even larger benefit right now.
>>
>> Anyway, I do not think that VM6 can be put into HEAD quickly, and I
>> want to finish with VM1/prefaulting right now.
>
> I was speaking about a different benchmark.
> Florian made a lock_profile/hwpmc analysis of stock + the vmcontention
> patch to verify where the biggest bottlenecks are.
> Of course, it turns out that the most contended locks are all the ones
> involved in VM, which is not surprising at all.
>
> He can share numbers and insights, I guess.

All I did until now was run PostgreSQL with 128 client threads with
lock_profiling [1] and hwpmc [2]. I haven't spent any time analyzing
this yet.

[1] http://people.freebsd.org/~flo/vmc-lock-profiling-postgres-128-20120208.txt
[2] http://people.freebsd.org/~flo/vmc-hwpmc-gprof-postgres-128-20120208.txt