From: Florian Smeets <flo@FreeBSD.org>
Date: Sun, 26 Feb 2012 15:22:04 +0100
To: Attilio Rao
Cc: Konstantin Belousov, arch@FreeBSD.org, Pawel Jakub Dawidek
Subject: Re: Prefaulting for i/o buffers
Message-ID: <4F4A400C.1030606@FreeBSD.org>
List-Id: Discussion related to FreeBSD architecture

On 26.02.12 15:16, Attilio Rao wrote:
> On 26 February 2012 14:13, Konstantin Belousov wrote:
>> On Sun, Feb 26, 2012 at 03:02:54PM +0100,
>> Attilio Rao wrote:
>>> On 25 February 2012 22:03, Konstantin Belousov wrote:
>>>> On Sat, Feb 25, 2012 at 06:45:00PM +0100, Attilio Rao wrote:
>>>>> On 25 February 2012 16:13, Pawel Jakub Dawidek wrote:
>>>>>> On Sat, Feb 25, 2012 at 01:01:32PM +0000, Attilio Rao wrote:
>>>>>>> On 3 February 2012 19:37, Konstantin Belousov wrote:
>>>>>>>> The FreeBSD I/O infrastructure has a well-known deadlock caused
>>>>>>>> by vnode lock order reversal when buffers supplied to the read(2) or
>>>>>>>> write(2) syscalls are backed by an mmapped file.
>>>>>>>>
>>>>>>>> I previously published patches to convert the i/o path to use VMIO,
>>>>>>>> based on the Jeff Roberson proposal, see
>>>>>>>> http://wiki.freebsd.org/VM6. As a side effect, VM6 fixed the
>>>>>>>> deadlock. Since that work is very intrusive and did not get any
>>>>>>>> follow-up, it stalled.
>>>>>>>>
>>>>>>>> Below is a very lightweight patch whose only goal is to fix the
>>>>>>>> deadlock in the least intrusive way. This is possible now that
>>>>>>>> FreeBSD has the vm_fault_quick_hold_pages(9) and
>>>>>>>> vm_fault_disable_pagefaults(9) KPIs.
>>>>>>>> http://people.freebsd.org/~kib/misc/vm1.3.patch
>>>>>>>
>>>>>>> Hi,
>>>>>>> I was reviewing:
>>>>>>> http://people.freebsd.org/~kib/misc/vm1.11.patch
>>>>>>>
>>>>>>> and I think it is great. It is simple enough and I don't have further
>>>>>>> comments on it.
>>>> Thank you.
>>>>
>>>> This spoiled an announcement I intended to send this weekend :)
>>>>
>>>>>>>
>>>>>>> However, as a side note, I was wondering whether we could one day get
>>>>>>> to the point of integrating rangelocks into the vnode lockmgr directly.
>>>>>>> It would be a huge patch, likely rewriting the locking of several
>>>>>>> members of vnodes, but I think it would be worth it in terms of
>>>>>>> cleanliness of the interface and less overhead. Also, it would be
>>>>>>> interesting to consider merging the rangelock implementation with
>>>>>>> ZFS' one, at some point.
>>>>>>
>>>>>> My personal opinion about rangelocks and many other VFS features we
>>>>>> currently have is that they are a good idea in theory, but in practice
>>>>>> they tend to overcomplicate VFS.
>>>>>>
>>>>>> I'm of the opinion that we should move as much stuff as we can to
>>>>>> individual file systems. We try to implement everything in VFS itself
>>>>>> in the hope that this will simplify the file systems we have. It then
>>>>>> turns out only one file system is really using this stuff (most of the
>>>>>> time it is UFS) and this is a PITA for all the other file systems as
>>>>>> well as for maintaining VFS. VFS became so complicated over the years
>>>>>> that there are maybe a few people who can understand it, and every
>>>>>> single change to VFS is a huge risk of potentially breaking some
>>>>>> unrelated parts.
>>>>>
>>>>> I think this is questionable for the following reasons:
>>>>> - If the problem is filesystem writers having trouble understanding
>>>>> the necessary locking, we should really provide cleaner and more
>>>>> complete documentation. One could say the same about our VM subsystem,
>>>>> but at least in that case there are plenty of comments that help in
>>>>> understanding how to deal with vm_object and vm_page locking during
>>>>> their lifetimes.
>>>>> - Our primitives may be more complicated than the
>>>>> 'all-in-the-filesystem' ones, but at least they offer a complete and
>>>>> centralized view of the resources we have allocated in the whole
>>>>> system, and they allow building better policies about how to manage
>>>>> them. One problem I see here is that those policies are not fully
>>>>> implemented, tuned, or have simply become outdated, removing one of the
>>>>> highest benefits that we get by making vnodes so generic.
>>>>>
>>>>> About the thing I mentioned myself:
>>>>> - As long as the same path now has both range-locking and vnode
>>>>> locking, I don't see keeping them separated forever as a good idea.
>>>>> Merging them seems to me an important evolution, not only helping to
>>>>> shrink the number of primitives themselves but also introducing less
>>>>> overhead and likely revamped scalability for vnodes (but I think this
>>>>> needs a deep investigation).
>>>> The proper direction to move in there is to designate the vnode lock
>>>> for protection of the vnode structure, and have the range lock protect
>>>> i/o atomicity. This is somewhat done in the proposed patch (since the
>>>> vnode lock now does not protect the whole i/o operation, but only the
>>>> chunked i/o transactions inside the operation).
>>>>
>>>> Jeff's idea of using the page cache as the source of i/o data
>>>> (implemented in the VM6 patchset) pushes the idea much further. E.g.,
>>>> a write typically does not obtain the write vnode lock (but sometimes
>>>> it has to, to extend the vnode).
>>>>
>>>> Probably, I will revive VM6 after this change has landed.
>>>
>>> About that I guess we should be careful.
>>> The first thing would be having a very scalable VM subsystem, and
>>> recent benchmarks have shown that this is not yet the case (Florian,
>>> CC'ed, can share some pmc/LOCK_PROFILE analysis on pgsql that, even
>>> with the vmcontention patch, shows a lot of contention on the vm_object,
>>> pmap and vm_page_queue locks. We have plans for each of them; we can
>>> discuss them in a separate thread if you prefer). This is just to say
>>> that we may need more work in underlying areas to bring VM6 to the
>>> point where it will really make a difference.
>>
>> The benchmarks that were done at that time demonstrated that VM6 does
>> not cause regressions for e.g. buildworld time, and shows marginal
>> improvements, around 10%, for some postgresql loads.
>>
>> The main benefit of VM6 on UFS is that writers no longer block readers
>> for separate i/o ranges. Also, due to the vm_page flags locking
>> improvements, I suspect the VM6 backpressure code might be simplified
>> and give an even larger benefit right now.
>>
>> Anyway, I do not think that VM6 can be put into HEAD quickly, and I
>> want to finish with VM1/prefaulting right now.
>
> I was speaking about a different benchmark.
> Florian made a lock_profile/hwpmc analysis of stock + the vmcontention
> patch to verify where the biggest bottlenecks are.
> Of course, it turns out that the most contended locks are all the ones
> involved in VM, which is not surprising at all.
>
> He can share numbers and insights, I guess.

All I did until now was run PostgreSQL with 128 client threads with
lock_profiling [1] and hwpmc [2]. I haven't spent any time analyzing
this yet.

[1] http://people.freebsd.org/~flo/vmc-lock-profiling-postgres-128-20120208.txt
[2] http://people.freebsd.org/~flo/vmc-hwpmc-gprof-postgres-128-20120208.txt