From: Konstantin Belousov
To: Attilio Rao
Cc: arch@freebsd.org, Florian Smeets, Pawel Jakub Dawidek
Subject: Re: Prefaulting for i/o buffers
Date: Sun, 26 Feb 2012 16:13:34 +0200
Message-ID: <20120226141334.GU55074@deviant.kiev.zoral.com.ua>
References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua>
 <20120225151334.GH1344@garage.freebsd.pl>
 <20120225210339.GM55074@deviant.kiev.zoral.com.ua>
List-Id: Discussion related to FreeBSD architecture

On Sun, Feb 26, 2012 at 03:02:54PM +0100, Attilio Rao wrote:
> On 25 February 2012 22:03, Konstantin Belousov wrote:
> > On Sat, Feb 25, 2012 at 06:45:00PM +0100, Attilio Rao wrote:
> >> On 25 February 2012 16:13, Pawel Jakub Dawidek wrote:
> >> > On Sat, Feb 25, 2012 at 01:01:32PM +0000, Attilio Rao wrote:
> >> >> On 03 February 2012 19:37, Konstantin Belousov wrote:
> >> >> > The FreeBSD I/O infrastructure has a well-known issue with a
> >> >> > deadlock caused by vnode lock order reversal when buffers
> >> >> > supplied to the read(2) or write(2) syscalls are backed by an
> >> >> > mmaped file.
> >> >> >
> >> >> > I previously published the patches to convert the i/o path to
> >> >> > use VMIO, based on the Jeff Roberson proposal, see
> >> >> > http://wiki.freebsd.org/VM6. As a side effect, VM6 fixed the
> >> >> > deadlock. Since that work is very intrusive and did not get any
> >> >> > follow-up, it stalled.
> >> >> >
> >> >> > Below is a very lightweight patch whose only goal is to fix the
> >> >> > deadlock in the least intrusive way. This is possible after
> >> >> > FreeBSD got the vm_fault_quick_hold_pages(9) and
> >> >> > vm_fault_disable_pagefaults(9) KPIs.
> >> >> > http://people.freebsd.org/~kib/misc/vm1.3.patch
> >> >>
> >> >> Hi,
> >> >> I was reviewing:
> >> >> http://people.freebsd.org/~kib/misc/vm1.11.patch
> >> >>
> >> >> and I think it is great. It is simple enough and I don't have
> >> >> further comments on it.
> > Thank you.
> >
> > This spoiled an announcement I intended to send this weekend :)
> >
> >> >>
> >> >> However, as a side note, I was wondering whether we could one day
> >> >> get to the point of integrating rangelocks into the vnode lockmgr
> >> >> directly.
> >> >> It would be a huge patch, likely rewriting the locking of several
> >> >> members of vnodes, but I think it would be worth it in terms of
> >> >> cleanness of the interface and less overhead. Also, it would be
> >> >> interesting to consider merging the rangelock implementation with
> >> >> the ZFS one at some point.
> >> >
> >> > My personal opinion about rangelocks and many other VFS features we
> >> > currently have is that they are a good idea in theory, but in
> >> > practice they tend to overcomplicate VFS.
> >> >
> >> > I'm of the opinion that we should move as much stuff as we can to
> >> > individual file systems. We try to implement everything in VFS
> >> > itself in the hope that this will simplify the file systems we have.
> >> > It then turns out only one file system is really using this stuff
> >> > (most of the time it is UFS), and this is a PITA for all the other
> >> > file systems as well as for maintaining VFS. VFS has become so
> >> > complicated over the years that there are maybe a few people who
> >> > can understand it, and every single change to VFS carries a huge
> >> > risk of breaking some unrelated parts.
> >>
> >> I think this is questionable in the following aspects:
> >> - If the problem is filesystem writers having trouble
> >> understanding the necessary locking, we should really provide cleaner
> >> and more complete documentation. One would think the same of our VM
> >> subsystem, but at least in that case there are plenty of comments that
> >> help in understanding how to deal with vm_object and vm_page locking
> >> during their lifetimes.
> >> - Our primitives may be more complicated than the
> >> 'all-in-the-filesystem' ones, but at least they offer a complete and
> >> centralized view of the resources we have allocated in the whole
> >> system, and they allow building better policies for how to manage
> >> them. One problem I see here is that those policies are not fully
> >> implemented or tuned, or have simply become outdated, removing one of
> >> the biggest benefits we get from making vnodes so generic.
> >>
> >> About the thing I mentioned myself:
> >> - As long as the same path now has both range locking and vnode
> >> locking, I don't see it as a good idea to keep both separated forever.
> >> Merging them seems to me an important evolution, not only helping to
> >> shrink the number of primitives themselves but also introducing
> >> less overhead and likely revamped scalability for vnodes (but I think
> >> this needs a deep investigation).
> > The proper direction to move there is to designate the vnode lock for
> > the vnode structure protection, and have the range lock protect the
> > i/o atomicity. This is somewhat done in the proposed patch (since
> > now the vnode lock does not protect the i/o operation, but only the
> > chunked i/o transactions inside the operation).
> >
> > Jeff's idea of using the page cache as the source of i/o data
> > (implemented in the VM6 patchset) pushes the idea much further. E.g.,
> > the write typically does not obtain the write vnode lock (but
> > sometimes it has to, in order to extend the vnode).
> >
> > Probably, I will revive VM6 after this change has landed.
>
> About that, I guess we should be careful.
> The first thing would be having a very scalable VM subsystem, and
> recent benchmarks have shown that this is not yet the case (Florian,
> CC'ed, can share some pmc/LOCK_PROFILE analysis on pgsql that, even
> with the vmcontention patch, shows a lot of contention on vm_object,
> the pmap lock and vm_page_queue_lock.
> We have some plans for each of them; we will discuss them in a
> separate thread if you prefer). This is just to say that we may need
> more work in underground areas to bring VM6 to the point where it will
> really make a difference.

The benchmarks that were done at that time demonstrated that VM6 does
not cause regressions for e.g. buildworld time, and gives a marginal
improvement, around 10%, for some postgresql loads. The main benefit of
VM6 on UFS is that writers no longer block readers for separate i/o
ranges. Also, due to the vm_page flags locking improvements, I suspect
the VM6 backpressure code might be simplified and give an even larger
benefit right now.

Anyway, I do not think that VM6 can be put into HEAD quickly, and I
want to finish with VM1/prefaulting right now.
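
For illustration only, the point that writers need not block readers on
separate i/o ranges can be sketched in user space. This is not code from
vm1.11.patch or VM6, and it is not the kernel's rangelock interface: the
names rl_mode, rl_req, rl_conflicts and rl_can_grant are hypothetical,
and byte ranges are assumed to be half-open [start, end). The sketch
shows only the conflict rule; a real range lock would also need a wait
queue and its own internal locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types for illustration; not the FreeBSD rangelock KPI. */
enum rl_mode { RL_READ, RL_WRITE };

struct rl_req {
	long long start;	/* first byte covered by the request */
	long long end;		/* one past the last byte covered */
	enum rl_mode mode;
};

/*
 * Two requests conflict only if their byte ranges overlap and at least
 * one of them is a write. Disjoint ranges never conflict, and
 * overlapping reads can share the range.
 */
static bool
rl_conflicts(const struct rl_req *a, const struct rl_req *b)
{
	assert(a->start < a->end && b->start < b->end);
	if (a->end <= b->start || b->end <= a->start)
		return (false);		/* ranges do not overlap */
	return (a->mode == RL_WRITE || b->mode == RL_WRITE);
}

/*
 * Return true if req could be granted immediately given the ranges
 * currently held on the same file.
 */
static bool
rl_can_grant(const struct rl_req *held, size_t nheld,
    const struct rl_req *req)
{
	for (size_t i = 0; i < nheld; i++)
		if (rl_conflicts(&held[i], req))
			return (false);
	return (true);
}
```

With a writer holding bytes [0, 4096), a reader asking for
[8192, 12288) would be granted at once under this rule, while a reader
asking for [1024, 2048) would have to wait. A single exclusive lock
covering the whole i/o operation would make both readers wait.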